Thursday, October 15, 2009

The Kernel Newbie Corner: "initrd" and "initramfs"--Some Unfinished Business

Since a few people seemed interested, I'm going to expand on last week's column on initramfs and initrd: summarize what we know so far, tie off a few loose ends, and throw in a little more information at no extra charge.

So What Do We Know So Far?
Let's recap what we discussed last week, just to set the stage:
  • Both the initramfs and initrd features are examples of what are called "early userspace," which gives us the opportunity to create minimal root filesystems with enough kernel modules to allow the boot process to continue, to the point where the kernel can eventually mount the real root filesystem.
  • To clarify the terminology, I use the phrase "initramfs" to describe the root filesystem that is internal to the kernel, while "initrd" is used to refer to the external root filesystem represented by the "initrd" files in the /boot directory, and which is passed to the booting kernel by the bootloader.
  • The initrd files used to be filesystem images, so they had to be mounted before you could examine their contents. These days, they're almost always gzipped cpio-format archives so, with the right permissions, you can uncompress them and examine their contents more conveniently.
  • The internal initramfs file is created during the kernel configuration and build process and can be found in the generated file usr/initramfs_data.cpio in the kernel source tree, so you're welcome to use cpio to examine the contents of that file as well.
  • If you choose not to define the exact contents of your initramfs file when configuring your kernel, it will be populated by the script scripts/gen_initramfs_list.sh, where you can see your default initramfs will contain next to nothing: the directories /dev and /root and the special device file /dev/console.
  • Finally, whether you even have kernel support for initrd and initramfs is based on whether you select the kernel config option BLK_DEV_INITRD, defined in the file init/Kconfig. Chances are, unless you're working in a restricted (perhaps embedded) environment, you'll have that support selected.
  • Much of the above is documented in the kernel source file Documentation/filesystems/ramfs-rootfs-initramfs.txt.
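
To make the cpio point in the recap a little more concrete, here is a minimal sketch (user-space C, not kernel code) of parsing a single entry header in the "newc" cpio layout, which is the layout the kernel documentation describes for initramfs buffers. It assumes the archive has already been uncompressed into memory, and the names cpio_entry, hex8() and cpio_parse_header() are made up for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A "newc" header is a 6-byte magic followed by 13 fields of 8 hex digits. */
#define CPIO_NEWC_HDR_LEN 110

/* Hypothetical structure holding the few fields this sketch cares about. */
struct cpio_entry {
    unsigned long mode;      /* file type and permission bits */
    unsigned long filesize;  /* size of the file data that follows the name */
    unsigned long namesize;  /* length of the name, including the trailing NUL */
    const char *name;        /* points into the caller's buffer */
};

/* Convert one 8-character hexadecimal field to a number. */
static unsigned long hex8(const char *p)
{
    char buf[9];

    memcpy(buf, p, 8);
    buf[8] = '\0';
    return strtoul(buf, NULL, 16);
}

/*
 * Parse the entry header at the start of buf.
 * Returns 0 on success, -1 if the buffer is too short, the magic is not
 * "070701", or the name is not NUL-terminated inside the buffer.
 */
int cpio_parse_header(const char *buf, size_t len, struct cpio_entry *out)
{
    unsigned long namesize;

    if (len < CPIO_NEWC_HDR_LEN || memcmp(buf, "070701", 6) != 0)
        return -1;

    /* Field order: ino, mode, uid, gid, nlink, mtime, filesize, ... */
    namesize = hex8(buf + 6 + 11 * 8);
    if (namesize == 0 || len - CPIO_NEWC_HDR_LEN < namesize)
        return -1;
    if (buf[CPIO_NEWC_HDR_LEN + namesize - 1] != '\0')
        return -1;

    out->mode     = hex8(buf + 6 + 1 * 8);
    out->filesize = hex8(buf + 6 + 6 * 8);
    out->namesize = namesize;
    out->name     = buf + CPIO_NEWC_HDR_LEN;
    return 0;
}
```

A real reader would go on to skip the padding after the name and the file data (both are padded to a multiple of four bytes) and stop at the entry named TRAILER!!!; this sketch deliberately stops at the first header.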

All of the above is just a recap from last week, so where do we go from here?

The Kernel Source Involved in Early Userspace

Again, as a bit of a recap from last week, let's review what part of the kernel source is involved in any use of early userspace and initramfs.

The top-level usr/ kernel source directory is what supports early userspace. If you choose to not select support for early userspace, that entire directory becomes redundant since nothing will include it.
And backing up one level, it's the top-level init directory that represents the early phase of the boot process and will, based on your kernel configuration, compile in the appropriate source files. You can see what will and what won't get included in the Makefile in that directory:
obj-y                          := main.o version.o mounts.o
ifneq ($(CONFIG_BLK_DEV_INITRD),y)
obj-y                          += noinitramfs.o
else
obj-$(CONFIG_BLK_DEV_INITRD)   += initramfs.o
endif
obj-$(CONFIG_GENERIC_CALIBRATE_DELAY) += calibrate.o

mounts-y                        := do_mounts.o
mounts-$(CONFIG_BLK_DEV_RAM)    += do_mounts_rd.o
mounts-$(CONFIG_BLK_DEV_INITRD) += do_mounts_initrd.o
mounts-$(CONFIG_BLK_DEV_MD)     += do_mounts_md.o

As you can see, depending on whether you select initrd/initramfs support, one of two source files will get compiled.

And, more specifically, depending on whether you select the kernel config option BLK_DEV_RAM, you'll get support for a filesystem-format initrd.

If you don't, you're restricted to a cpio-format initrd image, but it's typical to have that option configured in so it's unlikely that will ever be an issue.

And now, on to a few new concepts.

How Does the Kernel Learn About the initrd Image?
Ignoring the internal initramfs image for now, recall that the initrd image exists external to the kernel, and has to be passed to the kernel by the (GRUB?) bootloader.

And how does that work?

Assuming that we're working with the x86_64 architecture, consider the following snippets from the header file arch/x86/include/asm/bootparam.h, which show how the GRUB bootloader populates a well-known structure with the location of the initrd image you would have specified in the GRUB configuration file:
/* The so-called "zeropage" */
struct boot_params {
        ... snip ...
        struct setup_header hdr;    /* setup header */  /* 0x1f1 */
        ... snip ...
};
...
struct setup_header {
        ... snip ...
        __u32   ramdisk_image;    <-- there ...
        __u32   ramdisk_size;     <-- ... and there
        ... snip ...
};

And there you have it -- when the kernel goes looking for a possible initrd image, it's going to use an expression resembling boot_params.hdr.ramdisk_image with corresponding size boot_params.hdr.ramdisk_size. Or something like that.
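
If you'd like to see what "a well-known structure" means in terms of raw bytes, here is a small user-space sketch that pulls the same fields out of a copy of the zero page. The byte offsets (0x210 for type_of_loader, 0x218 for ramdisk_image, 0x21c for ramdisk_size) are the ones listed in the x86 boot protocol documentation; the names initrd_info, read_le32() and find_initrd() are hypothetical, and where you would get a copy of the zero page from is left as an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets into the zero page, per the x86 boot protocol documentation. */
#define OFF_TYPE_OF_LOADER 0x210
#define OFF_RAMDISK_IMAGE  0x218
#define OFF_RAMDISK_SIZE   0x21c

/* Hypothetical result type for this sketch. */
struct initrd_info {
    uint32_t image;  /* physical load address of the initrd */
    uint32_t size;   /* size of the initrd in bytes */
};

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/*
 * Returns 1 and fills *out if the buffer describes an initrd,
 * 0 otherwise (buffer too short, no bootloader ID, or no image/size).
 */
int find_initrd(const uint8_t *zeropage, size_t len, struct initrd_info *out)
{
    uint32_t image, size;

    if (len < OFF_RAMDISK_SIZE + 4)
        return 0;
    if (zeropage[OFF_TYPE_OF_LOADER] == 0)
        return 0;

    image = read_le32(zeropage + OFF_RAMDISK_IMAGE);
    size  = read_le32(zeropage + OFF_RAMDISK_SIZE);
    if (image == 0 || size == 0)
        return 0;

    out->image = image;
    out->size  = size;
    return 1;
}
```

The three conditions checked here are the same three that the kernel's own code tests before it believes an initrd is present.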

And where should we find those tests? If you're feeling ambitious, you can check out the source files arch/x86/kernel/setup.c and arch/x86/kernel/head64.c. In that first file, you'll find the following self-explanatory routine:
#ifdef CONFIG_BLK_DEV_INITRD
... snip ...
static void __init reserve_initrd(void)
{
        u64 ramdisk_image = boot_params.hdr.ramdisk_image;
        u64 ramdisk_size  = boot_params.hdr.ramdisk_size;
        u64 ramdisk_end   = ramdisk_image + ramdisk_size;
        u64 end_of_lowmem = max_low_pfn_mapped << PAGE_SHIFT;

        if (!boot_params.hdr.type_of_loader ||
            !ramdisk_image || !ramdisk_size)
                return;         /* No initrd provided by bootloader */

        initrd_start = 0;

        if (ramdisk_size >= (end_of_lowmem>>1)) {
                free_early(ramdisk_image, ramdisk_end);
                printk(KERN_ERR "initrd too large to handle, "
                       "disabling initrd\n");
                return;
        }

        printk(KERN_INFO "RAMDISK: %08llx - %08llx\n", ramdisk_image,
                        ramdisk_end);

        if (ramdisk_end <= end_of_lowmem) {
                /* All in lowmem, easy case */
                /*
                 * don't need to reserve again, already reserved early
                 * in i386_start_kernel
                 */
                initrd_start = ramdisk_image + PAGE_OFFSET;
                initrd_end = initrd_start + ramdisk_size;
                return;
        }

        relocate_initrd();

        free_early(ramdisk_image, ramdisk_end);
}
#else
static void __init reserve_initrd(void)
{
}
#endif /* CONFIG_BLK_DEV_INITRD */

and in the second file, there's:
#ifdef CONFIG_BLK_DEV_INITRD
    /* Reserve INITRD */
    if (boot_params.hdr.type_of_loader && boot_params.hdr.ramdisk_image) {
        unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
        unsigned long ramdisk_size  = boot_params.hdr.ramdisk_size;
        unsigned long ramdisk_end   = ramdisk_image + ramdisk_size;
        reserve_early(ramdisk_image, ramdisk_end, "RAMDISK");
    }
#endif

You're not expected to understand exactly what's happening above, but it should be clear that this code checks for an initrd image that's been passed by the bootloader and, if one appears to be there, reserves its memory and puts the appropriate address information into the variables initrd_start and initrd_end for later use. And you'll see shortly who's testing those variables.
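
If the branches in reserve_initrd() are hard to keep straight, the following user-space sketch boils them down to a pure function you can poke at. It is only a model of the tests visible in the listing above: the enum and the function name classify_initrd() are invented for this example, and nothing is actually reserved, relocated or freed.

```c
#include <stdint.h>

/* Hypothetical labels for the four ways reserve_initrd() can go. */
enum initrd_placement {
    INITRD_NONE,              /* bootloader supplied nothing usable */
    INITRD_TOO_LARGE,         /* image is at least half of low memory */
    INITRD_IN_LOWMEM,         /* the easy case: use it where it is */
    INITRD_NEEDS_RELOCATION   /* image extends past the end of low memory */
};

/* Mirrors the order of the checks in the kernel routine shown above. */
enum initrd_placement classify_initrd(uint8_t type_of_loader,
                                      uint64_t ramdisk_image,
                                      uint64_t ramdisk_size,
                                      uint64_t end_of_lowmem)
{
    uint64_t ramdisk_end = ramdisk_image + ramdisk_size;

    if (!type_of_loader || !ramdisk_image || !ramdisk_size)
        return INITRD_NONE;
    if (ramdisk_size >= (end_of_lowmem >> 1))
        return INITRD_TOO_LARGE;
    if (ramdisk_end <= end_of_lowmem)
        return INITRD_IN_LOWMEM;
    return INITRD_NEEDS_RELOCATION;
}
```

Only the INITRD_IN_LOWMEM case sets initrd_start and initrd_end directly in the listing; the relocation case hands that job to relocate_initrd().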

And How Does the Kernel Locate the initramfs?
How the kernel keeps track of its own initramfs image is actually considerably simpler, since that image will be part of the kernel binary itself. You have a number of choices of which compression technique you want to use on your cpio-format initramfs image, as defined in the file usr/Kconfig:
config RD_GZIP
        bool "Support initial ramdisks compressed using gzip" if EMBEDDED
        default y
        depends on BLK_DEV_INITRD
        select DECOMPRESS_GZIP
        help
          Support loading of a gzip encoded initial ramdisk or cpio buffer.
          If unsure, say Y.

In other words, unless you explicitly choose a different compression technique, you'll get a gzipped initramfs image, which will be embedded in the final kernel image as defined by the section information in the file usr/initramfs_data.gz.S:
.section .init.ramfs,"a"
.incbin "usr/initramfs_data.cpio.gz"

So, based on that linker information, a gzipped form of the initramfs file will be associated with the section name ".init.ramfs".

Whereupon the final piece of the puzzle can be found in the file include/asm-generic/vmlinux.lds.h, where you can read the snippet:

#ifdef CONFIG_BLK_DEV_INITRD
#define INIT_RAM_FS                                                     \
        . = ALIGN(PAGE_SIZE);                                           \
        VMLINUX_SYMBOL(__initramfs_start) = .;                          \
        *(.init.ramfs)                                                  \
        VMLINUX_SYMBOL(__initramfs_end) = .;
#else
#define INIT_RAM_FS
#endif

For readers unfamiliar with linker directives, the above defines the kernel symbols __initramfs_start and __initramfs_end to represent the addresses on either side of all sections with the name ".init.ramfs".

Which is precisely where our gzipped initramfs was defined to go. So we now have two kernel-space symbols that define the beginning and end of our embedded initramfs.
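
You can try the same start/end symbol trick in user space. The sketch below is not how the kernel does it (the kernel uses its own linker script, as shown above); instead it leans on the fact that GNU ld and compatible ELF linkers automatically provide __start_NAME and __stop_NAME symbols for any section whose name is a valid C identifier. It assumes GCC or Clang on an ELF platform such as Linux, and the section name demo_ramfs and the function names are made up.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * A stand-in "archive" placed in its own section. The "used" attribute
 * stops the compiler from discarding it even though nothing refers to
 * it by name.
 */
static const char demo_image[]
    __attribute__((section("demo_ramfs"), used)) =
    "pretend this is a cpio archive";

/* Provided by the linker, one on either side of the demo_ramfs section. */
extern const char __start_demo_ramfs[];
extern const char __stop_demo_ramfs[];

/* Start address of the embedded image, found only via the linker symbol. */
const char *demo_image_start(void)
{
    return __start_demo_ramfs;
}

/* Size of the embedded image: the distance between the two symbols. */
size_t demo_image_size(void)
{
    return (size_t)((uintptr_t)__stop_demo_ramfs -
                    (uintptr_t)__start_demo_ramfs);
}
```

This is the same arithmetic the kernel performs later with __initramfs_end - __initramfs_start: the code that consumes the image never needs to know its name or size at compile time.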

And what do we do with that? I'm glad you asked.

Early Userspace Processing
At this point, we can see how to find the potential initrd and initramfs images. So let's see where that processing occurs.

In the source file init/main.c, consider the very early boot code:
static int __init kernel_init(void * unused)
{
        lock_kernel();

        /*
         * init can allocate pages on any node
         */
        set_mems_allowed(node_possible_map);
        /*
         * init can run on any cpu.
         */
        set_cpus_allowed_ptr(current, cpu_all_mask);
        /*
         * Tell the world that we're going to be the grim
         * reaper of innocent orphaned children.
         *
         * We don't want people to have to make incorrect
         * assumptions about where in the task array this
         * can be found.
         */
        init_pid_ns.child_reaper = current;

        cad_pid = task_pid(current);

        smp_prepare_cpus(setup_max_cpus);

        do_pre_smp_initcalls();
        start_boot_trace();

        smp_init();
        sched_init_smp();

        do_basic_setup();     <-- there, that one

So let's follow the flow of control to do_basic_setup() in that same source file, where we read:
/*
 * Ok, the machine is now initialized. None of the devices
 * have been touched yet, but the CPU subsystem is up and
 * running, and memory and process management works.
 *
 * Now we can finally start doing some real work..
 */
static void __init do_basic_setup(void)
{
        rcu_init_sched(); /* needed by module_init stage. */
        init_workqueues();
        cpuset_init_smp();
        usermodehelper_init();
        init_tmpfs();
        driver_init();
        init_irq_proc();
        do_ctors();
        do_initcalls();    <-- that one
}

Let's keep going, where we find that the routine do_initcalls() is responsible for invoking a number of "initcalls" that have been registered all over the kernel, and which are run in a very specific order as defined in include/linux/init.h:
#define pure_initcall(fn)               __define_initcall("0",fn,0)

#define core_initcall(fn)               __define_initcall("1",fn,1)
#define core_initcall_sync(fn)          __define_initcall("1s",fn,1s)
#define postcore_initcall(fn)           __define_initcall("2",fn,2)
#define postcore_initcall_sync(fn)      __define_initcall("2s",fn,2s)
#define arch_initcall(fn)               __define_initcall("3",fn,3)
#define arch_initcall_sync(fn)          __define_initcall("3s",fn,3s)
#define subsys_initcall(fn)             __define_initcall("4",fn,4)
#define subsys_initcall_sync(fn)        __define_initcall("4s",fn,4s)
#define fs_initcall(fn)                 __define_initcall("5",fn,5)
#define fs_initcall_sync(fn)            __define_initcall("5s",fn,5s)
#define rootfs_initcall(fn)             __define_initcall("rootfs",fn,rootfs)
#define device_initcall(fn)             __define_initcall("6",fn,6)
#define device_initcall_sync(fn)        __define_initcall("6s",fn,6s)
#define late_initcall(fn)               __define_initcall("7",fn,7)
#define late_initcall_sync(fn)          __define_initcall("7s",fn,7s)

Even if you're not familiar with kernel initcalls, it should be clear that the above defines the numerical order in which many, many initcalls are going to be run, and it's that one right in the middle that we care about:
#define rootfs_initcall(fn)    __define_initcall("rootfs",fn,rootfs)
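
If the idea of levelled initcalls is new to you, here is a toy user-space model of it. To be clear, this is not how the kernel implements initcalls (it collects function pointers into linker sections rather than a runtime table); it is only a sketch of the ordering guarantee, and every name in it is invented for the example.

```c
#include <stddef.h>

/* Simplified levels, in the same order as the macros listed above. */
enum initcall_level {
    LEVEL_PURE, LEVEL_CORE, LEVEL_POSTCORE, LEVEL_ARCH, LEVEL_SUBSYS,
    LEVEL_FS, LEVEL_ROOTFS, LEVEL_DEVICE, LEVEL_LATE, LEVEL_COUNT
};

typedef int (*initcall_fn)(void);

struct initcall_entry {
    enum initcall_level level;
    initcall_fn fn;
};

#define MAX_INITCALLS 16
static struct initcall_entry table[MAX_INITCALLS];
static size_t table_len;

/* Records which demo initcalls ran, one character per call, in order. */
static char trace[MAX_INITCALLS + 1];
static size_t trace_len;

static void record(char tag)
{
    if (trace_len < MAX_INITCALLS) {
        trace[trace_len++] = tag;
        trace[trace_len] = '\0';
    }
}

static int demo_fs_init(void)     { record('f'); return 0; }
static int demo_rootfs_init(void) { record('r'); return 0; }
static int demo_device_init(void) { record('d'); return 0; }

/* Add a function to the table at the given level; -1 if the table is full. */
static int register_initcall(enum initcall_level level, initcall_fn fn)
{
    if (table_len == MAX_INITCALLS)
        return -1;
    table[table_len].level = level;
    table[table_len].fn = fn;
    table_len++;
    return 0;
}

/* Run everything level by level, regardless of registration order. */
static void run_initcalls(void)
{
    for (int level = 0; level < LEVEL_COUNT; level++)
        for (size_t i = 0; i < table_len; i++)
            if (table[i].level == (enum initcall_level)level)
                table[i].fn();
}

/* Register three calls out of order, run them, and return the trace. */
const char *demo_run(void)
{
    table_len = 0;
    trace_len = 0;
    trace[0] = '\0';
    register_initcall(LEVEL_DEVICE, demo_device_init);
    register_initcall(LEVEL_ROOTFS, demo_rootfs_init);
    register_initcall(LEVEL_FS, demo_fs_init);
    run_initcalls();
    return trace;
}
```

Even though the device-level call is registered first, the rootfs-level call runs after the filesystem-level one and before the device-level one, which is the property that matters for early userspace.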

From our perspective, there is only one routine that is defined as having that initcall level, and it's in the source file init/initramfs.c. Given that it's the critical part of this whole process, I'll include it in its entirety:
static int __init populate_rootfs(void)
{
    char *err = unpack_to_rootfs(__initramfs_start,
                     __initramfs_end - __initramfs_start);
    if (err)
        panic(err);     /* Failed to decompress INTERNAL initramfs */
    if (initrd_start) {
#ifdef CONFIG_BLK_DEV_RAM
        int fd;
        printk(KERN_INFO "Trying to unpack rootfs image as initramfs...\n");
        err = unpack_to_rootfs((char *)initrd_start,
            initrd_end - initrd_start);
        if (!err) {
            free_initrd();
            return 0;
        } else {
            clean_rootfs();
            unpack_to_rootfs(__initramfs_start,
                 __initramfs_end - __initramfs_start);
        }
        printk(KERN_INFO "rootfs image is not initramfs (%s)"
                        "; looks like an initrd\n", err);
        fd = sys_open("/initrd.image", O_WRONLY|O_CREAT, 0700);
        if (fd >= 0) {
            sys_write(fd, (char *)initrd_start,
                          initrd_end - initrd_start);
            sys_close(fd);
            free_initrd();
        }
#else
        printk(KERN_INFO "Unpacking initramfs...\n");
        err = unpack_to_rootfs((char *)initrd_start,
            initrd_end - initrd_start);
        if (err)
            printk(KERN_EMERG "Initramfs unpacking failed: %s\n", err);
        free_initrd();
#endif
    }
    return 0;
}
rootfs_initcall(populate_rootfs);

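So we can finally see how the kernel tries to populate an early userspace root filesystem:
  • First, try to unpack (by calling the routine unpack_to_rootfs()) the internal initramfs image. If that fails, just panic and bail.
  • Then, if the bootloader passed in an external initrd (that is, initrd_start is non-zero), try to unpack that as a cpio archive on top of what's already there. If that works, we're done. If it doesn't and you have BLK_DEV_RAM support, assume it's an old-style filesystem-format initrd: clean up, unpack the internal initramfs again, and save the image as /initrd.image for later.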
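
Reading populate_rootfs() as listed above, its branches can be summarized in a small user-space model. This is only a sketch of the control flow: the enum, the function name populate_rootfs_model() and the boolean-style parameters are all invented here, and no unpacking actually happens.

```c
/* Hypothetical labels for the ways populate_rootfs() can end. */
enum rootfs_outcome {
    ROOTFS_PANIC,               /* internal initramfs failed to unpack */
    ROOTFS_INTERNAL_ONLY,       /* no external initrd was supplied */
    ROOTFS_INITRD_UNPACKED,     /* external image was a cpio archive */
    ROOTFS_INITRD_SAVED,        /* not cpio: saved as /initrd.image */
    ROOTFS_INITRD_FAILED        /* not cpio, and no CONFIG_BLK_DEV_RAM */
};

/*
 * internal_ok:    the internal initramfs unpacked without error
 * have_initrd:    initrd_start is non-zero
 * initrd_is_cpio: unpack_to_rootfs() succeeds on the external image
 * blk_dev_ram:    kernel was built with CONFIG_BLK_DEV_RAM
 */
enum rootfs_outcome populate_rootfs_model(int internal_ok, int have_initrd,
                                          int initrd_is_cpio, int blk_dev_ram)
{
    if (!internal_ok)
        return ROOTFS_PANIC;
    if (!have_initrd)
        return ROOTFS_INTERNAL_ONLY;
    if (initrd_is_cpio)
        return ROOTFS_INITRD_UNPACKED;
    return blk_dev_ram ? ROOTFS_INITRD_SAVED : ROOTFS_INITRD_FAILED;
}
```

Note that in the ROOTFS_INITRD_UNPACKED case the external archive is unpacked into the same rootfs that already holds the internal image's contents.
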
I could go on, but I presume you get the idea, and if I tried to walk through the rest of that code, we'd be here for a while. But let's complete the picture.

Recall, from back in init/main.c, that we got here from calling the routine do_basic_setup() from within the kernel_init() function. And when that flow of control returns, what happens? Oh, look:
do_basic_setup();

    /*
     * check if there is an early userspace init.  If yes, let it do all
     * the work
     */

     if (!ramdisk_execute_command)
         ramdisk_execute_command = "/init";

     if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
         ramdisk_execute_command = NULL;
         prepare_namespace();
     }

Note how, once we've populated our early userspace filesystem, we check whether it contains the appropriate first executable file (by default, "/init" but selectable with the "rdinit=" kernel command line option). If it does, the kernel will simply run it later on; if it doesn't, the kernel calls prepare_namespace() to go and mount the real root filesystem instead.
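
The same "default to /init, then test whether it's there" logic is easy to mimic in user space. The sketch below uses the POSIX access() call in place of the kernel's sys_access(), and the function name pick_early_init() is hypothetical. A NULL return stands in for the kernel falling back to prepare_namespace().

```c
#include <stddef.h>
#include <unistd.h>

/*
 * rdinit: an explicitly requested early init path, or NULL for the default.
 * Returns the path that would be run, or NULL if it does not exist.
 */
const char *pick_early_init(const char *rdinit)
{
    const char *cmd = rdinit ? rdinit : "/init";

    /* F_OK only asks whether the path exists, like the mode 0 used above. */
    if (access(cmd, F_OK) != 0)
        return NULL;
    return cmd;
}
```

Calling pick_early_init(NULL) on an ordinary running system will most likely return NULL, since a normal root filesystem rarely has an /init at the top level.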

Is There More?
Oh, yes, but you can see that just providing a basic overview of early userspace processing can get quite involved. Maybe I'll have to turn this into a complete tutorial some day. You know, in my copious free time.
