Friday, January 1, 2010

Improving MetaData Performance of the Ext4 Journaling Device

There is always a relentless pursuit of more performance from our storage systems. This includes more performance from hardware (faster disks, SSDs), networks (bigger pipes, larger MTUs), operating systems (caching, IO schedulers), and file systems.

There are many levers that can be pulled to improve performance, but this article will look at one particular piece - the file system journal device.

In particular, the metadata performance of ext4 will be considered as the journal is moved to different devices.


Journaling for File Systems
Sometimes bad things such as power failures happen to systems. Power interruptions or failures can cause a file system to become corrupt very quickly because an IO operation is interrupted and not completed.

Consequently, the file system has to be checked with fsck, which means the entire file system has to be walked to find and correct any problems.

As file systems have grown, the amount of time it takes to walk them has greatly increased. For example, the author remembers performing an fsck on a 1TB file system in 2002-2003 that took several days. Having a system down for this amount of time is very painful.

One way to help improve fsck times is to use a journaled file system. Rather than IO operations happening directly to the file system, the operations are added to the journal (typically a log) in the order they are supposed to happen.

Then the file system grabs the operation from the head of the journal and completes it, erasing the operation from the journal only after the operation is finished and the file system is satisfied that the operation is complete.

If the power is lost during the operation on a journaled file system, when the system comes back up, the journal is just “replayed,” i.e. the operations in the journal are performed one at a time starting at the beginning.

This means that the entire file system doesn’t necessarily have to be checked (walked). The primary reason this can be done is that the interruption happens before the operation is removed from the journal.

Even if the operation wasn’t completed on the file system, replaying the operation ensures that the IO operation actually occurs.

If the interruption happened while the operation was being deleted from the journal, the file system can assume that the operation happened and it just deletes the “corrupted” operation from the head of the journal.

As a result, you should not have to walk the entire file system to repair problems. Only the journal needs to be replayed.

This means that instead of spending a couple of days waiting for an fsck to finish, a very fast replay of the journal is performed taking just minutes.
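The log-then-apply-then-erase cycle described above can be sketched with ordinary shell commands. This is only a toy illustration (the paths and the one-line "operations" are stand-ins, not how ext4 actually stores its journal):

```shell
# Toy model of a journal: record the operation, apply it, then erase it.
JOURNAL=$(mktemp)       # stand-in for the journal device
FS=$(mktemp -d)         # stand-in for the file system

echo "mkdir $FS/dir1" >> "$JOURNAL"    # 1. append the operation to the journal
sh -c "$(head -n 1 "$JOURNAL")"        # 2. perform the operation at the head
sed -i '1d' "$JOURNAL"                 # 3. erase it only after it completed
```

If a crash lands between steps 1 and 3, recovery is simply replaying whatever lines remain in the journal, one at a time, which is exactly why the whole file system does not need to be walked.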

The journal can theoretically reside anywhere within the system on any device. It can be on the drive containing the file system, or it can use a partition on another drive or any other block device you have lying around.

But choosing the “best device” is important. The journal is critical to the integrity of the file system, so it should live on a device with some resiliency (resiliency in this case meaning the ability to tolerate errors or problems).

At the same time, everyone loves more performance (there is likely no one who has said, “you know, I want my storage to go slower.”).

Since the performance of the journal can be key to the performance of the file system, perhaps improving the performance of the journaling device and the journal itself can help overall file system performance.


Testing the Metadata Performance
In this article three options for the journal device will be tested to determine the impact of journal device location on the metadata performance of ext4.

The three device options are:
  • Journal on the same disk as the file system
  • Journal on a different disk from the file system
  • Journal on a ram disk
The last option, using a ramdisk for the journal, is designed to measure the pinnacle of performance. But it is not likely to be the most resilient solution (it would be better to use a battery-backed ramdisk with the ability to dump its contents to a storage device, drive or SSD).

However, it is included as an “upper bound” on performance.

One of the ways that journal performance can impact overall file system performance is in metadata performance.

This article will focus on metadata performance as measured by fdtree. This benchmark has been used before to examine the metadata performance of various Linux file systems.

To read about fdtree and how it was used for benchmarking, please read the original article.

As a quick recap, the benchmark, fdtree, is a simple bash script that performs four different metadata tests:
  • Directory creation
  • File creation
  • File removal
  • Directory Removal
It creates a specified number of files of a given size (in blocks) in a top-level directory. Then it creates a specified number of sub-directories and then in turn sub-directories are recursively created up to a specified number of levels and are populated with files.

Fdtree was used in 4 different approaches to stressing the metadata capability:
  • Small files (4 KiB)
    • Shallow directory structure
    • Deep directory structure
  • Larger files (4 MiB)
    • Shallow directory structure
    • Deep directory structure
The two file sizes, 4 KiB (1 block) and 4 MiB (1,000 blocks) were used to get some feel for a range of performance as a function of the amount of data.

The two directory structures were used to stress the metadata in different ways to discover if there is any impact on the metadata performance.

The shallow directory structure means that there are many directories but not very many levels down. The deep directory structure means that there are not many directories at a particular level but that there are many levels.

The command lines for the four combinations are:
Small Files - Shallow Directory Structure
./fdtree.bash -d 20 -f 40 -s 1 -l 3

This command creates 20 sub-directories from each upper level directory at each level ("-d 20") and there are 3 levels ("-l 3"). It’s a basic tree structure.

This is a total of 8,421 directories. In each directory there are 40 files ("-f 40"), each sized at 1 block (4 KiB) denoted by "-s 1". This is a total of 336,840 files and 1,347,360 KiB total data.


Small Files - Deep Directory Structure
./fdtree.bash -d 3 -f 4 -s 1 -l 10

This command creates 3 sub-directories from each upper level directory at each level ("-d 3") and there are 10 levels ("-l 10").

This is a total of 88,573 directories. In each directory there are 4 files each sized at 1 block (4 KiB). This is a total of 354,292 files and 1,417,168 KiB total data.


Medium Files - Shallow Directory Structure
./fdtree.bash -d 17 -f 10 -s 1000 -l 2

This command creates 17 sub-directories from each upper level directory at each level ("-d 17") and there are 2 levels ("-l 2").

This is a total of 307 directories. In each directory there are 10 files each sized at 1,000 blocks (4 MiB). This is a total of 3,070 files and 12,280,000 KiB total data.


Medium Files - Deep Directory Structure
./fdtree.bash -d 2 -f 2 -s 1000 -l 10

This command creates 2 sub-directories from each upper level directory at each level ("-d 2") and there are 10 levels ("-l 10").

This is a total of 2,047 directories. In each directory there are 2 files each sized at 1,000 blocks (4 MiB).

This is a total of 4,094 files and 16,376,000 KiB total data.
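The directory totals quoted for the four combinations follow directly from the tree shape: with d sub-directories per directory and l levels, the directory count is the geometric series 1 + d + d² + … + dˡ, and the file count is that total times the -f value. A small shell helper confirms the figures above:

```shell
# Directory count for an fdtree run: 1 + d + d^2 + ... + d^l,
# where d is the -d value (branches) and l is the -l value (levels).
count_dirs() {
  d=$1; l=$2; total=1; n=1
  for _ in $(seq 1 "$l"); do n=$(( n * d )); total=$(( total + n )); done
  echo "$total"
}

count_dirs 20 3    # small/shallow:   8421 directories
count_dirs 3 10    # small/deep:     88573 directories
count_dirs 17 2    # medium/shallow:   307 directories
count_dirs 2 10    # medium/deep:     2047 directories
```

Multiplying each total by the -f value reproduces the file counts (e.g. 8,421 × 40 = 336,840 files).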

Each test was run 10 times for each of the four combinations on the three journal devices. The test system was a stock CentOS 5.3 distribution, but with a 2.6.30 kernel and with e2fsprogs upgraded to 1.41.9.

The tests were run on the following system:
  • GigaByte MAA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory
  • Linux 2.6.30 kernel
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS
  • There are two drives for testing. They are Seagate ST3500641AS-RK with 16 MB cache each. These are /dev/sdb and /dev/sdc.
The file system under test was always built on the first Seagate drive, /dev/sdb. The second drive, /dev/sdc, was used only for the second option, where the journal was placed on a second drive.


Journaling Device Details
All three journal device options used the same journal size, 16MB. The reason this size was used is that CentOS boots with a number of ramdisks already created.

However, these devices are limited to 16MB in size.

To make any comparisons fair the size of the journal was kept constant for all three cases.
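For reference, with ext4's 4 KiB blocks a 16MB journal works out to 4,096 file-system blocks, which is the block count mke2fs reports later for the ramdisk journal (the two disk partitions come out slightly larger than 16MB, hence their slightly higher counts):

```shell
# A 16 MB journal expressed in 4 KiB file-system blocks:
echo $(( 16 * 1024 / 4 ))   # KiB divided by KiB-per-block = 4096 blocks
```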

The first journal device option was to keep the journal on the same disk as the file system. The drive was partitioned so that the first partition was used for the file system itself (/dev/sdb1) and the remaining approximately 16MB of the drive was used for the journal (/dev/sdb2).

The first step was to build the file system on /dev/sdb1.
# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
29548544 inodes, 118180156 blocks
5909007 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3607 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second step is to prepare the journal partition for journaling. Recall that the second partition on the drive (/dev/sdb2) is used for this.
# mke2fs -O journal_dev /dev/sdb2
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6024 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to tell the file system that it no longer has a journal in the file system (this is a precursor to telling it that the journal is located somewhere else).
# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              29548544
Block count:              118180156
Reserved block count:     5909007
Free blocks:              116307702
Free inodes:              29548533
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      995
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Dec  7 11:07:20 2009
Last mount time:          n/a
Last write time:          Mon Dec  7 11:10:12 2009
Mount count:              0
Maximum mount count:      36
Last checked:             Mon Dec  7 11:07:20 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Jun  5 12:07:20 2010
Lifetime writes:          7350 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup:           inode blocks

Notice that the line “Filesystem features” does not have the entry “has_journal” indicating that the file system no longer has a journal.

The last step is to tell the file system that it has a journal and it is on the second partition of the drive.
# tune2fs -o journal_data -j -J device=/dev/sdb2 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdb2: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              29548544
Block count:              118180156
Reserved block count:     5909007
Free blocks:              116307702
Free inodes:              29548533
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      995
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Mon Dec  7 11:07:20 2009
Last mount time:          n/a
Last write time:          Mon Dec  7 11:11:12 2009
Mount count:              0
Maximum mount count:      36
Last checked:             Mon Dec  7 11:07:20 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Jun  5 12:07:20 2010
Lifetime writes:          7350 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             b71b315f-40e8-4e93-b868-7ad19f7fee8b
Journal device:           0x0812
Default directory hash:   half_md4
Directory Hash Seed:      ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup:           inode blocks

Notice that the line “Filesystem features” has the value “has_journal” and that the line “Journal device:” has the value “0x0812”, which points to the second partition on the drive.

The second journal device option, where the journal is placed on a second hard drive, is configured using several similar steps.

The first step is to create the file system on /dev/sdb1.
# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
30531584 inodes, 122096000 blocks
6104800 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

The second step is to create a journal on the second drive, /dev/sdc1. This partition was created to be 16MB in size.
# mke2fs -O journal_dev /dev/sdc1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6016 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to use tune2fs to tell the file system on /dev/sdb1 that it doesn’t have a journal.
# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)

# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sun Dec  6 07:22:57 2009
Last mount time:          n/a
Last write time:          Sun Dec  6 07:26:36 2009
Mount count:              0
Maximum mount count:      28
Last checked:             Sun Dec  6 07:22:57 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jun  4 08:22:57 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup:           inode blocks

Notice on the line “Filesystem features” that the feature “has_journal” is not listed. This indicates that the journal has been “removed” from the file system.

The final step is to tell the file system that it has a journal on a specific device - in this case /dev/sdc1.
# tune2fs -o journal_data -j -J device=/dev/sdc1 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdc1: done
This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.

# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sun Dec  6 07:22:57 2009
Last mount time:          n/a
Last write time:          Sun Dec  6 07:27:20 2009
Mount count:              0
Maximum mount count:      28
Last checked:             Sun Dec  6 07:22:57 2009
Check interval:           15552000 (6 months)
Next check after:         Fri Jun  4 08:22:57 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             c3d3c7e7-f465-41c7-a556-80a9cdc865c3
Journal device:           0x0821
Default directory hash:   half_md4
Directory Hash Seed:      7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup:           inode blocks

Looking through the listing you can see that the file system has a journal again (“has_journal” on the line “Filesystem features”) and that the journal device is listed as “0x0821” near the bottom of the listing.

The third journal device option is to place the journal on a ramdisk. This is done in a similar fashion to the previous option, where the journal was put on a second drive.

But recall that the external journal has to be a block device.

The technique used for a ramdisk block device is fairly simple and is based on this article. Despite the article being based on a 2.4 kernel, the techniques are the same.

The first step is to examine which ramdisks have already been created.


# ls -lsa /dev/ram*
0 lrwxrwxrwx 1 root root     4 Dec  6 17:27 /dev/ram -> ram1
0 brw-r----- 1 root disk 1,  0 Dec  6 17:27 /dev/ram0
0 brw-r----- 1 root disk 1,  1 Dec  6 17:27 /dev/ram1
0 brw-r----- 1 root disk 1, 10 Dec  6 17:27 /dev/ram10
0 brw-r----- 1 root disk 1, 11 Dec  6 17:27 /dev/ram11
0 brw-r----- 1 root disk 1, 12 Dec  6 17:27 /dev/ram12
0 brw-r----- 1 root disk 1, 13 Dec  6 17:27 /dev/ram13
0 brw-r----- 1 root disk 1, 14 Dec  6 17:27 /dev/ram14
0 brw-r----- 1 root disk 1, 15 Dec  6 17:27 /dev/ram15
0 brw-r----- 1 root disk 1,  2 Dec  6 17:27 /dev/ram2
0 brw-r----- 1 root disk 1,  3 Dec  6 17:27 /dev/ram3
0 brw-r----- 1 root disk 1,  4 Dec  6 17:27 /dev/ram4
0 brw-r----- 1 root disk 1,  5 Dec  6 17:27 /dev/ram5
0 brw-r----- 1 root disk 1,  6 Dec  6 17:27 /dev/ram6
0 brw-r----- 1 root disk 1,  7 Dec  6 17:27 /dev/ram7
0 brw-r----- 1 root disk 1,  8 Dec  6 17:27 /dev/ram8
0 brw-r----- 1 root disk 1,  9 Dec  6 17:27 /dev/ram9
0 lrwxrwxrwx 1 root root     4 Dec  6 17:27 /dev/ramdisk -> ram0

For this simple example, the ramdisk /dev/ram0 was used. The next step is to expand it to its maximum size of 16MB, without rebooting, using the “dd” command.

# dd if=/dev/zero of=/dev/ram0 bs=1k count=16000
16000+0 records in
16000+0 records out
16384000 bytes (16 MB) copied, 0.0411906 seconds, 398 MB/s
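As a quick sanity check, the byte count dd reports is simply the block size times the block count:

```shell
# dd wrote count * bs bytes: 16000 blocks of 1 KiB each
echo $(( 16000 * 1024 ))   # 16384000 bytes, matching the dd output above
```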

The second step is to create an external journal on the expanded ramdisk.
# mke2fs -O journal_dev /dev/ram0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 4096 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done

The third step is to tell the file system that it does not have a journal.
# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)

The final step is to then tell the file system that it has an external journal on a specific device.
# tune2fs -o journal_data -j -J device=/dev/ram0 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/ram0: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          7438d86f-7e12-4208-ad52-36de72591e0a
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    journal_data
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              30531584
Block count:              122096000
Reserved block count:     6104800
Free blocks:              120161866
Free inodes:              30531573
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      994
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Flex block group size:    16
Filesystem created:       Sat Dec  5 20:15:20 2009
Last mount time:          n/a
Last write time:          Sat Dec  5 20:35:12 2009
Mount count:              0
Maximum mount count:      31
Last checked:             Sat Dec  5 20:15:20 2009
Check interval:           15552000 (6 months)
Next check after:         Thu Jun  3 21:15:20 2010
Lifetime writes:          7590 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal UUID:             d19989da-109e-4fbc-abc5-dc42ce5da249
Journal device:           0x0100
Default directory hash:   half_md4
Directory Hash Seed:      278035d2-49a3-474c-bb13-5174d44fec51
Journal backup:           inode blocks

For all three journal device options the file system was mounted with the “data=ordered” option.
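For reference, the mount invocation for this configuration looks like the line below (the mount point /mnt/test is illustrative; with an external journal, the journal device is located automatically from the journal UUID and device number stored in the superblock):

```shell
# Mount the test file system in ordered journaling mode
# (/mnt/test is an illustrative mount point):
mount -t ext4 -o data=ordered /dev/sdb1 /mnt/test
```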

Benchmark Results

The first combination tested was for small files (4 KiB) with a shallow directory structure. Table 1 below lists the benchmark times as an average over the 10 runs, with the standard deviation in parentheses.


Table 1 - Benchmark Times Small Files (4 KiB) - Shallow Directory Structure

Journal Location       Directory Create   File Create     File Remove    Directory Remove
                       (secs.)            (secs.)         (secs.)        (secs.)
Same Disk Journal      31.10 (0.83)       355.70 (5.39)   76.70 (0.90)    6.40 (0.92)
Second Disk Journal    28.40 (1.28)       346.70 (2.53)   70.90 (0.94)    6.80 (3.89)
Ramdisk Journal        26.30 (0.46)       351.11 (3.33)   70.50 (1.02)   15.70 (0.64)

The first test, directory creates, had an average run time of approximately 30 seconds for all three journal devices, so the results may not be that meaningful. In addition, the directory remove test ran in less than 10 seconds.

Consequently, these two tests may not have much value.

Table 2 below lists the performance results as an average over the 10 runs, with the standard deviation in parentheses.


Table 2 - Performance Results of Small Files (4 KiB) - Shallow Directory Structure

Journal Location       Directory Create   File Create      File Create        File Remove        Directory Remove
                       (Dirs/sec)         (Files/sec)      (KiB/sec)          (Files/sec)        (Dirs/sec)
Same Disk Journal      270.60 (7.32)      946.70 (14.59)   3,788.30 (58.72)   4,391.90 (51.82)   1,353.20 (266.08)
Second Disk Journal    296.50 (13.30)     971.10 (7.06)    3,885.90 (28.73)   4,751.50 (63.66)   1,529.50 (513.69)
Ramdisk Journal        319.40 (5.50)      959.00 (9.25)    3,827.20 (36.44)   4,778.60 (69.29)     536.90 (21.62)

The second combination tested was for small files (4 KiB) with a deep directory structure. Table 3 below lists the benchmark times as an average over the 10 runs, with the standard deviation in parentheses.


Table 3 - Benchmark Times Small Files (4 KiB) - Deep Directory Structure

Journal Location       Directory Create   File Create      File Remove     Directory Remove
                       (secs.)            (secs.)          (secs.)         (secs.)
Same Disk Journal      335.90 (8.93)      627.60 (10.36)   343.30 (6.78)   202.00 (3.58)
Second Disk Journal    324.50 (3.17)      633.30 (7.09)    330.60 (2.15)   214.40 (1.36)
Ramdisk Journal        312.40 (3.56)      624.80 (4.66)    333.00 (3.07)   253.00 (25.78)

All four tests ran longer than 60 seconds, so they are valid for examination.

Table 4 below lists the performance results as an average over the 10 runs, with the standard deviation in parentheses.

Table 4 - Performance Results of Small Files (4 KiB) - Deep Directory Structure

Journal Location       Directory Create   File Create     File Create        File Remove        Directory Remove
                       (Dirs/sec)         (Files/sec)     (KiB/sec)          (Files/sec)        (Dirs/sec)
Same Disk Journal      263.40 (7.05)      564.20 (9.20)   2,258.10 (37.07)   1,031.90 (20.40)   438.10 (7.63)
Second Disk Journal    272.50 (2.80)      559.00 (6.36)   2,237.70 (25.28)   1,071.30 (6.96)    412.40 (2.42)
Ramdisk Journal        282.80 (2.99)      566.60 (4.25)   2,267.80 (17.08)   1,063.60 (9.62)    362.60 (3.95)

The third combination tested was for medium files (4 MiB) with a shallow directory structure. Table 5 below lists the benchmark times as an average over the 10 runs, with the standard deviation in parentheses.


Table 5 - Benchmark Times Medium Files (4 MiB) - Shallow Directory Structure

Journal Location       Directory Create   File Create     File Remove    Directory Remove
                       (secs.)            (secs.)         (secs.)        (secs.)
Same Disk Journal      0.40 (0.49)        155.80 (3.25)   13.20 (3.25)   0.00 (0.00)
Second Disk Journal    0.20 (0.40)        154.40 (3.41)   12.20 (3.49)   0.10 (0.30)
Ramdisk Journal        0.40 (0.49)        153.40 (2.06)   13.20 (3.25)   0.10 (0.30)

For these tests, the first test, directory creates, took less than 1 second. This time is very small and, consequently, the results are not as applicable as some of the other tests. The file remove test took about 10-15 seconds.

Again, this is a very short time and the results may not be as applicable. The last test, directory removes, took 0-1.4 seconds. This time, too, is very short.

Table 6 below lists the performance results as an average over the 10 runs, with the standard deviation in parentheses.


Table 6 - Performance Results of Medium Files (4 MiB) - Shallow Directory Structure

Journal Location       Directory Create   File Create    File Create            File Remove      Directory Remove
                       (Dirs/sec)         (Files/sec)    (KiB/sec)              (Files/sec)      (Dirs/sec)
Same Disk Journal      122.80 (150.40)    19.30 (0.46)   79,344.30 (1,133.38)   250.40 (69.83)    0.00 (0.00)
Second Disk Journal     61.40 (122.80)    19.40 (0.66)   79,570.40 (1,682.43)   271.60 (72.43)   30.70 (92.10)
Ramdisk Journal        122.80 (150.40)    19.80 (0.40)   80,065.70 (1,047.94)   252.30 (84.27)   30.70 (92.10)

The fourth and final combination tested was for medium files (4 MiB) with a deep directory structure. Table 7 below lists the benchmark times as an average over the 10 runs, with the standard deviation in parentheses.


Table 7 - Benchmark Times Medium Files (4 MiB) - Deep Directory Structure

Journal Location       Directory Create   File Create     File Remove    Directory Remove
                       (secs.)            (secs.)         (secs.)        (secs.)
Same Disk Journal      4.20 (0.60)        228.30 (1.35)   16.30 (3.47)   2.30 (0.78)
Second Disk Journal    4.20 (0.60)        225.90 (1.58)   15.30 (2.69)   1.50 (0.50)
Ramdisk Journal        5.50 (0.50)        225.90 (1.51)   14.90 (3.86)   2.40 (0.49)

The first test, directory creates, took only 4-6 seconds, which is very short. The time for the third test, file removal, was also fairly short at 11-19 seconds.

The last test, directory removes, was extremely fast at around 2 seconds. These three results are somewhat suspect because of the short run times.

Table 8 below lists the performance results as an average over the 10 runs, with the standard deviation in parentheses.


Table 8 - Performance Results of Medium Files (4 MiB) - Deep Directory Structure

Journal Location       Directory Create   File Create    File Create          File Remove       Directory Remove
                       (Dirs/sec)         (Files/sec)    (KiB/sec)            (Files/sec)       (Dirs/sec)
Same Disk Journal      497.50 (76.57)     17.40 (0.49)   71,731.90 (422.40)   265.30 (69.85)    1,006.00 (392.48)
Second Disk Journal    497.50 (76.57)     17.80 (0.40)   72,495.30 (507.54)   225.90 (1.58)     1,535.00 (512.00)
Ramdisk Journal        375.00 (34.00)     18.00 (0.45)   72,495.00 (485.26)   225.90 (1.51)       886.60 (167.06)


Benchmark Observations
The first thing to check when examining the results is the time taken to complete each test.

If a test does not run longer than 60 seconds, its result is suspect because not enough time has been allowed for meaningful results.

After that, you can contrast or compare the three journal device options.

The first test, shallow directory structure and small files (Tables 1 and 2), did not have run times greater than 60 seconds except for file create and file remove.

If we examine the results for these two tests, the following observations can be made.
  • File Creation:
    • Putting the journal on a second disk is slightly faster than having it on the same disk.
    • Putting the journal on the ramdisk improved metadata performance compared to having it on the same drive. However, it was slightly slower than putting the journal on a second drive.
  • File Removal:
    • Putting the journal on the same disk is about 10% slower than putting it either on a second disk or a ramdisk.
The second test, small files, deep directory, produced longer run times for the various metadata tests. All tests ran longer than 60 seconds allowing all data to be contrasted or compared.

Tables 3 and 4 compare the results for the three journal devices:
  • Directory Creation:
    • Putting the journal on a second disk is faster than putting the journal on the same disk.
    • Putting the journal on a ramdisk is even faster than putting it on a second disk. It is about 10% faster than the journal on a single disk.
  • File Creation:
    • All three journal device options produce about the same results.
  • File Removal:
    • All three journal device options produce about the same results.
  • Directory Removal:
    • Unexpectedly, putting the journal on the same disk is faster than putting it on a second disk, by about 5%.
    • Perhaps even more unexpectedly, putting the journal on a ramdisk is slower than putting it on a second disk. Moreover, the ramdisk journal is approximately 14% slower than having the journal on the same disk.
The third case, medium files (4 MiB) with a shallow directory structure, had only one test that ran longer than 60 seconds: file create.

Tables 5 and 6 contain the results for this test. Comparing the results for the three journal locations leads to the following observation:
  • File Create:
    • All three journal device options produce about the same results.
The last test is for medium files with a deep directory structure. Tables 7 and 8 contain the times and results for this test for all three journal devices.

As with the previous test (medium files, shallow directory structure), only one test, file creation, ran longer than 60 seconds.

Comparing the results for the three devices leads to the following observation:
  • File Creation:
    • Putting the journal on a second hard drive or a ramdisk produced slightly faster results than putting it on the same disk.

Summary
The journal is an important aspect of a file system from a data integrity perspective and also a performance perspective.

Many file systems in Linux allow you to put the journal on a different device. This flexibility gives the opportunity to use various block devices to improve performance.

This article examines three options for placing the ext4 journal: (1) on the same disk as the file system, (2) on a second drive, and (3) on a ramdisk.
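As a sketch, the three placements can be created with the e2fsprogs tools. The device names below (/dev/sda1, /dev/sdb1, /dev/ram0) are placeholders, not the actual devices used in these tests:

```shell
# Option 1: internal journal on the same disk (the mkfs.ext4 default)
mkfs.ext4 /dev/sda1

# Option 2: external journal on a partition of a second drive
mke2fs -O journal_dev /dev/sdb1          # format the journal device
mkfs.ext4 -J device=/dev/sdb1 /dev/sda1  # file system uses that journal

# Option 3: external journal on a ramdisk
mke2fs -O journal_dev /dev/ram0
mkfs.ext4 -J device=/dev/ram0 /dev/sda1
```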

To contrast the three options, metadata tests were run using the fdtree benchmark. This metadata benchmark is easy to run (it requires only bash) and has been used before in metadata testing.
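For reference, fdtree is driven entirely from the command line. A run might look like the following; the flag meanings are from fdtree's usage, but the specific values here are illustrative assumptions, not the article's exact parameters:

```shell
# Assumed fdtree flags: -l levels deep, -d subdirectories per level,
# -f files per directory, -s file size in 4 KiB blocks.
# Example: a deep tree of 4 MiB files (1024 x 4 KiB blocks each).
./fdtree.bash -l 4 -d 5 -f 10 -s 1024
```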

The results are somewhat mixed, with no journal location emerging as a clear winner. One would have expected the ramdisk to produce the fastest metadata performance, but the ramdisk was faster than the other two options in only one or two instances.

In a larger set of cases, using a second hard drive was found to be just as fast or faster than using a ramdisk.

The reason the ramdisk did not produce faster results is not known at this time. Further testing will have to be performed, but there is some speculation that the size of the journal played a role.

If you compare the results in this article to those in a previous article, you will see that the results here are much slower.

It is presumed that this is because the size of the journal was artificially constrained to 16 MB. Future testing will focus on determining if this is the cause.
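A 16 MB external journal like the one used here can be reproduced by sizing the journal device explicitly when it is formatted; /dev/ram0 is again a placeholder device name:

```shell
# mke2fs accepts a size (in blocks) after the device name: with a
# 4 KiB block size, 4096 blocks gives a 16 MB journal.
mke2fs -b 4096 -O journal_dev /dev/ram0 4096
# Future tests could enlarge it, e.g. 64 MB = 16384 blocks:
# mke2fs -b 4096 -O journal_dev /dev/ram0 16384
```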
