There is always a relentless pursuit of more performance from our storage systems. This includes more performance from hardware (faster disks, SSDs), networks (bigger pipes, larger MTUs), operating systems (caching, IO schedulers), and file systems.
There are many levers that can be pulled to improve performance, but this article will look at one particular piece - the file system journal device.
In particular, the metadata performance of ext4 will be considered as the journal is moved to different devices.
Journaling for File Systems
Sometimes bad things, such as power failures, happen to systems. A power interruption or failure can corrupt a file system very quickly because an IO operation is interrupted and never completed. Consequently, the file system has to be checked (fsck), which means the entire file system has to be walked to find and correct any problems.
As file systems grew, the amount of time it took to walk the file system greatly increased. For example, the author remembers performing an fsck on a 1TB file system in 2002-2003 that took several days. Having the system down for that amount of time is very painful.
One way to help improve fsck times is to use a journaled file system. Rather than IO operations happening directly to the file system, the operations are added to the journal (typically a log) in the order they are supposed to happen.
The file system then takes the operation from the head of the journal and performs it, erasing it from the journal only once the file system is satisfied that the operation is complete.
If the power is lost during the operation on a journaled file system, when the system comes back up, the journal is just “replayed,” i.e. the operations in the journal are performed one at a time starting at the beginning.
This means that the entire file system doesn’t necessarily have to be checked (walked). The primary reason this can be done is that the interruption happens before the operation is removed from the journal.
Even if the operation wasn’t completed on the file system, replaying the operation ensures that the IO operation actually occurs.
If the interruption happened while the operation was being deleted from the journal, the file system can assume that the operation happened and it just deletes the “corrupted” operation from the head of the journal.
As a result, you should not have to walk the entire file system to repair problems. Only the journal needs to be replayed.
This means that instead of spending a couple of days waiting for an fsck to finish, a very fast replay of the journal is performed taking just minutes.
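As a quick, hedged illustration (the device name is only an example, not something required by the article), you can see whether an ext3/ext4 file system carries a journal, and some of its journal details, by inspecting the superblock with dumpe2fs from e2fsprogs:

# Show only the superblock header (-h) and pull out the journal-related fields.
dumpe2fs -h /dev/sdb1 | grep -i journal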
The journal can theoretically reside anywhere within the system on any device. It can be on the drive containing the file system, on a partition of another drive, or on any other block device you have lying around.
But choosing the "best device" is important. The journal is critical to the integrity of the file system, so it should be placed on a device with some resiliency (resiliency in this case means the ability to tolerate errors or problems).
At the same time, everyone loves more performance (there is likely no one who has said, “you know, I want my storage to go slower.”).
Since the performance of the journal can be key to the performance of the file system, perhaps improving the performance of the journaling device and the journal itself can help overall file system performance.
Testing the Metadata Performance
In this article, three options for the journal device will be tested to determine the impact of journal device location on the metadata performance of ext4. The three device options are:
- Journal on the same disk as the file system
- Journal on a different disk from the file system
- Journal on a ram disk
A ramdisk is not a realistic journal device for production use, since its contents are lost when the system loses power, which defeats the purpose of the journal. However, it is included here as an "upper bound" on performance.
One of the ways that journal performance can impact overall file system performance is in metadata performance.
This article will focus on metadata performance as measured by fdtree. This benchmark has been used before to examine the metadata performance of various Linux file systems.
To read about fdtree and how it was used for benchmarking, please read the original article.
As a quick recap, the benchmark, fdtree, is a simple bash script that performs four different metadata tests:
- Directory creation
- File creation
- File removal
- Directory removal
Fdtree was used in four different combinations to stress the metadata capability:

- Small files (4 KiB)
  - Shallow directory structure
  - Deep directory structure
- Medium files (4 MiB)
  - Shallow directory structure
  - Deep directory structure
The two directory structures were used to stress the metadata in different ways to discover if there is any impact on the metadata performance.
The shallow directory structure means that there are many directories but not very many levels down. The deep directory structure means that there are not many directories at a particular level but that there are many levels.
The command lines for the four combinations are:
Small Files - Shallow Directory Structure
./fdtree.bash -d 20 -f 40 -s 1 -l 3
This command creates 20 sub-directories from each upper level directory at each level ("-d 20") and there are 3 levels ("-l 3"). It's a basic tree structure.
This is a total of 8,421 directories. In each directory there are 40 files ("-f 40"), each sized at 1 block (4 KiB) as denoted by "-s 1". This is a total of 336,840 files and 1,347,360 KiB total data.
Small Files - Deep Directory Structure
./fdtree.bash -d 3 -f 4 -s 1 -l 10
This command creates 3 sub-directories from each upper level directory at each level ("-d 3") and there are 10 levels ("-l 10").
This is a total of 88,573 directories. In each directory there are 4 files each sized at 1 block (4 KiB). This is a total of 354,292 files and 1,417,168 KiB total data.
Medium Files - Shallow Directory Structure
./fdtree.bash -d 17 -f 10 -s 1000 -l 2
This command creates 17 sub-directories from each upper level directory at each level ("-d 17") and there are 2 levels ("-l 2").
This is a total of 307 directories. In each directory there are 10 files each sized at 1,000 blocks (4 MiB). This is a total of 3,070 files and 12,280,000 KiB total data.
Medium Files - Deep Directory Structure
./fdtree.bash -d 2 -f 2 -s 1000 -l 10
This command creates 2 sub-directories from each upper level directory at each level ("-d 2") and there are 10 levels ("-l 10").
This is a total of 2,047 directories. In each directory there are 2 files each sized at 1,000 blocks (4 MiB).
This is a total of 4,094 files and 16,376,000 KiB total data.
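The directory and file totals quoted for the four combinations follow directly from the geometric growth of the tree. As a hedged sketch (this helper is not part of fdtree; it simply re-derives the numbers above, assuming the totals include the top-level directory), the counts can be checked with a few lines of bash:

# Re-derive the totals for a tree with D sub-directories per directory,
# L levels, F files per directory, and files of S blocks (4 KiB per block).
tree_totals() {
    local D=$1 L=$2 F=$3 S=$4
    local dirs=0 i
    for ((i = 0; i <= L; i++)); do
        dirs=$(( dirs + D**i ))
    done
    echo "dirs=$dirs files=$(( dirs * F )) data_KiB=$(( dirs * F * S * 4 ))"
}

tree_totals 20 3 40 1      # dirs=8421  files=336840 data_KiB=1347360
tree_totals 3 10 4 1       # dirs=88573 files=354292 data_KiB=1417168
tree_totals 17 2 10 1000   # dirs=307   files=3070   data_KiB=12280000
tree_totals 2 10 2 1000    # dirs=2047  files=4094   data_KiB=16376000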
Each test was run 10 times with the four combinations for the three journal devices. The test system used for these tests was a stock CentOS 5.3 distribution but with a 2.6.30 kernel and e2fsprogs was upgraded to 1.41.9.
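The article does not show the driver used to repeat the runs; a minimal sketch of one way the ten repetitions might be scripted (the log file names and redirection are assumptions, not taken from the article) is:

#!/bin/bash
# Hypothetical wrapper: run one fdtree combination ten times, keeping each log.
for i in $(seq 1 10); do
    ./fdtree.bash -d 20 -f 40 -s 1 -l 3 > fdtree_run_${i}.log 2>&1
done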
The tests were run on the following system:
- GigaByte MAA78GM-US2H motherboard
- An AMD Phenom II X4 920 CPU
- 8GB of memory
- Linux 2.6.30 kernel
- The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
- /home is on a Seagate ST1360827AS
- There are two drives for testing. They are Seagate ST3500641AS-RK drives with a 16 MB cache each. These are /dev/sdb and /dev/sdc.

The file system under test was built on /dev/sdb for all of the tests. The second drive, /dev/sdc, was used only for the second test, where the journal was placed on a second drive.

Journaling Device Details
All three journal device options used the same journal size, 16MB. The reason for this size is that CentOS boots with a number of ramdisks already created; however, these devices are limited to 16MB in size. To make the comparisons fair, the size of the journal was kept constant for all three cases.
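As a side note, if a journal partition were larger than the desired journal, the external journal could still be constrained by passing an explicit block count to mke2fs. This is a hedged sketch, not a command from the test runs in this article; the device name and block count are assumptions:

# Create a 16 MiB external journal (4096 blocks of 4 KiB) even if /dev/sdc1 is larger.
mke2fs -b 4096 -O journal_dev /dev/sdc1 4096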
The first journal device option was to keep the journal on the same disk as the file system. The drive was partitioned so that the first partition was used for the file system itself (/dev/sdb1) and the remaining approximately 16MB of the drive was used for the journal (/dev/sdb2).

The first step was to build the file system on /dev/sdb1.

# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
29548544 inodes, 118180156 blocks
5909007 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3607 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
The second step is to prepare the journal partition for journaling. Recall that the second partition on the drive (/dev/sdb2) is used for this.

# mke2fs -O journal_dev /dev/sdb2
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6024 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done
The third step is to tell the file system that it no longer has a journal in the file system (this is a precursor to telling it that the journal is located somewhere else).
# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 29548544
Block count: 118180156
Reserved block count: 5909007
Free blocks: 116307702
Free inodes: 29548533
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 995
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Mon Dec 7 11:07:20 2009
Last mount time: n/a
Last write time: Mon Dec 7 11:10:12 2009
Mount count: 0
Maximum mount count: 36
Last checked: Mon Dec 7 11:07:20 2009
Check interval: 15552000 (6 months)
Next check after: Sat Jun 5 12:07:20 2010
Lifetime writes: 7350 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup: inode blocks
Notice that the line “Filesystem features” does not have the entry “has_journal” indicating that the file system no longer has a journal.
The last step is to tell the file system that it has a journal and it is on the second partition of the drive.
# tune2fs -o journal_data -j -J device=/dev/sdb2 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdb2: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 99486587-5d38-4896-bf0a-ec79f9ac1d88
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: journal_data
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 29548544
Block count: 118180156
Reserved block count: 5909007
Free blocks: 116307702
Free inodes: 29548533
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 995
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Mon Dec 7 11:07:20 2009
Last mount time: n/a
Last write time: Mon Dec 7 11:11:12 2009
Mount count: 0
Maximum mount count: 36
Last checked: Mon Dec 7 11:07:20 2009
Check interval: 15552000 (6 months)
Next check after: Sat Jun 5 12:07:20 2010
Lifetime writes: 7350 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal UUID: b71b315f-40e8-4e93-b868-7ad19f7fee8b
Journal device: 0x0812
Default directory hash: half_md4
Directory Hash Seed: ed707821-9ec0-44c7-9c4a-15812b753939
Journal backup: inode blocks
Notice that the line "Filesystem features" now includes "has_journal" and that the line "Journal device:" has the value 0x0812, which points to the second partition on the drive.
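The journal device number is simply the major/minor pair of the block device, packed in hex. As a hedged check (not one of the original article's steps), you can decode it from the shell:

# 0x0812 -> major 0x08 = 8, minor 0x12 = 18, which is /dev/sdb2 on this system.
printf 'major=%d minor=%d\n' 0x08 0x12
ls -l /dev/sdb2    # the "8, 18" in the listing should match the decoded pair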
The second journal device option, where the journal is placed on a second hard drive, is configured using several steps.
The first step is to create the file system on /dev/sdb1.

# mke2fs -t ext4 /dev/sdb1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
30531584 inodes, 122096000 blocks
6104800 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
3727 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
The second step is to create a journal on the second drive, /dev/sdc1. This partition was created to be 16MB in size.

# mke2fs -O journal_dev /dev/sdc1
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 6016 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done
The third step is to use tune2fs to tell the file system that it doesn't have a journal on /dev/sdb1.

# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 30531584
Block count: 122096000
Reserved block count: 6104800
Free blocks: 120161866
Free inodes: 30531573
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 994
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Sun Dec 6 07:22:57 2009
Last mount time: n/a
Last write time: Sun Dec 6 07:26:36 2009
Mount count: 0
Maximum mount count: 28
Last checked: Sun Dec 6 07:22:57 2009
Check interval: 15552000 (6 months)
Next check after: Fri Jun 4 08:22:57 2010
Lifetime writes: 7590 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Default directory hash: half_md4
Directory Hash Seed: 7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup: inode blocks
Notice on the line "Filesystem features" that the feature "has_journal" is not listed. This indicates that the journal has been "removed" from the file system.
The final step is to tell the file system that it has a journal that is on a specific device - in this case, /dev/sdc1.

# tune2fs -o journal_data -j -J device=/dev/sdc1 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/sdc1: done
This filesystem will be automatically checked every 28 mounts or
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 14a11690-76a6-4a3d-997a-abf85bd4d4ad
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: journal_data
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 30531584
Block count: 122096000
Reserved block count: 6104800
Free blocks: 120161866
Free inodes: 30531573
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 994
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Sun Dec 6 07:22:57 2009
Last mount time: n/a
Last write time: Sun Dec 6 07:27:20 2009
Mount count: 0
Maximum mount count: 28
Last checked: Sun Dec 6 07:22:57 2009
Check interval: 15552000 (6 months)
Next check after: Fri Jun 4 08:22:57 2010
Lifetime writes: 7590 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal UUID: c3d3c7e7-f465-41c7-a556-80a9cdc865c3
Journal device: 0x0821
Default directory hash: half_md4
Directory Hash Seed: 7d24bc9d-db4a-4c0c-b15d-f0959af6edde
Journal backup: inode blocks
Looking through the listing you can see that the file system has a journal again ("has_journal" on the line "Filesystem features") and that the journal device is listed as "0x0821" (major 8, minor 33, i.e. /dev/sdc1) near the bottom of the listing.
The third journal device option is to place the journal on a ramdisk. This is done in a similar fashion to the previous option, where the journal was put on a second drive.
But recall that the external journal has to be a block device.
The technique used for a ramdisk block device is fairly simple and is based on this article. Despite the article being based on a 2.4 kernel, the techniques are the same.
The first step is to examine which ramdisks have already been created.
# ls -lsa /dev/ram*
0 lrwxrwxrwx 1 root root 4 Dec 6 17:27 /dev/ram -> ram1
0 brw-r----- 1 root disk 1, 0 Dec 6 17:27 /dev/ram0
0 brw-r----- 1 root disk 1, 1 Dec 6 17:27 /dev/ram1
0 brw-r----- 1 root disk 1, 10 Dec 6 17:27 /dev/ram10
0 brw-r----- 1 root disk 1, 11 Dec 6 17:27 /dev/ram11
0 brw-r----- 1 root disk 1, 12 Dec 6 17:27 /dev/ram12
0 brw-r----- 1 root disk 1, 13 Dec 6 17:27 /dev/ram13
0 brw-r----- 1 root disk 1, 14 Dec 6 17:27 /dev/ram14
0 brw-r----- 1 root disk 1, 15 Dec 6 17:27 /dev/ram15
0 brw-r----- 1 root disk 1, 2 Dec 6 17:27 /dev/ram2
0 brw-r----- 1 root disk 1, 3 Dec 6 17:27 /dev/ram3
0 brw-r----- 1 root disk 1, 4 Dec 6 17:27 /dev/ram4
0 brw-r----- 1 root disk 1, 5 Dec 6 17:27 /dev/ram5
0 brw-r----- 1 root disk 1, 6 Dec 6 17:27 /dev/ram6
0 brw-r----- 1 root disk 1, 7 Dec 6 17:27 /dev/ram7
0 brw-r----- 1 root disk 1, 8 Dec 6 17:27 /dev/ram8
0 brw-r----- 1 root disk 1, 9 Dec 6 17:27 /dev/ram9
0 lrwxrwxrwx 1 root root 4 Dec 6 17:27 /dev/ramdisk -> ram0
For this simple example, the ramdisk /dev/ram0 was used. The next step is to expand it to its maximum size of 16MB, without rebooting the kernel, using the "dd" command.

# dd if=/dev/zero of=/dev/ram0 bs=1k count=16000
16000+0 records in
16000+0 records out
16384000 bytes (16 MB) copied, 0.0411906 seconds, 398 MB/s
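To double-check that the ramdisk really is the expected size before putting a journal on it, a hedged one-liner (not part of the original article's steps) is:

# Report the ramdisk size in bytes; 16777216 would correspond to a full 16 MiB device.
blockdev --getsize64 /dev/ram0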
The second step is to create an external journal on the expanded ramdisk.
# mke2fs -O journal_dev /dev/ram0
mke2fs 1.41.9 (22-Aug-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
0 inodes, 4096 blocks
0 blocks (0.00%) reserved for the super user
First data block=0
0 block group
32768 blocks per group, 32768 fragments per group
0 inodes per group
Superblock backups stored on blocks:

Zeroing journal device: done
The third step is to tell the file system that it does not have a journal.
# tune2fs -O ^has_journal /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
The final step is to then tell the file system that it has an external journal on a specific device.
# tune2fs -o journal_data -j -J device=/dev/ram0 /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Creating journal on device /dev/ram0: done
This filesystem will be automatically checked every 31 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# tune2fs -l /dev/sdb1
tune2fs 1.41.9 (22-Aug-2009)
Filesystem volume name:
Last mounted on:
Filesystem UUID: 7438d86f-7e12-4208-ad52-36de72591e0a
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: journal_data
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 30531584
Block count: 122096000
Reserved block count: 6104800
Free blocks: 120161866
Free inodes: 30531573
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 994
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Sat Dec 5 20:15:20 2009
Last mount time: n/a
Last write time: Sat Dec 5 20:35:12 2009
Mount count: 0
Maximum mount count: 31
Last checked: Sat Dec 5 20:15:20 2009
Check interval: 15552000 (6 months)
Next check after: Thu Jun 3 21:15:20 2010
Lifetime writes: 7590 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal UUID: d19989da-109e-4fbc-abc5-dc42ce5da249
Journal device: 0x0100
Default directory hash: half_md4
Directory Hash Seed: 278035d2-49a3-474c-bb13-5174d44fec51
Journal backup: inode blocks
For all three journal device options, the file system was mounted with the "data=ordered" option.
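For completeness, here is a hedged sketch of the corresponding mount command (the mount point /mnt/test is an assumption; the article only states the mount option):

# Mount the test file system with ordered-mode journaling.
mount -t ext4 -o data=ordered /dev/sdb1 /mnt/test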
Benchmark Results
The first combination tested was for small files (4 KiB) with a shallow directory structure. Table 1 below lists the benchmark times; each cell gives the average value followed by the standard deviation.
Table 1 - Benchmark Times Small Files (4 KiB) - Shallow Directory Structure
Journal Location | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.) |
---|---|---|---|---|
Same Disk Journal | 31.10 0.83 | 355.70 5.39 | 76.70 0.90 | 6.40 0.92 |
Second Disk Journal | 28.40 1.28 | 346.70 2.53 | 70.90 0.94 | 6.80 3.89 |
Ramdisk Journal | 26.30 0.46 | 351.11 3.33 | 70.50 1.02 | 15.70 0.64 |
The first test, directory creates, had an average run time of approximately 30 seconds for all three journal devices, so the results may not be that meaningful. In addition, the directory remove test ran in less than 10 seconds. Consequently, these tests may not have much value.
Table 2 below lists the performance results; each cell gives the average value followed by the standard deviation.
Table 2 - Performance Results of Small Files (4 KiB) - Shallow Directory Structure
Journal Location | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec) |
---|---|---|---|---|---|
Same Disk Journal | 270.60 7.32 | 946.70 14.59 | 3,788.30 58.72 | 4,391.90 51.82 | 1,353.20 266.08 |
Second Disk Journal | 296.50 13.30 | 971.10 7.06 | 3,885.90 28.73 | 4,751.50 63.66 | 1,529.50 513.69 |
Ramdisk Journal | 319.40 5.50 | 959.00 9.25 | 3,827.20 36.44 | 4,778.60 69.29 | 536.90 21.62 |
The second combination tested was for small files (4 KiB) with a deep directory structure. Table 3 below lists the benchmark times; each cell gives the average value followed by the standard deviation.
Table 3 - Benchmark Times Small Files (4 KiB) - Deep Directory Structure
Journal Location | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.) |
---|---|---|---|---|
Same Disk Journal | 335.90 8.93 | 627.60 10.36 | 343.30 6.78 | 202.00 3.58 |
Second Disk Journal | 324.50 3.17 | 633.30 7.09 | 330.60 2.15 | 214.40 1.36 |
Ramdisk Journal | 312.40 3.56 | 624.80 4.66 | 333.00 3.07 | 253.00 25.78 |
All four tests ran longer than 60 seconds, so they are valid for examination.
Table 4 below lists the performance results; each cell gives the average value followed by the standard deviation.
Table 4 - Performance Results of Small Files (4 KiB) - Deep Directory Structure
Journal Location | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec) |
---|---|---|---|---|---|
Same Disk Journal | 263.40 7.05 | 564.20 9.20 | 2,258.10 37.07 | 1,031.90 20.40 | 438.10 7.63 |
Second Disk Journal | 272.50 2.80 | 559.00 6.36 | 2,237.70 25.28 | 1,071.30 6.96 | 412.40 2.42 |
Ramdisk Journal | 282.80 2.99 | 566.60 4.25 | 2,267.80 17.08 | 1,063.60 9.62 | 362.60 3.95 |
The third combination tested was for medium files (4 MiB) with a shallow directory structure. Table 5 below lists the benchmark times; each cell gives the average value followed by the standard deviation.
Table 5 - Benchmark Times Medium Files (4 MiB) - Shallow Directory Structure
Journal Location | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.) |
---|---|---|---|---|
Same Disk Journal | 0.40 0.49 | 155.80 3.25 | 13.20 3.25 | 0.00 0.00 |
Second Disk Journal | 0.20 0.40 | 154.40 3.41 | 12.20 3.49 | 0.10 0.30 |
Ramdisk Journal | 0.40 0.49 | 153.40 2.06 | 13.20 3.25 | 0.10 0.30 |
For these tests, the first test, directory creates, took less than 1 second. This time is very small and, consequently, the results are not as applicable as some of the other tests. The file remove test took about 10-15 seconds.
Again, this is a very short time and the results may not be as applicable. The last test, directory removes, took 0-1.4 seconds. This time, too, is very short.
Table 6 below lists the performance results; each cell gives the average value followed by the standard deviation.
Table 6 - Performance Results of Medium Files (4 MiB) - Shallow Directory Structure
Journal Location | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec) |
---|---|---|---|---|---|
Same Disk Journal | 122.80 150.40 | 19.30 0.46 | 79,344.30 1,133.38 | 250.40 69.83 | 0.00 0.00 |
Second Disk Journal | 61.40 122.80 | 19.40 0.66 | 79,570.40 1,682.43 | 271.60 72.43 | 30.70 92.10 |
Ramdisk Journal | 122.80 150.40 | 19.80 0.40 | 80,065.70 1,047.94 | 252.30 84.27 | 30.70 92.10 |
The fourth and final combination tested was for medium files (4 MiB) with a deep directory structure. Table 7 below lists the benchmark times; each cell gives the average value followed by the standard deviation.
Table 7 - Benchmark Times Medium Files (4 MiB) - Deep Directory Structure
Journal Location | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.) |
---|---|---|---|---|
Same Disk Journal | 4.20 0.60 | 228.30 1.35 | 16.30 3.47 | 2.30 0.78 |
Second Disk Journal | 4.20 0.60 | 225.90 1.58 | 15.30 2.69 | 1.50 0.50 |
Ramdisk Journal | 5.50 0.50 | 225.90 1.51 | 14.90 3.86 | 2.40 0.49 |
The first test, directory creates, took 2-3 seconds, which is very short. The time for the third test, file removal, was also fairly short at 11-19 seconds.
The last test, directory removes, was extremely fast at less than 2 seconds. These three results are somewhat suspect because of the short run times.
Table 8 below lists the performance results; each cell gives the average value followed by the standard deviation.
Table 8 - Performance Results of Medium Files (4 MiB) - Deep Directory Structure
Journal Location | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec) |
---|---|---|---|---|---|
Same Disk Journal | 497.50 76.57 | 17.40 0.49 | 71,731.90 422.40 | 265.30 69.85 | 1,006.00 392.48 |
Second Disk Journal | 497.50 76.57 | 17.80 0.40 | 72,495.30 507.54 | 225.90 1.58 | 1,535.00 512.00 |
Ramdisk Journal | 375.00 34.00 | 18.00 0.45 | 72,495.00 485.26 | 225.90 1.51 | 886.60 167.06 |
Benchmark Observations
The first thing you should check when examining the results is the time to complete the test. If a test does not run longer than 60 seconds, then it is suspect because not enough time has been allowed for meaningful results.
After that, you can compare and contrast the three journal device options.
The first test, shallow directory structure and small files (Tables 1 and 2), did not have run times greater than 60 seconds except for file create and file remove.
If we examine the results for these two tests, the following observations can be made.
- File Creation:
  - Putting the journal on a second disk is slightly faster than having it on the same disk.
  - Putting the journal on the ramdisk improved metadata performance in comparison to having it on the same drive. However, it was slightly slower than putting the journal on a second drive.
- File Removal:
  - Putting the journal on the same disk is about 10% slower than putting it either on a second disk or a ramdisk.
For the second combination (small files, deep directory structure), Tables 3 and 4 are used to compare the results for the three journal devices:
- Directory Creation:
  - Putting the journal on a second disk is faster than putting the journal on the same disk.
  - Putting the journal on a ramdisk is faster still; it is about 10% faster than having the journal on the same disk.
- File Creation:
  - All three journal device options produce about the same results.
- File Removal:
  - All three journal device options produce about the same results.
- Directory Removal:
  - Unexpectedly, putting the journal on the same disk is about 5% faster than putting it on a second disk.
  - Perhaps even more unexpectedly, putting the journal on a ramdisk is slower than putting it on a second disk. Moreover, the ramdisk journal is approximately 14% slower than having the journal on the same disk.
Tables 5 and 6 contain the results for the third combination (medium files, shallow directory structure). Comparing the results for the three journal locations leads to the following observation:
- File Create:
  - All three journal device options produce about the same results.
As with the medium files, shallow directory test (the previous combination), only one test, file creation, ran longer than 60 seconds.
Comparing the results for the three devices yields the following observation:
- File Creation:
  - Putting the journal on a second hard drive or a ramdisk produced slightly faster results than putting it on the same disk.
Summary
The journal is an important aspect of a file system from both a data integrity perspective and a performance perspective. Many file systems in Linux allow you to put the journal on a different device. This flexibility gives you the opportunity to use various block devices to improve performance.
This article examines three options for placing the journal for ext4: (1) on the same disk as the file system, (2) on a second drive, and (3) on a ramdisk.
To contrast the three options, metadata tests were run using the fdtree benchmark. This metadata benchmark is easy to run (it only requires bash) and has been used before in metadata testing.
The results are a bit mixed, with no journal location emerging as a clear winner. One would have expected the ramdisk to produce the fastest metadata performance, but only in one or two instances was the ramdisk faster than the other two options.
In a larger set of cases, using a second hard drive was found to be just as fast or faster than using a ramdisk.
The reason the ramdisk did not produce faster results is not known at this time. Further testing will have to be performed, but there is some speculation that the size of the journal played a role.
If you compare the results in this article to those in a previous article, you will see that the results here are much slower.
It is presumed that this is because the size of the journal was artificially constrained to 16 MB. Future testing will focus on determining if this is the cause.
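For anyone wanting to try the larger-journal experiment ahead of that follow-up, here is a hedged sketch (sizes and device names are assumptions, not results or commands from this article's test runs); it simply reuses the same mke2fs/tune2fs steps shown earlier, with an explicit block count for a bigger external journal:

# Example: a 128 MiB external journal (32768 blocks of 4 KiB) on the second drive.
mke2fs -b 4096 -O journal_dev /dev/sdc1 32768
tune2fs -O ^has_journal /dev/sdb1
tune2fs -o journal_data -j -J device=/dev/sdc1 /dev/sdb1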