Sunday, February 6, 2011

Finding the Fastest Filesystem, 2011 Edition


In my previous report about journaling filesystem benchmarking using dbench, I observed that a properly-tuned system using XFS, with the deadline I/O scheduler, beat both Linux’s ext3 and IBM’s JFS.

A lot has changed in the three years since I posted that report, so it’s time to do a new round of tests. Many bug fixes, improved kernel lock management, and two new filesystems (btrfs and ext4) bring some new configurations to test.

Once again, I’ll provide raw numbers, but the emphasis of this report lies in the relative performance of the filesystems under various loads and configurations.

To this end, I have normalized the charted data, and eliminated the raw numbers on the Y-axes. Those who wish to run similar tests on their own systems can download a tarball containing the testing scripts; I’ll provide the link to the tarball at the end of this report.

System configuration
The test system is my desktop at home, an AMD Athlon 64 X2 dual-core 4800+ with 4 gigs of RAM and two SATA interfaces.

The drives use separate IRQs, with the journal on sda using IRQ 20, and the primary filesystem on sdb using IRQ 21.

The kernel under test is Linux 2.6.38-rc2, which now has built-in IRQ balancing between CPUs/cores. The installed distribution is Slackware64-current.

During the tests, the system was in runlevel 1 (single-user), and I didn’t touch the keyboard at any time except during unmeasured warm-ups.

The motherboard chipset supposedly supports Native Command Queuing, but the Linux kernel disables it due to hardware bugs.

Even with this limitation, “hdparm -Tt” reports about 950 MB/s for cached reads on both drives, and buffered disk reads of 65 MB/s for sda and 76 MB/s for sdb.

That raw throughput serves my usual desktop purposes well.

Filesystem options
This time I improved on my hand-written notes by formalizing and scripting the filesystem initialization and mounting options.

I also broadened the list of tested filesystems, adding ext2, ext4, ReiserFS, and btrfs.

All filesystems were mounted with at least “noatime,nodiratime” in the mount options; this is becoming standard practice for many Unix and Linux sites, where system administrators question the value of writing a new access time whenever a file is merely read.

A quick perusal of Documentation/filesystems/ in the kernel source tree turned up a treasure trove of mount options, even for the experimental btrfs.

One unsafe option I added where possible was to disable write barriers. Buffered writes can be the bane of journal integrity, so write barriers attempt to force the drive to write to the permanent storage sooner rather than later, at the cost of limiting the I/O elevator’s benefits. For btrfs and ext4, I opted for bandwidth in my short tests.

The btrfs format isn’t yet finalized, so it is completely unsuitable for storage of critical data. Still, it has been getting a lot of press coverage and online comment, with a big boost from Ted Ts’o, who called it “the future of Linux filesystems.” Strictly speaking, btrfs isn’t a filesystem with a journal.

It’s a copy-on-write filesystem, which keeps its on-disk state consistent without a separate journal. Btrfs supports RAID striping of data, metadata, or both, so I opted to enable RAID1 to distribute the I/O load:

mkfs.btrfs -d raid1 -m raid1 ${LOGDEV} ${PRIMARY}
(EDIT: I should have used RAID0 for striping. I will re-run the btrfs tests and post the adjusted results.)

The btrfs mount options added “nobarrier,space_cache” for performance.

I added ext2, to provide a reference point based on highly stable code. It provided one of the early surprises in the tests.

mke2fs ${PRIMARY}

The default features enabled in /etc/mke2fs.conf were:

“sparse_super,filetype,resize_inode,dir_index,ext_attr”, with no mount options beyond “noatime,nodiratime”.

mke2fs -O journal_dev ${LOGDEV}
mke2fs -J device=${LOGDEV} ${PRIMARY}
For ext3, the only addition to the base ext2 features is the journal. The mount options added for this test were “data=writeback,nobh,commit=30”.

The other new Linux filesystem is ext4, which adds several new features over ext2/3. The most notable feature replaces block maps with extents, which require less on-disk space for tracking the same amount of file data.

The ext4 journal also has stronger integrity checking than ext3 uses. (Another feature, not used in this test, is the ability to omit the journal from an ext4 filesystem. Combined with the efficiency of extents, this makes ext4 a strong candidate for flash storage, using fewer writes for the same amount of file data.)

mke2fs -O journal_dev ${LOGDEV}
mke2fs -E lazy_itable_init=0 -O extents -J device=${LOGDEV} ${PRIMARY}
The features from /etc/mke2fs.conf were:

“has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize”, but the “uninit_bg” feature was overridden by specifying “-E lazy_itable_init=0” to mke2fs. This reduces extra background work during the dbench run.

Just as was the case three years ago, JFS still has no mkfs or mount options useful for testing throughput. WYSIAYG (What You See Is ALL You Get).

mkfs.jfs -q -j ${LOGDEV} ${PRIMARY}

I caught a lot of guff three years ago, for omitting ReiserFS from my testing. This time around, I decided that, if btrfs is good enough to test, even though it’s still in beta, then I should be fair to the ReiserFS community and include it as well.

Specifying “-f” twice skips the request for confirmation, useful for scripting.

mkreiserfs -f -f -j ${LOGDEV} ${PRIMARY}

Unfortunately, there is no file explaining ReiserFS options in Documentation/filesystems/, and the best advice in mount(8) uses weasel-words: “This [option] may provide performance improvements in some situations.”

Without an explanation of what situations would benefit from the various options, I saw no point in testing them.

Hence, the only non-default option in my ReiserFS testing is the external journal.

XFS was the hands-down winner in my previous testing. Designed with multi-threading and aggressive memory management, XFS can sustain heavy workloads of many different operations.

It has many tunable options for both mkfs.xfs and mount, so the scripted options are the most complicated:

mkfs.xfs -f -l logdev=${LOGDEV},size=256m,lazy-count=1 \
-d agcount=16 ${PRIMARY}

One shortcoming of XFS is its lack of a pointer to an external journal device. As far as I can tell, it is the only journaled filesystem on Linux to have only a flag specifying whether the journal is internal or external.

If the journal is external, then the mount command must include a valid “logdev=” option, or the mount will fail.

I also expanded the mounted journal buffers, with “logbufs=8,logbsize=262144”. On my computer, memory management is faster than disk I/O.
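
The mount options described in this section can be collected in one helper. This is a minimal sketch, not the script from the tarball: the function name mount_opts and the default journal partition are hypothetical, and the ext4 entry assumes that “nobarrier” was the only addition, since the text says only that barriers were disabled for it.

```shell
#!/bin/sh
# Sketch: map a filesystem name to the mount options described in the text.
# mount_opts and the LOGDEV default are illustrative assumptions.

LOGDEV=${LOGDEV:-/dev/sda1}   # hypothetical journal partition

mount_opts() {
    base="noatime,nodiratime"
    case "$1" in
        btrfs)        echo "${base},nobarrier,space_cache" ;;
        ext2)         echo "${base}" ;;
        ext3)         echo "${base},data=writeback,nobh,commit=30" ;;
        # Assumption: the text only says barriers were disabled for ext4.
        ext4)         echo "${base},nobarrier" ;;
        jfs|reiserfs) echo "${base}" ;;
        # XFS needs logdev= at mount time when the journal is external.
        xfs)          echo "${base},logdev=${LOGDEV},logbufs=8,logbsize=262144" ;;
        *)            echo "unknown filesystem: $1" >&2; return 1 ;;
    esac
}

# Example (not executed here):
#   mount -t xfs -o "$(mount_opts xfs)" "${PRIMARY}" /mnt/test
```

Keeping the options in one place makes it harder to forget the “logdev=” option that XFS requires.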

Testing the elevators
The original testing was intended to show the effects of disk I/O elevators and CPU speed on the various filesystems, using medium and heavy I/O load conditions.

Since I ran the original tests, the “anticipatory” I/O elevator has been dropped from the Linux kernel, leaving only “noop”, “deadline”, and “cfq”. This round of testing still shows significant differences between them.
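
On kernels of this era, the elevator for a drive can be inspected and switched at run time through sysfs, where the kernel marks the active elevator with square brackets. A minimal sketch, assuming the sdb device from the system configuration; the helper names active_elevator and set_elevator are mine and are not taken from the tarball.

```shell
#!/bin/sh
# Sketch: inspect and switch the I/O elevator through sysfs.

# Read a scheduler line such as "noop deadline [cfq]" on stdin and
# print the bracketed (active) entry.
active_elevator() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Switch the elevator for one device (needs root). Not called here.
set_elevator() {
    dev=$1
    elv=$2
    echo "$elv" > "/sys/block/${dev}/queue/scheduler"
}

# Example (not executed here):
#   set_elevator sdb deadline
#   active_elevator < /sys/block/sdb/queue/scheduler
```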

With a 5-thread dbench load, I was surprised to see that ext2 was the consistent winner. Its lack of a journal makes for less overall disk I/O per operation, at the cost of a longer time to check the filesystem after an improper shutdown.

XFS came in a close second, at roughly 97% the performance of ext2.

The rest of the filesystems aren’t nearly as competitive. Even with their best elevators, JFS, ReiserFS, and btrfs have less than half the performance of ext2 or XFS.

When the load increases to 20 threads, XFS is once again the clear winner. Ext4 benefits in the overall ranking, thanks to extent-based allocation management, a trait it shares with XFS.

Ext2 falls to third place, probably due to the increased burden of managing block-based allocations. Ext3 again comes in fourth, with block-based allocations and added journal I/O.

The clear loser is once again JFS, coming in at only 40% under heavy load. (More on this later.)

Normalizing the throughput by a filesystem’s best elevator shows which filesystems benefit from which elevators.
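
The normalization itself is simple: divide each throughput by the best throughput in its group. A minimal sketch with awk, assuming input lines of the form “label value”; the function name normalize is hypothetical, and the numbers in the usage comment are made up for illustration, not results from these tests.

```shell
#!/bin/sh
# Sketch: scale each "label value" line so the largest value becomes 1.00.

normalize() {
    LC_ALL=C awk '
        # Remember every line and track the largest value seen.
        { label[NR] = $1; value[NR] = $2 + 0; if (value[NR] > max) max = value[NR] }
        END {
            if (max <= 0) exit 1          # nothing sensible to scale by
            for (i = 1; i <= NR; i++)
                printf "%s %.2f\n", label[i], value[i] / max
        }
    '
}

# Example with made-up numbers:
#   printf 'noop 50\ndeadline 100\ncfq 75\n' | normalize
```

With one group of lines per filesystem, the best elevator always lands at 1.00 and the others show up as fractions of it.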

Oddly, under a 5-process load, the only filesystem to benefit from “cfq” is JFS on a fast CPU. As seen above, that isn’t enough to make it a strong contender against XFS or any of the native Linux ext{2,3,4} filesystems.

Here is where the game has changed. The “cfq” elevator clashed badly with XFS three years ago; it is now mostly on par with “deadline” and “noop”.

The XFS developers have put a lot of work into cleaning up the internals, improving the integration of XFS with the Linux frameworks for VFS and disk I/O. They still have work to do, as explained in Documentation/filesystems/xfs-delayed-logging-design.txt.

At its best, ReiserFS had only about one-third the throughput of the best filesystem, in any tested configuration.

Some mount options could probably improve the throughput, but without clear guidance, I wasn’t going to test every combination to find the best.

Bandwidth saturation
I decided to run a separate series of tests, to see what process loads would saturate the various filesystems, and how they would scale after passing those saturation points.

Using their best elevators, I tested the throughput of each filesystem under loads from 1 to 10 processes.
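
A sweep like this can be scripted as a loop over the client count. A minimal sketch, assuming dbench 4.x, whose final summary line begins with “Throughput” followed by the figure in MB/sec; the mount point /mnt/test, the 60-second run time, and the function names are placeholders rather than the settings behind the charts.

```shell
#!/bin/sh
# Sketch: run dbench with 1..10 clients and print "clients MB/sec" pairs.

# Pull the MB/sec figure out of dbench's summary line, e.g.
#   Throughput 123.45 MB/sec  5 clients  5 procs  max_latency=67.89 ms
parse_throughput() {
    awk '$1 == "Throughput" { print $2 }'
}

# Not called here: needs dbench and a mounted test filesystem.
sweep() {
    dir=${1:-/mnt/test}    # placeholder mount point
    secs=${2:-60}          # placeholder run time in seconds
    n=1
    while [ "$n" -le 10 ]; do
        mbps=$(dbench -D "$dir" -t "$secs" "$n" | parse_throughput)
        echo "$n $mbps"
        n=$((n + 1))
    done
}
```

The output is one “clients MB/sec” pair per line, ready for plotting or for scaling against the peak.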

The two worst performers were JFS and ext2. JFS peaked at 3 processes, then dropped off badly, ending up at 33% of its best performance at 10 processes.

Ext2 didn’t suffer as badly, peaking at 5 processes, then falling only to 75% of its peak. Ext3, ext4, XFS, and ReiserFS didn’t suffer significantly under saturated load, staying mostly within a horizontal trend.

If I had to make a guess why JFS scales so poorly, I can only suppose that, following IBM’s philosophy, it’s better to be correct than to be fast.

A special surprise
Btrfs was something of a mystery, hitting a performance valley at 3 processes, then climbing steadily upward nearly to the end of the test.

Given that its raw number under 20 processes was better than its raw number under 10 processes, I decided to extend its test all the way to 50 processes, hoping to find its saturation point.

Btrfs managed to scale somewhat smoothly, all the way from 3 to 30 processes. Beyond that, its performance began to exhibit some noise, while still keeping an upward trend.

This is a very impressive development for a dual-core system. (For the math geeks, the trend line from 3 to 50 is f(x) = 0.34x^0.29, with coefficient of determination R^2 = 0.99.)
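
To see what that trend implies, it can be evaluated at a few process counts. A minimal sketch using awk, reading the trend line as 0.34 times x raised to the power 0.29 (my interpretation of the fitted curve); the function name trend is hypothetical, and the values are in the chart’s normalized units rather than MB/s.

```shell
#!/bin/sh
# Sketch: evaluate f(x) = 0.34 * x^0.29 for each process count given.

trend() {
    for x in "$@"; do
        LC_ALL=C awk -v x="$x" 'BEGIN { printf "%d %.2f\n", x, 0.34 * x ^ 0.29 }'
    done
}

# Example: trend 3 10 30 50
```

Because the exponent is below 1, each added process contributes less than the one before it, even though the curve keeps rising.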

The Linux filesystem landscape has changed a lot in the past three years, with two new contenders and lots of clean-up in the code. The conclusions of three years ago may not hold today.

Similarly, what’s true on my system may not be the case on yours. If you wish to examine your own system’s behavior, this tarball (CAPTCHA and 30-second wait required) contains the scripts I used for this article, as well as two PDFs with the raw and normalized results of my own testing.
