Wednesday, March 14, 2012

Linux File System -- Analyzing the Fsck Test Results


The results of our Linux file system fsck testing are in and posted, but the big questions remain: What do the results tell us, what do they mean, and is the performance expected? In this article we will take a look at the results, talk to some experts, and sift through the tea leaves for their significance.

Introduction

The Linux file system fsck test results article generated some comments and discussions that are addressed in this article. However, before we do so, let's review the reason for the testing and what we hoped to learn from it.
Almost a year ago, Henry Newman and I had a wonderful Cuban dinner and started talking about file systems and storage technology, particularly in Linux. We both want to see the Linux community succeed and thrive, but some of the signs of that happening were not very encouraging at the time. The officially supported file system limits from Red Hat were fairly small, with 100TB being the largest supported file system. We also talked about some of the possible issues and thought that one possible reason for the limitation was metadata scaling, particularly the amount of time needed to complete a file system check (fsck).
Henry and I speculated that one possible reason for Red Hat imposing supported file system size limitations was the amount of time needed to perform an fsck. (Note: These are supported file system limitations, not theoretical capacities.) Consequently, we decided to do some testing on larger file systems, 50TB and 100TB, which are fairly large capacities given the supported limits, with a large number of files. Our initial goal was up to 1 billion files.
The original fsck test plan was to test both ext4 and xfs with a varying number of files, from 10 million to 100 million, and two capacities for each file system: 40TB and 80TB for xfs, and 5TB and 10TB for ext4. The goal for ext4 was to stay below 16TB, since that was its capacity limitation when the first article was written.
The original source of hardware for testing could not, for various reasons, give us access to test hardware, and it was many months before Data Direct Networks (DDN) provided extended access for testing (thanks very much, DDN!). The details of the fsck testing and the results are explained in great detail in the previous article. Some of the details of the testing were changed from the original plan due to changes in the hardware and software.
Just to reiterate, our goal was to test the fsck rate of the file systems; we were very curious about how quickly an fsck could be performed. Consequently, we filled a file system using fs_mark with a specified number of directories (only one layer deep), a specified number of files, and a specified file size. This tool has been used in other fsck studies (see the subsequent section). Then the file systems were unmounted, and the respective fsck was run. Since there was no damage to the file system, the time to complete a file system check was expected to be as short as possible (i.e., the fastest possible metadata rates).
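To make the procedure concrete, here is a minimal sketch, in Python, of the fill-then-check sequence described above. This is not the harness we actually used; the device and mount point are placeholders, and the fs_mark flags shown (-d for the target directory, -n for files per thread, -s for the file size in bytes, -t for the number of threads) are the commonly used ones, so adjust everything for your own setup.

    import subprocess
    import time

    DEVICE = "/dev/sdX1"        # placeholder block device holding the test file system
    MOUNT_POINT = "/mnt/test"   # placeholder mount point

    def run(cmd):
        # Echo and run a command, failing loudly on any error.
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    def fill_with_fs_mark(num_files, file_size, threads=4):
        # fs_mark options used here: -d target directory, -n files per thread,
        # -s file size in bytes, -t number of threads.
        run(["fs_mark", "-d", MOUNT_POINT + "/dir1",
             "-n", str(num_files // threads),
             "-s", str(file_size),
             "-t", str(threads)])

    def timed_check(fstype):
        # Unmount, then time a full check of the clean file system.
        run(["umount", MOUNT_POINT])
        start = time.time()
        if fstype == "xfs":
            run(["xfs_repair", DEVICE])
        else:
            run(["e2fsck", "-f", "-y", DEVICE])  # -f forces a check of a clean file system
        return time.time() - start

    if __name__ == "__main__":
        fill_with_fs_mark(num_files=10_000_000, file_size=4096)
        print(f"check wall time: {timed_check('xfs'):.0f} s")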
The reason we chose to perform the fsck testing in this manner is that artificially damaging or fragmenting the file system is arbitrary: the results would apply only to the specific details of that damage or fragmentation process and would tell us very little about the fsck metadata performance of these file systems in other cases. If you will, the testing tells us the best possible (shortest) fsck time.
Since interpreting these results had much to do with the inner workings of some of the fsck tools, I reached out to David Chinner, one of the lead developers for xfs. He also happens to be employed at Red Hat and is an all-round file system kernel guy. He seemed the ideal person to contact for help in interpreting the results.

Analysis of FSCK Results

The results presented in the previous article were just the raw results of how much time it took to complete the file system check. I will be examining the results from the previous article, except for the case labeled "fragmented," because the results for that case looked strangely out of line with the rest (an outlier). After discussing it with David Chinner, I decided to drop that case from further analysis.
One obvious question this data raises is: at what rate, in files per second, were files processed during the file system check? Table 1 reproduces the data from the fsck test results article, with the number of files per second touched during the fsck shown in parentheses below each fsck time. Recall that the testing used CentOS 5.7 and a 2.6.18-274.el5 kernel.
Table 1: FSCK times for the various file system sizes and numbers of files, for the xfs and ext4 file systems. The fsck rate is shown in parentheses below each fsck time.
File System   Number of Files   XFS - xfs_repair      ext4 - fsck
Size (TB)     (millions)        time (secs)           time (secs)
72            105               1,629                 3,193
                                (64,456.7 files/s)    (32,884.4 files/s)
72            51                534                   1,811
                                (95,505.6 files/s)    (28,161.2 files/s)
72            10.2              161                   972
                                (63,354.0 files/s)    (10,493.8 files/s)
38            105               710                   3,372
                                (147,887.3 files/s)   (31,138.8 files/s)
38            51                266                   1,358
                                (191,729.3 files/s)   (37,555.2 files/s)
38            10.2              131                   470
                                (77,862.6 files/s)    (21,702.1 files/s)
72            415               11,324                NA
                                (36,647.8 files/s)
The fastest fsck rate is for the case with 51 million files and a 38TB xfs file system (191,729.3 files/s). The slowest rate is for the case of 10.2 million files and a 72TB ext4 file system (10,493.8 files/s).
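The rates in Table 1 are simply the number of files divided by the wall-clock check time. The short Python sketch below reproduces that arithmetic from the raw times, in case you want to rerun it against your own measurements.

    # Reproduce the files-per-second rates in Table 1 from the raw fsck times.
    # Each entry: (size in TB, millions of files, xfs_repair secs, e2fsck secs or None).
    runs = [
        (72, 105,  1629, 3193),
        (72, 51,    534, 1811),
        (72, 10.2,  161,  972),
        (38, 105,   710, 3372),
        (38, 51,    266, 1358),
        (38, 10.2,  131,  470),
        (72, 415, 11324, None),
    ]

    for size_tb, mfiles, xfs_secs, ext4_secs in runs:
        nfiles = mfiles * 1_000_000
        xfs_rate = nfiles / xfs_secs
        ext4_rate = nfiles / ext4_secs if ext4_secs else float("nan")
        print(f"{size_tb:>3}TB, {mfiles:>6}M files: "
              f"xfs {xfs_rate:>10,.1f} files/s, ext4 {ext4_rate:>10,.1f} files/s")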
In looking at the data, I have made some general observations about the results.
  • For these tests you can easily see an order of magnitude difference in the rate of files processed during the file system check.
  • The fsck for ext4 is slower than for xfs.
  • In general, for this small number of tests, the rate of files processed during the fsck for ext4 improved as the number of files increased. For xfs, the trend is not as consistent, but overall the rate also generally improved as the number of files increased.
  • All of the file system checks finished in less than four hours (an unwritten goal of the original study).
During most of the fsck tests, the server did not swap. This was checked at various times during each fsck using vmstat. However, for the case of 415 million files on the 72TB xfs file system, it does appear that the server swapped at some point and that our periodic checks missed it. This is evident from the large drop in fsck rate compared to the other cases.
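As an aside, a few manual spot checks with vmstat can miss a short burst of swapping, as apparently happened here. A lightweight alternative, sketched below (not something we ran during these tests), is to poll the kernel's swap counters in /proc/vmstat, which expose the same pswpin/pswpout numbers that vmstat reports, for the entire duration of the run.

    import time

    def read_swap_counters():
        # Return (pswpin, pswpout) from /proc/vmstat: pages swapped in and out.
        counters = {}
        with open("/proc/vmstat") as f:
            for line in f:
                key, value = line.split()
                counters[key] = int(value)
        return counters.get("pswpin", 0), counters.get("pswpout", 0)

    def watch_swap(interval=5):
        # Print a warning whenever any swap activity occurs between polls.
        last_in, last_out = read_swap_counters()
        while True:
            time.sleep(interval)
            cur_in, cur_out = read_swap_counters()
            if cur_in > last_in or cur_out > last_out:
                print(f"swap activity: +{cur_in - last_in} pages in, "
                      f"+{cur_out - last_out} pages out")
            last_in, last_out = cur_in, cur_out

    if __name__ == "__main__":
        watch_swap()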
Dave Chinner suggested another way to check for trends in the data: looking at the file rate for the same inode count but different file system sizes for xfs. This data is shown in Table 2.
Table 2: Difference in File Processing Rate for XFS for the Two File System Sizes.
inodes   38TB (files/s)   72TB (files/s)   diff
10.2M    77,862.6         63,354.0         -18.6%
51M      191,729.3        95,505.6         -50.2%
105M     147,887.3        64,456.7         -56.4%
Dave's comment about the results is that as the file system size was roughly doubled (from about 38TB to 72TB), the rate of files processed dropped by about 50 percent for the larger file counts. The larger file system has correspondingly more allocation groups (AGs), which results in inodes being spread over a larger physical area. This larger physical area containing inodes means that the average seek time to read the inodes increases. Hence, the processing rate goes down due to the longer IO latencies, and the overall change in file rate isn't surprising. In David's words, "Large file systems mean more locations that [inodes are] spread across, which means more seeks to read them all ..."
The same data for ext4 is in Table 3 below:
Table 3: Difference in File Processing Rate for ext4 for the Two File System Sizes.
inodes   38TB (files/s)   72TB (files/s)   diff
10.2M    21,702.1         10,493.8         -51.6%
51M      37,555.2         28,161.2         -25.0%
105M     31,138.8         32,884.4         +5.6%
Notice that, except for the smallest file count, the difference in file rate between the 72TB and 38TB file system sizes is much smaller than it is for xfs. That is because ext4 preallocates the inode space in known areas and should use it in exactly the same pattern, with the higher regions not being used at all in any of the configurations, because we used only 50 percent of the file system capacity and there was no fragmentation.
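For reference, the "diff" columns in Tables 2 and 3 are just the 72TB rate relative to the 38TB rate for the same inode count. The sketch below recomputes them from the Table 1 rates.

    # Recompute the "diff" columns of Tables 2 and 3 from the Table 1 rates.
    rates = {
        # (file system, millions of files): (files/s at 38TB, files/s at 72TB)
        ("xfs",  10.2): (77_862.6,  63_354.0),
        ("xfs",  51):   (191_729.3, 95_505.6),
        ("xfs",  105):  (147_887.3, 64_456.7),
        ("ext4", 10.2): (21_702.1,  10_493.8),
        ("ext4", 51):   (37_555.2,  28_161.2),
        ("ext4", 105):  (31_138.8,  32_884.4),
    }

    for (fs, mfiles), (r38, r72) in rates.items():
        diff = (r72 - r38) / r38 * 100.0
        print(f"{fs:>4}, {mfiles:>5}M files: {diff:+6.1f}% going from 38TB to 72TB")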


Comparison to Other Tests

To understand whether these rates for files processed during the fsck are comparable to other tests, Table 4 takes data from presentations by Ric Wheeler of Red Hat. Ric has done several presentations about putting 1 billion files in a file system with various performance tests. Table 4 contains a quick description of each test, including the hardware and distribution if available, the number of files in the fsck tests, and the fsck time for various file systems.
Table 4: FSCK times for the various configurations, numbers of files, and file systems from Ric Wheeler's presentations. The fsck rate is shown in parentheses below each fsck time.
Configuration                        Number of Files   ext3 - fsck          ext4 - fsck           XFS - xfs_repair      btrfs - fsck
                                     (millions)        time (secs)          time (secs)           time (secs)           time (secs)
1 SATA drive (2010),                 1                 1,070                40                    40                    90
unknown OS and kernel                                  (934.6 files/s)      (25,000 files/s)      (25,000 files/s)      (11,111.1 files/s)
1 PCIe drive (2010),                 1                 70                   3                     4                     11
unknown OS and kernel                                  (14,285.7 files/s)   (333,333.3 files/s)   (250,000 files/s)     (90,909.1 files/s)
1 SATA drive (2011),                 1,000             NA                   3,600                 54,000                NA
RHEL 6.1 alpha,                      (zero length)                          (277,777.8 files/s)   (18,518.5 files/s)
2.6.38.3-18 kernel
16TB, 12 SAS drives,                 1,000             NA                   5,400                 33,120                NA
hardware RAID (2011),                (zero length)                          (185,185.2 files/s)   (30,193.2 files/s)
RHEL 6.1 alpha,
2.6.38.3-18 kernel
The rate of files processed in the file system check for xfs ranged from 18,518.5 files/s, for a single drive with an alpha version of RHEL 6.1 (2.6.38.3-18 kernel) and 1 billion zero-length files, to 250,000 files/s for a single PCIe drive with an unknown distribution (presumably Fedora or Red Hat) and 1 million files. But the more comparable result is 30,193.2 files/s for the case with 12 SAS drives and a hardware RAID controller, with 1 billion zero-length files, using an alpha version of RHEL 6.1 with a 2.6.38.3-18 kernel.
According to David Chinner, who did much of the above testing, the low file processing rates are a result of the limited memory in the host system: for the next-to-last case there was only 2GB in the system, and for the last case only 8GB. In reviewing this article, David went on to say:
What you see here in these last two entries is the effect of having limited RAM to run xfs_repair. They are 2GB for the single SATA case, and 8GB for the SAS RAID case. In each case, there isn't enough RAM for xfs_repair to run in multi-threaded mode, so it is running in its old, slow, single-threaded mode that doesn't do any prefetching at all. If it had 24GB RAM like the tests you've run, the performance would have been similar to what you have achieved.
Since neither case has enough memory for multi-threaded operation, xfs_repair resorted to single-threaded behavior, which limited performance. According to David, more memory allows multi-threaded operation and prefetching in xfs_repair, which greatly improves performance. If you want a fast repair, at least in the case of xfs, add more memory to the host node.
On the other hand, ext4 had a much higher rate of files processed during the fsck. The performance ranged from 25,000 files/s, for a single drive with an unknown distribution (presumably Fedora or Red Hat) and 1 million files, to 333,333.3 files/s for a single PCIe drive with an unknown distribution (presumably Fedora or Red Hat) and 1 million files. But the more comparable result is 185,185.2 files/s for the case with 12 SAS drives and a hardware RAID controller, with 1 billion zero-length files, using an alpha version of RHEL 6.1 with a 2.6.38.3-18 kernel. Notice, however, that the rate of files processed during the file system check decreases as the number of disks increases (compare the third-row results with the fourth-row results for ext4).
In addition, David commented on the ext4 results by stating the following:
300,000 files/s is about 75MB/s in sequential IO, easily within the reach of a single drive. But it didn't go any faster on a large, faster RAID storage capable of 750MB/s for sequential read IO, which indicates that e2fsck is either CPU bound or sequential IO request bound, i.e. it's been optimized for the single disk case and can't really make use of the capabilities of RAID storage. Indeed, it went slower, most likely because there is a higher per-IO CPU cost for the RAID storage (more layers, higher IO latency).
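David's 75MB/s figure follows directly from the 256-byte inode size that is the ext4 default (an assumption here, since mkfs options can change it). The short check below shows the arithmetic.

    # Back-of-the-envelope check of the 75MB/s figure quoted above,
    # assuming the default 256-byte ext4 inode size.
    files_per_second = 300_000
    inode_size_bytes = 256
    print(f"{files_per_second * inode_size_bytes / 1e6:.0f} MB/s")  # ~77 MB/s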

Expert Input

To determine whether these file rates were in the ballpark, we contacted Red Hat for comment. Dave Chinner had a few comments about the test results. When asked whether the general results were plausible, David said the following:
... the numbers are entirely possible. 600,000 inodes/s only requires about 1GB/s of IO throughput to achieve, and the DDN you tested on is more than capable of this ...
He also commented on the xfs_repair rates relative to e2fsck:
Xfs_repair does extensive readahead itself, and some of the methods it uses are very effective on large RAID arrays, so I would expect it to be faster than e2fsck for large-scale file systems ...
Then we asked him what he thought about the results when comparing xfs to ext4 in terms of fsck performance. In our results, xfs_repair was about 2-8x faster than e2fsck, while in some of the talks from Ric Wheeler mentioned previously, xfs_repair was anywhere from 9x to 40x faster. Dave had this to say:
The difference in speed with xfs_repair depends on the density and distribution of the inodes and directory metadata. When you have zero-length files, metadata is very dense, and xfs_repair will tend to do very large IOs and run at hardware bandwidth (not IOPS) speed and be CPU bound processing all the incoming metadata.
The basic optimization premise is that if the metadata is dense enough, we do large IOs reading both data and metadata, and then chop it up in memory into metadata buffers for checking, throwing away the data. This is based on the observation that it takes less IO and CPU time to do a 2MB IO and chop it up in memory than it does to seek 50 times to read 50x4k blocks in that 2MB window.
For less dense distributions (like with your larger files), the amount of IO per inode or directory block increases, and therefore the speedup from those optimizations is not as great. In most aged file systems, however, the metadata distribution is quite dense (it naturally gets separated from the data) and so in general those optimizations result in a good speedup compared to reading metadata blocks individually.
When asked if the file system check times and file processing rates looked good, Chinner responded:
Yes, they are in the ballpark of what I'd expect. The latest version of xfs_repair also has some more memory usage reductions and optimizations that might also help improve large file system repair performance.
When asked about estimates of performance, David stated that, with the DDN hardware used, we could reach about 600,000 inodes/s, or about 1 GB/s. He explained the 1 GB/s estimate:
It was a rough measurement based on typical inode densities I've seen fs_mark-like workloads produce.
By default, inodes are 256 bytes in size, packed into contiguous chunks of 64 inodes. So, it takes a 16k IO to read a single chunk of 64 inodes. If we have to read 10,000 inode chunks (640,000 inodes), it should only require reading 160MB of metadata. So the absolute minimum bandwidth from storage to read 600,000 inodes/s from disk is around 160MB/s.
But inodes typically aren't that densely populated because there will often be directory and data blocks between inode chunks. So, if we have a 50 percent inode chunk density, xfs_repair will do large reads and discard the 50 percent of the space it reads (i.e., stuff that isn't metadata). Now we are at 320MB/s.
If we have typical small file inode densities, we'll be discarding about 85 percent of what we read in. So at a 1GB/s raw data read rate, we'd be pulling in roughly 150MB/s of inodes, or roughly 600,000 inodes/second ...
Now if we were reading those inodes in separate IOs, we'd need to be doing roughly 20,000 IOPS (inodes are read/written in 8k cluster buffers, not 16k chunks). This is the effect of the bandwidth vs IOPS trade-off we use to speed the reading of inodes into memory.
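Dave's estimate is easy to re-derive. Assuming the XFS defaults he cites (256-byte inodes packed 64 to a contiguous chunk), the raw read bandwidth needed for a given inode scan rate scales with the inverse of the inode chunk density, since everything that is not metadata gets read and thrown away. A small sketch:

    # Rough re-derivation of the bandwidth estimate above, assuming the
    # quoted XFS defaults: 256-byte inodes packed 64 to a contiguous chunk.
    INODE_SIZE = 256                              # bytes per inode
    INODES_PER_CHUNK = 64                         # inodes per chunk
    CHUNK_SIZE = INODE_SIZE * INODES_PER_CHUNK    # 16KiB read per chunk

    def raw_bandwidth_needed(inodes_per_sec, chunk_density):
        # If only `chunk_density` of what is read is actually inode chunks,
        # the rest is read and discarded, inflating the raw bandwidth needed.
        chunks_per_sec = inodes_per_sec / INODES_PER_CHUNK
        return chunks_per_sec * CHUNK_SIZE / chunk_density   # bytes/s

    for density in (1.0, 0.5, 0.15):
        bw = raw_bandwidth_needed(600_000, density)
        print(f"inode chunk density {density:4.0%}: ~{bw / 1e6:,.0f} MB/s raw read bandwidth")

This reproduces, to within rounding, the roughly 160MB/s, 320MB/s, and 1GB/s figures in the quote.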
When asked for some final comments, Dave said:
e2fsck doesn't optimize its IO for RAID arrays. Its performance comes from being able to do all its metadata IO sequentially because it is mostly in known places (inodes, free space, etc). XFS dynamically allocates all its metadata, so it needs to be more sophisticated to scale well.
Also, you might want to try the ag_stride option to xfs_repair to further increase parallelism if it isn't already IO or CPU bound. That can make it go quite a bit faster.
It is probably also worth checking to see if you have enough memory for xfs_repair to cache all its metadata in memory. The same metadata needs to be read in phases 3, 4 and 5, so if it can be cached in phase 3, then phases 4 and 5 run at CPU speed rather than IO speed, and that can significantly improve runtime ...
There is more information in the talk I did all about this at LCA in 2008. Specifically, slides 21 onwards show the breakdown of time spent in each phase as the number of inodes in the file system increases, slide 28 shows the effect of memory vs. the number of inodes, and slides 33 onwards show the effect of ag_stride on performance on a 300M inode file system.
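For anyone who wants to experiment with the ag_stride suggestion, the sketch below shows one way to wrap and time such runs. The device path is a placeholder, the file system must be unmounted, and the stride values are arbitrary examples; check xfs_repair(8) on your system for the exact -o options your version supports. As Dave notes, the extra parallelism only helps if the run is not already IO or CPU bound.

    import subprocess
    import time

    DEVICE = "/dev/sdX1"   # placeholder: an unmounted XFS file system

    def timed_xfs_repair(device, ag_stride=None):
        # Time one xfs_repair run, optionally passing ag_stride via -o so that
        # more allocation groups are processed concurrently (see xfs_repair(8)).
        cmd = ["xfs_repair"]
        if ag_stride is not None:
            cmd += ["-o", f"ag_stride={ag_stride}"]
        cmd.append(device)
        start = time.time()
        subprocess.run(cmd, check=True)
        return time.time() - start

    if __name__ == "__main__":
        for stride in (None, 8, 16):   # arbitrary example stride values
            secs = timed_xfs_repair(DEVICE, stride)
            label = f"ag_stride={stride}" if stride is not None else "default"
            print(f"{label}: {secs:.0f} s")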
David also made a comment about the original suppositions that led to this article:
Repair scalability is not really an issue -- the problem is that finding the root cause of problems gets exponentially harder as file system size increases. So if you double your supported file system size, expect to spend four times the resources testing and supporting it. You can work out the business case from there ;)

Final Comments

Henry and I are both fans of Linux and want to see it succeed in every possible way, particularly in the HPC world, where we both spend a great deal of time. We were disappointed to see that Red Hat supports Linux file systems only to 100TB. We knew that a number of key file system developers, such as Ric Wheeler, Dave Chinner, Eric Sandeen, Christoph Hellwig and Theodore Ts'o, to name a few, were working very hard to improve the scalability of the major Linux file systems. Based on these supported limitations, we decided to do some testing around metadata performance as measured by a file system check.
The tests we developed are designed to be repeatable without being too specific to a particular fragmentation or file system damage pattern. The times may be on the optimistic side, since no damage has to be repaired, but they give you something of a lower bound (a best case) on file system check times. David Chinner commented on this:
When there is damage, all bets are off. A file system that takes 15 minutes to check when there is no damage can take hours or days to repair when there is severe damage. Even minor damage can blow out repair times significantly. Not just the time it takes, but also the RAM required for repair to run to completion ...
Thus, trying to create "repeatable" file system repair tests is difficult at best.
The results indicate that the times to complete a file system check are within accepted norms. They also indicate that the metadata rates of xfs and ext4 are within what we call a "good range." The reason they are "good" is that the file system checks finish in less than a few hours, which is a very acceptable time for most admins.
We sincerely hope these are the first steps along the way toward better testing and development of Linux file systems. Developing tests that illustrate both the good and not so good aspects of file system behavior can only help the file systems get stronger. For example, the rather poor metadata performance of xfs drove Dave Chinner to focus on metadata development. We encourage vendors and the community to continue testing of the file systems, particularly larger scale testing since data never shrinks.
We hope to contribute to this testing as time allows, but if you are a vendor and have some hardware available for testing for a few weeks, we would love to collaborate or help in any way with testing.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.

