Monday, September 21, 2009

I Feel the Need for Speed: Linux File System Throughput Performance, Part 1

While metadata performance is important, another critical metric for measuring file systems is throughput. We put three Linux file systems through their paces with IOzone.

In two previous articles (here and here) we explored the metadata performance of a number of Linux file systems using a single micro-benchmark: fdtree.

fdtree as a micro-benchmark is very attractive because it is a simple bash script that uses recursion, forcing all cores to be used (extremely important with modern processors). It tests the ability of the file system to simply create directories and files in a tree-structure.

The file systems tested typically used their default options (except for ext3 and ext4) so tuning the file systems for this specific benchmark was not tested.

This article shifts from looking at metadata performance to examining data performance (sometimes referred to as throughput). However, we’ll start slow by first looking at one fairly common micro-benchmark: IOzone.

IOzone is a generally well-known and useful benchmark used to test data throughput and features a number of data access patterns and tuning options. The access patterns follow a range of applications and can be very useful for finding hotspots or bottlenecks even on deployed solutions.

As with the metadata benchmarks previously mentioned, the purpose of this study is not to compare file systems and pick the “best” one (insert your definition of “best”). Rather, this study is an exploration of the performance of various Linux file systems using a single throughput benchmark.

The focus of this article is to explore how Linux file systems perform when IOzone is used for a subset of its many options.


IOzone is open-source and written in ANSI C. It is capable of single-threaded, multi-threaded, and multi-client testing. The basic idea behind IOzone is to break up a file of a given size into records. Records are written or read in some fashion until the file size is reached. Using this concept, IOzone has a number of tests that can be performed:
  • Write
    This is a fairly simple test that simulates writing to a new file. Because of the need to create new metadata for the file, many times the writing of a new file can be slower than rewriting to an existing file. The file is written using records of a specific length (either specified by the user or chosen automatically by IOzone) until the total file length has been reached.
  • Re-write
    This test is similar to the write test but measures the performance of writing to a file that already exists. Since the file already exists and the metadata is present, the re-write performance is commonly expected to be greater than the write performance. This particular test opens the file, puts the file pointer at the beginning of the file, and then writes to the open file descriptor using records of a specified length until the total file size is reached. Then it closes the file, which updates the metadata.
  • Read
    This test reads an existing file. It reads the entire file, one record at a time.
  • Re-read
    This test reads a file that was recently read. This test is useful because operating systems and file systems will maintain parts of a recently read file in cache. Consequently, re-read performance should be better than read performance because of the cache effects. However, sometimes the cache effect can be mitigated by making the file much larger than the amount of memory in the system.
  • Random Read
    This test reads a file with the accesses being made to random locations within the file. The reads are done in record units until the total reads are the size of the file. The performance of this test is impacted by many factors including the OS cache(s), the number of disks and their configuration, disk seek latency, and disk cache among others.
  • Random Write
    The random write test measures the performance when writing a file with the accesses being made to random locations within the file. The file is opened to the total file size and then the data is written in record-sized units to random locations within the file.
  • Backwards Read
    This is a unique file system test that reads a file backwards. There are several applications, notably, MSC Nastran, that read files backwards. There are some file systems and even OS’s that can detect this type of access pattern and enhance the performance of the access. In this test a file is opened and the file pointer is moved 1 record forward and then the file is read backward one record. Then the file pointer is moved 2 records forward in the file, and the process continues.
  • Record Rewrite
    This test measures the performance when writing and re-writing a particular spot within a file. The test is interesting because it can highlight “hot spot” capabilities within a file system and/or an OS. If the spot is small enough to fit into the various caches (CPU data cache, TLB, OS cache, file system cache, etc.), then the performance will be very good.
  • Strided Read
    This test reads a file in what is called a strided manner. For example, you could read data starting at a file offset of zero, for a length of 4 KB, then seek 200 KB forward, then read for 4 KB, then seek 200 KB, and so on. The constant pattern is important and the “distance” between the reads is called the stride (in this simple example it is 200 KB). This access pattern is used by many applications that are reading certain data structures. This test can highlight interesting issues in file systems and storage because the stride could cause the data to miss any striping in a RAID configuration, resulting in poor performance.
  • Fwrite
    This test measures the performance of writing a file using a library function “fwrite()”. It is a binary stream function (examine the man pages on your system to learn more). Equally important, the routine performs a buffered write operation. This buffer is in user space (i.e. not part of the system caches). This test is performed with a record length buffer being created in a user-space buffer and then written to the file. This is repeated until the entire file is created. This test is similar to the “write” test in that it creates a new file, possibly stressing the metadata performance.
  • Frewrite
    This test is similar to the “rewrite” test but using the fwrite() library function. Ideally the performance should be better than “Fwrite” because it uses an existing file so the metadata performance is not stressed in this case.
  • Fread
    This is a test that uses the fread() library function to read a file. It opens a file, and reads it in record lengths into a buffer that is in user space. This continues until the entire file is read.
  • Freread
    This test is similar to the “reread” test but uses the “fread()” library function. It reads a recently read file which may allow file system or OS cache buffers to be used, improving performance.
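
To make the access patterns in the list above more concrete, the following sketch generates the byte offsets at which records would be read for the sequential, backwards, and strided cases. It only illustrates the patterns as described here; it is not how IOzone itself is implemented, the function names are invented for this example, and the backwards case is simplified to visiting the same records in reverse order.

```python
KB = 1024

def sequential_offsets(file_size, record_size):
    """Offsets for reading a file front to back, one record at a time."""
    return list(range(0, file_size, record_size))

def backwards_offsets(file_size, record_size):
    """Offsets for visiting the same records from the end of the file to the start."""
    return sequential_offsets(file_size, record_size)[::-1]

def strided_offsets(file_size, record_size, seek_distance):
    """Read one record, seek forward by seek_distance, and repeat."""
    offsets = []
    position = 0
    while position + record_size <= file_size:
        offsets.append(position)
        # After the read the pointer sits at the end of the record,
        # so the next read starts record_size + seek_distance later.
        position += record_size + seek_distance
    return offsets

if __name__ == "__main__":
    # The strided example from the text: 4 KB reads separated by 200 KB seeks,
    # applied to a hypothetical 1 MB file.
    for offset in strided_offsets(1024 * KB, 4 * KB, 200 * KB):
        print(offset // KB, "KB")
```

With a fixed seek distance, the read offsets are evenly spaced, which is why a stride that lines up badly with a RAID stripe size can keep hitting the same disks.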

There are other options that can be tested, but for this exploration only the previously mentioned tests will be examined. However, even this list of tests is fairly extensive and covers a large number of application access patterns that you are likely to see (but not all of them).


The tests were run on the same system as the metadata tests. The highlights of the system are:
  • GigaByte MAA78GM-US2H motherboard
  • An AMD Phenom II X4 920 CPU
  • 8GB of memory (DDR2-800)
  • Linux 2.6.30 kernel (with reiser4 patches only)
  • The OS and boot drive are on an IBM DTLA-307020 (20GB drive at Ultra ATA/100)
  • /home is on a Seagate ST1360827AS
  • There are two drives for testing. They are Seagate ST3500641AS-RK with 16 MB cache each. These are /dev/sdb and /dev/sdc.
Only the first Seagate drive was used, /dev/sdb, for all of the tests.

For IOzone the system specifications are fairly important. In particular, the amount of memory is important because this can have a large impact on the caching effects.

If the problem sizes are small enough to fit into the system or file system cache (or at least partially), it can skew results. Comparing the results of one system where the cache effects are fairly large to a system where cache effects are not large, is comparing the proverbial apples to oranges.

For example, if you run the same problem size on a system with 1GB of memory versus a system with 8GB, you will produce very different results. Comparing them is comparing two completely different sets of results.


As mentioned previously there are a huge number of options available with IOzone (that is one reason it is so popular and powerful). For this exploration, the basic tests that are run are: write, re-write, read, re-read, random read, random write, backwards read, record re-write, strided read, fwrite, frewrite, fread, and freread.

One of the most important considerations for this test is whether cache effects are to be considered in the results or not. Including cache effects in the results can be very useful because it can point out certain aspects of the OS and file system cache sizes and how the caches function and affect performance.

On the other hand, including cache effects limits the usefulness of the data in comparison to other results.
For this article, cache effects will be limited as much as possible so that the impact of the file system designs on performance can be better observed.

Cache effects can’t be eliminated entirely without running extremely large problems and forcing the OS to eliminate all caches.

However, it is almost impossible to eliminate the hardware caches such as those in the CPU, so trying to eliminate all cache effects is virtually impossible (but never say never).

But one way to minimize the cache effects is to make the file size much bigger than the main memory. For this article, the file size is chosen to be 16GB, which is twice the size of main memory. This is chosen somewhat arbitrarily, based on experience and some urban legends (“use a file size twice as big as main memory”).
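
As a rough illustration of this rule of thumb, the sketch below derives a test file size from the amount of installed memory. It assumes a Linux system whose /proc/meminfo contains a MemTotal line reported in kB; the function names and the sample text are made up for this example, and the factor of two is simply the rule of thumb quoted above, not a recommendation from IOzone.

```python
def mem_total_kb(meminfo_text):
    """Return the MemTotal value (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal line not found")

def suggested_file_size_kb(meminfo_text, factor=2):
    """Apply the 'file size = factor x main memory' rule of thumb."""
    return factor * mem_total_kb(meminfo_text)

if __name__ == "__main__":
    # Hypothetical sample; on a real system, read the text from /proc/meminfo.
    sample = "MemTotal:        8388608 kB\nMemFree:         1024000 kB\n"
    print(suggested_file_size_kb(sample), "kB")
```

Note that MemTotal is usually a little smaller than the physical memory installed, so rounding the result up to a convenient size (16GB here) is reasonable.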

Recall that most of the IOzone tests break up a file into records of a specific length. For example, a 1GB file can be broken into 1MB records, so there are a total of roughly 1,000 records in the file (1,024 if both sizes are counted in binary units).

IOzone can either run an automatic sweep of record sizes or the user can fix the record size. If done automatically IOzone starts at 1KB (1,024 bytes) and then doubles the record size until it reaches a maximum of 16 MB (16,777,216 bytes).

Optionally, the user can specify the lower record size and the upper record size and IOzone will vary the record sizes in between.

For this article, with a 16GB file and a 1KB record size, roughly 16,000,000 records would be used for each of the 13 tests. The run times for such a test are very large. Following good benchmarking practice, where each test is run at least 10 times, the total run time would be so large that, perhaps, only 1 benchmark every 2-4 weeks could be published.

Consequently, to meet editorial deadlines (and you don’t want to be late for the editor), the record sizes will be larger. For this article, only four record sizes are tested: (1) 1MB, (2) 4MB, (3) 8MB, and (4) 16MB.

For a file size of 16GB, that is roughly (1) 16,000 records, (2) 4,000 records, (3) 2,000 records, and (4) 1,000 records, respectively. These record sizes and record counts correspond to a number of applications, so they produce relevant results.
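
The record counts quoted above are round decimal approximations. A minimal sketch of the arithmetic, assuming the size suffixes are interpreted as binary units (1M = 1,024 KB, 1G = 1,024 MB), is shown below; the helper names are made up for this example.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def to_bytes(size):
    """Convert a string such as '16G' or '1M' to bytes (binary units assumed)."""
    return int(size[:-1]) * UNITS[size[-1].upper()]

def record_count(file_size, record_size):
    """Number of whole records that fit in the file."""
    return to_bytes(file_size) // to_bytes(record_size)

if __name__ == "__main__":
    for record_size in ("1M", "4M", "8M", "16M"):
        print(record_size, record_count("16G", record_size))
```

Under that assumption the four record sizes give 16,384, 4,096, 2,048, and 1,024 records, which is where the rounded figures come from.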

The specific command lines used are listed here. The command line for the first record size (1MB) is,
./iozone -Rb spreadsheet_output_1M.wks -s 16G -r 1M > output_1M.txt
The command line for the second record size (4MB) is,
./iozone -Rb spreadsheet_output_4M.wks -s 16G -r 4M > output_4M.txt
The command line for the third record size (8MB) is,
./iozone -Rb spreadsheet_output_8M.wks -s 16G -r 8M > output_8M.txt
The command line for the fourth record size (16MB) is,
./iozone -Rb spreadsheet_output_16M.wks -s 16G -r 16M > output_16M.txt
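
Since the four command lines differ only in the record size, they can be generated rather than typed by hand. The sketch below builds the same strings using only the options that appear above (-Rb, -s, -r); the function and constant names are invented for this example, and it prints the commands rather than running them.

```python
RECORD_SIZES = ["1M", "4M", "8M", "16M"]
FILE_SIZE = "16G"

def iozone_command(record_size, file_size=FILE_SIZE):
    """Build one command line in the same form as the four listed above."""
    return (
        f"./iozone -Rb spreadsheet_output_{record_size}.wks "
        f"-s {file_size} -r {record_size} > output_{record_size}.txt"
    )

if __name__ == "__main__":
    for size in RECORD_SIZES:
        print(iozone_command(size))
```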

As mentioned previously, there are 13 tests that are each run 10 times, and there are 4 record sizes. This makes a total of 520 test runs per file system (13 x 10 x 4). To get an idea of the run time, Table 1 below lists the average run times in seconds and the standard deviations (listed below each average) for each of the file systems tested. These times include the time to run all 4 record sizes (the individual tests were not timed separately).

Table 1 - Run times (secs) for Testing
File System   Time (secs)
Reiser4          9,628.00
  std dev            5.27
ext3            12,784.90
  std dev           91.14
ext4             9,826.30
  std dev           36.57

The standard deviations for these three file systems are very low compared to the averages (less than 1%), showing the repeatability of the tests. The tests also satisfy the general benchmark requirement of running more than a few seconds or at least a minute.
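
The "less than 1%" figure can be checked directly from Table 1 by dividing each standard deviation by its average. The short sketch below does this with the values copied from the table; the names are made up for this example.

```python
# Average run time and standard deviation in seconds, copied from Table 1.
RUN_TIMES = {
    "Reiser4": (9628.00, 5.27),
    "ext3": (12784.90, 91.14),
    "ext4": (9826.30, 36.57),
}

def relative_std_percent(average, std_dev):
    """Standard deviation expressed as a percentage of the average."""
    return 100.0 * std_dev / average

if __name__ == "__main__":
    for name, (average, std_dev) in RUN_TIMES.items():
        print(f"{name}: {relative_std_percent(average, std_dev):.2f}%")
```

The largest ratio is for ext3, at roughly 0.7% of the average run time.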

Now the juicy part of the results - the throughput results. Because of the large number of tests that are run, the results are split into two groups.

The first group is for the write tests: write, re-write, random write, record re-write, fwrite, frewrite.

The second group is for the read tests: read, re-read, random read, backwards read, strided read, fread, and freread. Each table below covers one of the two groups for a specific record size (1MB, 4MB, 8MB, 16MB), so there are 8 tables of results.

The first two tables of results are for the 1MB record size. Table 2 below presents the throughput in KB/s for the file systems for the 6 write tests.

Table 2 - IOzone Write Performance Results with a Record Length of 1MB and a File Size of 16GB
File System         Write      Re-write  Random write  Record re-write        fwrite      frewrite
                   (KB/s)        (KB/s)        (KB/s)           (KB/s)        (KB/s)        (KB/s)
Reiser4        100,476.50     94,290.90     62,220.00     3,258,345.00    100,095.70     94,258.60
  std dev          167.31        231.34         98.06       372,536.39        197.08        156.25
ext3            72,938.10     75,759.60     53,709.90     2,715,427.60     72,705.10     75,250.30
  std dev          272.06        435.48        712.81        22,943.63        573.25        492.69
ext4           109,339.90    103,843.50     66,683.50     2,980,795.70    109,147.50    108,184.30
  std dev          370.61      8,348.59        445.16        42,895.82        180.18        165.20

Table 3 below presents the throughput in KB/s for the file systems for the 7 read tests for a record length of 1MB.

Table 3 - IOzone Read Performance Results with a Record Length of 1MB and a File Size of 16GB
File System          Read       Re-read   Random read  Backwards read  Strided read         fread       freread
                   (KB/s)        (KB/s)        (KB/s)          (KB/s)        (KB/s)        (KB/s)        (KB/s)
Reiser4        106,777.20    106,861.50     61,980.50       66,201.50     49,071.90    106,719.80    106,787.60
  std dev           61.33         22.18         78.35          144.74        122.04         39.03         61.63
ext3            71,903.60     76,442.60     52,851.20       59,207.50     49,839.60     71,746.00     70,165.50
  std dev        2,688.14      4,827.36         79.31           61.13         56.14      3,857.21      2,430.79
ext4            97,969.20     98,050.60     52,620.10       75,419.50     51,048.90     98,039.90     98,041.30
  std dev           46.97         55.21        104.39          135.84         35.03         15.35         54.90

The next two tables of results are for the 4MB record size. Table 4 below presents the throughput in KB/s for the file systems for the 6 write tests.

Table 4 - IOzone Write Performance Results with a Record Length of 4MB and a File Size of 16GB
File System         Write      Re-write  Random write  Record re-write        fwrite      frewrite
                   (KB/s)        (KB/s)        (KB/s)           (KB/s)        (KB/s)        (KB/s)
Reiser4        100,360.50     94,389.00     71,415.40     2,711,385.40     99,821.10     94,180.20
  std dev          181.21        217.43        306.27       269,436.91        165.40        197.30
ext3            73,167.60     75,142.20     58,419.20     2,275,931.10     72,502.40     75,054.70
  std dev          478.11        726.40        680.76        69,312.23        464.13        999.82
ext4           109,407.30    101,720.10     76,935.90     2,496,937.70    109,115.20    105,865.70
  std dev          142.35      9,752.95        406.46        91,639.95        220.57      6,536.20

Table 5 below presents the throughput in KB/s for the file systems for the 7 read tests for a record length of 4MB.

Table 5 - IOzone Read Performance Results with a Record Length of 4MB and a File Size of 16GB
File System          Read       Re-read   Random read  Backwards read  Strided read         fread       freread
                   (KB/s)        (KB/s)        (KB/s)          (KB/s)        (KB/s)        (KB/s)        (KB/s)
Reiser4        106,758.60    106,772.40     95,476.80       99,972.00     88,978.80    106,834.40    106,828.60
  std dev           63.64         63.60         63.67          162.45        109.89         34.88         74.42
ext3            72,267.70     73,857.90     71,447.60       82,327.20     81,040.40     72,087.80     71,736.10
  std dev        3,011.67      4,154.39        122.82          195.91        221.83      3,449.44      3,411.92
ext4            97,929.80     98,041.80     86,095.00      102,674.20     88,711.90     98,012.70     98,054.10
  std dev           43.56         44.83        108.35           88.16        113.00         45.39         37.33

The next two tables of results are for the 8MB record size. Table 6 below presents the throughput in KB/s for the file systems for the 6 write tests.

Table 6 - IOzone Write Performance Results with a Record Length of 8MB and a File Size of 16GB
File System         Write      Re-write  Random write  Record re-write        fwrite      frewrite
                   (KB/s)        (KB/s)        (KB/s)           (KB/s)        (KB/s)        (KB/s)
Reiser4        100,374.50     94,419.50     74,282.60     2,435,265.10     99,884.00     94,109.20
  std dev          202.35        189.81        483.86        74,257.84        158.36        191.79
ext3            73,112.90     75,521.10     60,274.30     1,402,615.20     72,796.30     75,159.00
  std dev          298.23        500.19        294.63        38,937.18        412.19        651.18
ext4           109,327.10    104,004.90     79,617.30     1,532,668.10    109,296.60    107,530.10
  std dev          171.36      8,327.54        647.18        33,213.47         95.56      1,743.57

Table 7 below presents the throughput in KB/s for the file systems for the 7 read tests for a record length of 8MB.

Table 7 - IOzone Read Performance Results with a Record Length of 8MB and a File Size of 16GB
File System          Read       Re-read   Random read  Backwards read  Strided read         fread       freread
                   (KB/s)        (KB/s)        (KB/s)          (KB/s)        (KB/s)        (KB/s)        (KB/s)
Reiser4        106,770.60    106,837.40    106,224.90      107,486.60    101,737.00    106,792.90    106,862.30
  std dev           63.29         59.29         98.67          126.50        183.12         52.36         26.13
ext3            74,145.80     77,749.70     76,112.30       81,136.90     83,584.60     71,949.30     72,275.10
  std dev        4,309.27      3,741.80         66.45          221.88        147.11      2,636.84      3,634.02
ext4            97,985.40     98,031.20     98,299.90      108,635.40    100,060.90     97,968.30     98,053.30
  std dev           40.32         55.91        195.36          159.98        128.50         43.66         18.49

The final two tables of results are for the 16MB record size. Table 8 below presents the throughput in KB/s for the file systems for the 6 write tests.

Table 8 - IOzone Write Performance Results with a Record Length of 16MB and a File Size of 16GB
File System         Write      Re-write  Random write  Record re-write        fwrite      frewrite
                   (KB/s)        (KB/s)        (KB/s)           (KB/s)        (KB/s)        (KB/s)
Reiser4        100,590.20     94,398.70     78,827.60     2,346,101.20     99,920.00     94,045.80
  std dev          784.99        232.84        460.14        38,954.92        192.24        393.75
ext3            72,999.50     75,406.00     62,060.20     1,343,648.10     72,988.20     74,796.40
  std dev          397.39        452.57        457.09        30,995.74        673.37        650.74
ext4           109,348.90    103,807.50     81,682.40     1,505,397.20    109,129.30    101,880.90
  std dev          141.87      8,477.20      1,304.42        41,948.58        201.36      9,679.37

Table 9 below presents the throughput in KB/s for the file systems for the 7 read tests for a record length of 16MB.

Table 9 - IOzone Read Performance Results with a Record Length of 16MB and a File Size of 16GB
File System          Read       Re-read   Random read  Backwards read  Strided read         fread       freread
                   (KB/s)        (KB/s)        (KB/s)          (KB/s)        (KB/s)        (KB/s)        (KB/s)
Reiser4        106,780.70    106,821.40    112,629.90      112,646.50    110,757.00    106,736.50    106,850.50
  std dev           39.83         44.57         67.25          178.95        149.74         66.59         16.02
ext3            74,781.80     76,942.70     78,993.80       81,133.80     84,995.70     73,546.40     71,958.20
  std dev        4,466.66      4,478.32         83.48          114.28        266.05      3,656.29      3,442.78
ext4            97,943.60     98,040.20    105,651.60      111,708.70    106,597.60     97,944.60     98,053.00
  std dev           31.41         39.20        151.52          219.17        127.14         53.76         22.69


Recall that this article is really an exploration and not a comparison. Its intent is to examine the throughput performance of various Linux file systems using IOzone.

In addition, only 3 file systems are covered here so even doing the inevitable comparison at this stage is a bit premature. However, there are some interesting observations that can be made:
  • Write and re-write performance are relatively insensitive to record sizes over the range tested (1MB to 16MB) for the parameters of this test.
  • Random write performance improved as the record size increased from 1MB to 16MB. For example, ext3 went from 53,709.90 KB/s to 62,060.20 KB/s, an increase of about 16%.
  • Record re-write performance got worse as the record size increased. This performance can be driven by various cache and buffer sizes. Perhaps as the record size increases, some cache sizes are exceeded and performance decreases.
  • Read and re-read performance are relatively insensitive to record size over the range tested (1MB to 16MB) for the parameters of this test.
  • Random read performance got dramatically better as the record size increased. For example, at 1MB the random read performance for ext4 was 52,620.10 KB/s and at a record size of 16MB it was 105,651.60 KB/s. This is about a 100% increase in throughput performance! An increase in performance is expected because there are fewer records, but the change in performance is fairly dramatic.
  • Backward read performance also dramatically improved with increasing record size. This is to be expected since there is less drive head movement because there are fewer total records.
  • Strided read throughput also improved dramatically as the record size was increased. For ext4, throughput again slightly more than doubled, from 51,048.90 KB/s at a record size of 1MB to 106,597.60 KB/s at a record size of 16MB.
  • Fread and freread performance are relatively insensitive to a change in record size for the parameters tested here.
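
The percentage changes mentioned in the observations above follow from the table values. A small sketch of the calculation, using numbers copied from Tables 2, 3, 8, and 9 (the function name is invented for this example):

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

if __name__ == "__main__":
    # ext4 random read, 1MB records vs. 16MB records (Tables 3 and 9).
    print(f"ext4 random read: {percent_change(52620.10, 105651.60):.1f}%")
    # ext3 random write, 1MB records vs. 16MB records (Tables 2 and 8).
    print(f"ext3 random write: {percent_change(53709.90, 62060.20):.1f}%")
```
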
It is almost impossible to resist doing even a small file system comparison at this stage. One of the more obvious observations is that the newer file systems, reiser4 and ext4, have remarkably greater throughput than ext3.

However, what is also somewhat surprising is that reiser4 is known for its performance, yet ext4, which is designed to have some backwards compatibility with ext3, almost matches its performance for the tests in this article.

But, this does not mean that ext4 gives the same performance as reiser4 overall or that one should select one file system over another based on these results alone.

As you can probably guess, in the near future there will be additional results for the other file systems. Since there are so many options as well as file systems, it may take a couple of articles to finish all of the benchmarks. So keep an eye out for these upcoming article(s).
