Wednesday, October 28, 2009

Anatomy of SSDs

SSDs (Solid-State Drives) are a hot topic right now for a number of reasons, not the least of which is their power-to-performance ratio.

But to better understand SSDs you should first get a grip on how they are constructed and the features/limitations of these drives.

SSDs are perhaps the hottest new hardware development in storage. They offer the promise of very high performance and low power.
From the lowly laptop SSD to the ultra high-performance of Fusion-IO and Texas Memory, SSDs have a great deal of buzz about them as witnessed by the number of reviews and tech-focused articles around the web and the print media.
As with all technologies, there are benefits and there are limitations. The goal of this article is to help you understand the technology, including its benefits and limitations, by beginning with the building blocks: the NAND Flash chips. To truly understand them you have to start with the underlying technology, floating-gate transistors.
Floating-Gate Transistors
The concept of a floating-gate transistor is the key to understanding Flash memory or Flash storage and thus SSDs. Figure 1 below from Anandtech illustrates the floating-gate transistor.
Figure 1 - Floating Gate Transistor
Between the floating gate and the substrate is the tunnel oxide. This is the barrier to the floating gate, and it is the layer through which the electrons “tunnel” into the floating gate.

The transistor either has electrons tunneled into the floating gate (indicating a logical 0) or does not have any electrons tunneled into the floating gate (indicating a logical 1).

The process of forcing electrons to or from the floating gate, called Fowler-Nordheim Tunneling (F-N Tunneling), is achieved by applying a voltage between the control gate and the source or drain.

When the charge is removed the floating gate either retains the electrons if they were tunneled into the gate, or has no extra electrons if they were removed.

This allows Flash memory to retain values after power is removed.

To program (write) to the transistor, which creates a logical 0, a positive voltage is applied to the drain which activates the electrons underneath the floating gate in the substrate.

Then a larger positive voltage is applied to the control gate forcing the electrons to tunnel into the floating gate.

To erase the transistor, a negative voltage is applied to the control gate and a positive voltage is applied to the source.

This forces the electrons out of the floating gate and into the source.

To read the transistor a positive voltage that is lower than the write voltage is applied to the control gate.

An additional positive voltage, also lower than the write voltage, is applied to the drain.

The current between the source and the drain will determine if the floating gate has extra electrons (logical 0) or does not (logical 1).

Writing (programming) takes much more time than reading because of the time it takes for the electrons to tunnel into the floating gate.

Erasing can also take more time because the electrons have to move out of the floating gate and into the source.

This erase process takes slightly less voltage than the write operation but much more than the read operation.

Currently there are two types of floating-gate arrays (cells) in use. The Single Level Cell (SLC) is read using the previously mentioned technique but the current is just sensed as being present or not (i.e. it’s not actually measured).

In the case of Multi-Level Cells (MLC), four levels of current can be sensed, allowing the transistor to store two bits of information instead of one.
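The difference between SLC and MLC sensing can be sketched in a few lines. This is only an illustration: the normalized current scale, the threshold values, and the mapping of bands to bit pairs are assumptions made for the example, not values from any datasheet. The only fixed point is that two bits need four distinguishable states.

```python
from typing import Tuple

# Illustrative only: currents are normalized to the range 0.0-1.0 and the
# thresholds and bit assignments below are assumptions, not vendor values.

def read_slc(current: float, threshold: float = 0.5) -> int:
    """SLC: the current is only sensed as present (1) or absent (0)."""
    return 1 if current >= threshold else 0

def read_mlc(current: float) -> Tuple[int, int]:
    """MLC: the current falls into one of four bands, giving two bits."""
    bands = [(0.25, (0, 0)), (0.50, (0, 1)), (0.75, (1, 0))]
    for upper, bits in bands:
        if current < upper:
            return bits
    return (1, 1)
```

Because an MLC read has to tell four bands apart instead of two, there is less margin between adjacent states than in an SLC read.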

The floating-gate transistors serve as the basis of what are called NAND Flash chips. In the next section how flash chips are constructed from floating-gate transistors will be discussed.

Creating NAND Flash Units
The most common cell used in SSDs is the NAND Flash. For this device, the transistors are connected in series.

Then these groups are connected in a NOR style, where one end of each line is connected directly to ground and the other end is connected to a bit line.

This arrangement has advantages for cost, density, and power as well as performance but it is a compromise that has some implications as will be discussed later in the article.

In NAND Flash memory the cells are first arranged into pages. Typically, a page is 4KB in size. Then the pages are combined to form a block.

A block, illustrated below in Figure 2, is typically formed from 128 pages giving a block a size of 512KB.
Figure 2 - Block View of NAND Flash

The blocks are combined into a plane. Typically a total of 1,024 blocks are combined into a plane, giving it a typical size of 512MB as shown in Figure 3.
Figure 3 - Plane View of NAND Flash
Typically there will be multiple planes in a single NAND-flash die. Manufacturers will also put several dies into a single NAND flash chip. Then you can put multiple chips in a single drive.
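The page, block, and plane sizes quoted above follow from simple multiplication, which the short sketch below reproduces. The page, block, and plane counts come from the text; the number of planes, dies, and chips passed to `drive_capacity_gb` is a made-up example, since the article does not give typical values for those.

```python
# Sizes from the text: 4KB pages, 128 pages per block, 1,024 blocks per plane.
PAGE_KB = 4
PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 1024

BLOCK_KB = PAGE_KB * PAGES_PER_BLOCK            # 512 KB per block
PLANE_MB = BLOCK_KB * BLOCKS_PER_PLANE // 1024  # 512 MB per plane

def drive_capacity_gb(planes_per_die: int, dies_per_chip: int, chips: int) -> float:
    """Raw capacity in GB for a hypothetical drive layout."""
    return PLANE_MB * planes_per_die * dies_per_chip * chips / 1024

# Hypothetical layout: 2 planes per die, 2 dies per chip, 16 chips.
example_gb = drive_capacity_gb(2, 2, 16)
```

With that hypothetical layout the raw capacity works out to 32GB.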

Discussion of Features/Limitations of NAND Flash
Before you use a new technology it is always a good idea to have the best possible understanding of its features and limitations so that you can make better, informed decisions and have a basis for understanding test results (good and bad).

SSDs are an exciting new technology that people are using in everyday life ranging from simple USB drives (not really SSD but they still use NAND Flash), to laptop drives, and even to enterprise class SSDs.

Much of this use is based primarily on performance considerations but also on power and the fact that it has no moving parts (think laptops).

In the following sub-sections, these attributes as well as others will be discussed from a feature/limitation perspective.

But, keep in mind that the features/limitations of NAND Flash are being discussed, not the features/limitations of SSD drives themselves.

Performance (Asymmetric and Otherwise)
Performance is probably the primary reason for the excitement around SSDs (as I always say, who doesn’t like more performance?).

To better understand performance of the NAND Flash chips, recall that there are two types of cells: SLC and MLC.

SLC can store a single bit of data while MLC can store two bits. MLC sounds really great because you can store twice the amount of data compared to SLC but you pay a penalty for this extra data density.

Table 1, with data from this link illustrates the differences between SLC and MLC from a performance perspective.

Table 1 - Performance Timings of SLC and MLC
                       SLC NAND flash    MLC NAND flash
Random Read            25 μs             50 μs
Erase                  2 ms per block    2 ms per block
Programming (Write)    250 μs            900 μs

Notice that the read performance of MLC is twice as slow as SLC as could be expected, but the write performance is over 3 times slower.

Writing to NAND Flash is a multi-step process. In general, for writing to cells that have existing data, you first have to read the cells, then erase them, and then program (write) them.

If you look in Table 1, the read step is fairly fast, but the erase step is two orders of magnitude slower than reading.

The programming step, while not as slow as erasing, is still 10-20 times slower than a read.

Consequently, writing is not a fast operation compared to read.

This data points out that NAND Flash provides asymmetric performance. Reads are amazingly fast.

Programs (writes) are about 10 times slower than reads.

Erase/Program (writing over existing data) is 2-3 orders of magnitude slower than reads.
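The ratios quoted in this section can be checked directly against the Table 1 timings. The sketch below does only that arithmetic. One assumption to keep in mind: Table 1 states that the erase time is per block, but it does not say what unit the read and program times apply to, so the comparison is between the raw numbers as listed.

```python
# Timings from Table 1, converted to microseconds.
TIMINGS_US = {
    "SLC": {"read": 25, "program": 250, "erase": 2000},
    "MLC": {"read": 50, "program": 900, "erase": 2000},
}

def slowdown_vs_read(cell: str, op: str) -> float:
    """How many times slower an operation is than a read on the same cell type."""
    return TIMINGS_US[cell][op] / TIMINGS_US[cell]["read"]

def mlc_vs_slc(op: str) -> float:
    """How many times slower MLC is than SLC for the same operation."""
    return TIMINGS_US["MLC"][op] / TIMINGS_US["SLC"][op]
```

Programming comes out 10 times (SLC) to 18 times (MLC) slower than a read, a single block erase is 80 times (SLC) or 40 times (MLC) slower than a read, and MLC programming is 3.6 times slower than SLC programming.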

Data Retention Time
One of the attractive features of NAND Flash (and thus of SSDs) is that it retains its information after the power is removed. It is reported that the data can be retained for 10-100 years.

The reason that they don’t last longer is that over time the electrons can “leak” from the floating gate resulting in data corruption (i.e. “bit rot”).

In addition, as the number of erase/program cycles increases, the retention period shortens (see this link for more information).

Data Corruption Due to Die Shrink
An additional problem pointed out in this link is that as the cells shrink in size, the probability of causing data corruption increases.

Recall that you need fairly high voltages to erase/program cells - in some cases, this can be up to 12V.

As the cells shrink, the distance between the source and drain diminishes, but the voltages stay approximately the same.

So the probability that an erase/program step might “disturb” a neighboring cell, possibly causing data corruption, increases.

Getting to higher densities may not be easily achieved because of this possible data corruption, with this link indicating that the lower limit may be 20 nm.

However, companies are actively researching new materials and techniques to reduce the required voltages, allowing higher densities with the same data corruption probability.

Erase/Write Limits
This is probably one of the most mentioned limitations of NAND Flash chips - the limit to the number of times the transistor can go through an erase and write (program) cycle.

Recall that to write data, except to cells that have never been written to, the data must first be read, then certain cells must be erased, and then certain cells must be programmed (written).

After a certain number of cycles, the transistor can no longer retain electrons in the floating gate to a level that allows it to be used for storing data.

This limit is commonly referred to as the erase-write limit or just as the rewrite limit.

From the SNIA paper previously referenced, the following are the typical erase/program cycle limits for SLC and MLC NAND based flash cells:
  • SLC: 100,000
  • MLC: 5,000-10,000

These are typical erase/write cycle limits but the exact number depends upon the manufacturer.

For example, in late 2008 Micron and Sun announced an SLC with a limit of 1,000,000 cycles.

Block Erasing
NAND Flash cells can be very easily read one byte (bit) at a time. Lower voltages are applied to the floating-gate transistor and the resulting current is measured.

You can easily write to a single page if it is pristine (i.e. no data has been written to it before). However, if there is existing data then the data needs to be erased as part of the write cycle.

Current NAND Flash chips can only be erased in units of blocks (512KB).

The block erase limitation can have a very large impact on performance. For example, if an application is re-writing data, then it is possible that only a few bytes in a block will need to be erased as part of the re-write.

But this forces the entire contents of the block (512KB) to be read and temporarily stored somewhere, the block to be erased, the existing unchanged data to be merged with the new data, and the result to be written back to the block.

To possibly change a few bytes, the entire 512KB of data in the block has to be erased. This also includes the case when the block is partially used - a rewrite of any data in the block will force the entire block to be erased even if most of the block is not holding any data.

Another potentially important consequence of this limitation is that all of the cells in the block have to be erased, and possibly rewritten, which consumes an erase/program cycle for each of those cells.

Recall that NAND Flash cells have a limited number of erase/program cycles, so spending those cycles when only a very small part of the block is updated is very expensive. In addition, the erase/program cycle is much slower than the read operation, possibly reducing write throughput.
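A rough back-of-the-envelope estimate shows how expensive a block rewrite can be compared with writing a pristine page. The sketch uses the SLC timings from Table 1 and makes several assumptions that the article does not state: the read and program timings are taken to be per page, every one of the 128 pages in the block is read and reprogrammed (a worst case), and the operations run strictly one after another with no controller overhead.

```python
# SLC timings from Table 1 (microseconds); the per-page interpretation of the
# read and program times is an assumption.
READ_US, PROGRAM_US, ERASE_US = 25, 250, 2000
PAGES_PER_BLOCK = 128

def pristine_page_write_us() -> int:
    """Writing to a page that has never held data needs only a program step."""
    return PROGRAM_US

def block_rewrite_us() -> int:
    """Worst case: read every page, erase the block, reprogram every page."""
    return PAGES_PER_BLOCK * READ_US + ERASE_US + PAGES_PER_BLOCK * PROGRAM_US

ratio = block_rewrite_us() / pristine_page_write_us()
```

Under these assumptions a full block rewrite takes 37,200 μs (about 37 ms), roughly 150 times longer than programming one pristine page.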

Seek Times
One advantage that NAND Flash has over rotating media is seek time. For rotational media, the location of the data has to be computed, the drive head has to be moved to the right location and there may be a pause for the disk to rotate to the correct spot.

For data that is spread over the disk, this may force the drive head to move all over the disk possibly greatly reducing throughput.

However, SSD drives constructed from NAND Flash cells don’t suffer from this problem.

For NAND Flash, only the location of the bits/bytes needs to be computed and then the read operation can take place.

There is no mechanical movement - it is all done electrically.

Consequently the seek time, the amount of time it takes to find the data, is greatly reduced. For workloads where seek times are important (e.g. IOPS driven workloads), SSDs have a huge performance advantage over hard drives.

In fact, you can do reads in parallel if the drive controller and the drive are capable of parallel operations.

In summary, the features/limitations of NAND Flash are:
  • Very fast read performance
  • Asymmetric read/write performance (reads are 2-3 orders of magnitude faster than writes)
  • There are data retention limitations due to leakage and due to exercising the cells (i.e. using the erase/program cycles)
  • Shrinking the dies to increase density increases the probability of data corruption from erase/program functions disturbing neighboring cells
  • NAND Flash cells have a limited number of erase/program cycles before they can no longer retain data
  • NAND Flash cells can read a byte at a time or read/write a page at a time, but an entire block must be erased if one cell in the block is erased
  • Seek times for NAND Flash chips are much lower than for hard drives

It may seem that the picture isn’t as rosy as reports have stated, but remember that these are the features/limitations of the NAND Flash cells themselves.

The next section will discuss several techniques that manufacturers have employed to build SSD drives and help overcome or at least moderate some of the limitations of the drives, resulting in some very high performance drives.

Chasing the Devil in the Details

There definitely is enough promise in NAND Flash chips to justify their development, but there are limitations or challenges that need to be addressed to make the resulting drives compelling.

Companies have been working on techniques for overcoming these limitations and this section will describe a few of them.

The two biggest problems that manufacturers have been addressing are: the erase/program cycle limitation, and the performance problem when overwriting old data (overcoming the slow read/erase/program cycle).

Erasing/Writing Data Scenario
Before jumping into techniques that are being used to improve SSDs, let’s examine a simple scenario that illustrates a fairly severe problem.

Let’s assume we have an application that wants to write some data to an SSD drive. The drive controller puts the data on some unused pages (i.e. they’ve never had data written to them so this is done with a write operation with no need to do an erase first).

This data is much smaller than the block (less than 512KB). Then some additional data is written to the SSD, and this too is written to unused pages in the same block.

Then the application decides to erase the first piece of data. Recall that the SSD can’t erase individual pages but only blocks. In an effort to save time and improve performance, the drive controller just marks those pages as unused but does not erase them.

Next, another application writes data to the drive that will use the remaining pages in the block including the pages marked as unused.

This forces the controller to erase the pages marked as unused because there is existing data on them, but this in turn forces the entire block to be erased.

The basic process that the controller goes through is something like the following:
  • Copy the entire contents of the block to a temporary location (likely cache)
  • Remove the unused data from the cache (this is the erased data from the first write)
  • Add the new data to the block in cache
  • Erase the targeted block on the SSD drive
  • Copy the entire block from the cache to the recently erased block
  • Empty the cache
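The steps above can be sketched as a small simulation. This is a toy model, not how any particular controller is implemented: a block is represented as a Python list of page contents, `None` stands for an empty page, and the function name `rewrite_block` and its parameters are invented for the illustration.

```python
from typing import List, Optional, Set

def rewrite_block(block: List[Optional[str]],
                  stale: Set[int],
                  new_data: List[str]) -> List[Optional[str]]:
    """Read-modify-erase-write cycle for one block (toy model).

    block    : page contents, None meaning the page holds no data
    stale    : indexes of pages the controller marked as unused
    new_data : page-sized payloads that must be stored in this block
    """
    cache = list(block)                      # copy the block to a cache
    for i in stale:                          # remove the unused data
        cache[i] = None
    free = [i for i, page in enumerate(cache) if page is None]
    if len(new_data) > len(free):
        raise ValueError("not enough free pages in this block")
    for i, payload in zip(free, new_data):   # add the new data in cache
        cache[i] = payload
    block[:] = [None] * len(block)           # erase the whole block
    block[:] = cache                         # copy the cache back
    cache.clear()                            # empty the cache
    return block
```

For a four-page block holding `["A0", "A1", "B0", None]` where pages 0 and 1 are stale, writing three new pages leaves `["C0", "C1", "B0", "C2"]`: the still-valid page `B0` had to be copied out and written back even though it never changed.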

As you can see the process can be very involved. Just because the first few pages of data were erased by the application this forced the entire block to be erased just to use those pages.

Recall that this can be a very expensive operation in terms of time because of the copying of data and erasing the block (recall that erasure is 2-3 orders of magnitude slower than a read).

This also kills write throughput performance. Many of the techniques discussed below are used to help overcome this performance problem.

Wear Leveling
One of the biggest challenges in using NAND Flash chips is the limited number of erase/program cycles, and it was one of the first that manufacturers addressed.

Many of the controllers in SSD drives keep track of the number of erase/program cycles in the drive.

The controller then tries to put data in locations to avoid “hot spots” where certain cells may have a much smaller number of erase/program cycles remaining.

This is commonly referred to as “wear leveling.” This approach has been fairly successful in avoiding hot spots within the drives but it does require a fair amount of work by the SSD drive controller.
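A minimal sketch of the idea, assuming the simplest possible policy: the controller keeps an erase count per block and always places new data on the available block with the fewest cycles used. Real controllers use more elaborate schemes; the function names here are invented for the example.

```python
from typing import Iterable, List

def pick_block(erase_counts: List[int], free_blocks: Iterable[int]) -> int:
    """Choose the free block with the fewest erase/program cycles used."""
    # sorted() makes ties resolve to the lowest block number.
    return min(sorted(free_blocks), key=lambda b: erase_counts[b])

def simulate(num_blocks: int, writes: int) -> List[int]:
    """Spread a number of block writes over the drive, tracking wear."""
    counts = [0] * num_blocks
    for _ in range(writes):
        b = pick_block(counts, range(num_blocks))
        counts[b] += 1
    return counts
```

With this policy the erase counts of any two blocks never differ by more than one in the simulation, which is the "no hot spots" behavior the controller is after. The cost is the bookkeeping: the counts have to be stored and consulted on every placement decision.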

Over-Provisioning
The idea behind over-provisioning is to have a “reserve” of spare pages and blocks (capacity) that can be used for various needs by the controller.

This spare capacity is not presented to the OS, so only the drive controller knows about it. However, this spare capacity does diminish the usable capacity of the drive.

For example the drive may have 64GB of actual capacity but the drive only appears as 50GB to the OS.

Therefore the drive has a spare capacity of 14GB (over-provisioned). In effect you are paying for space you cannot use. However, this spare capacity can be very useful.
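The arithmetic for the 64GB example is shown below. The spare capacity can be expressed as a fraction of the raw capacity or of the visible capacity; which convention a given vendor uses is not stated in this article, so the sketch simply returns both.

```python
from typing import Tuple

def over_provisioning(raw_gb: float, visible_gb: float) -> Tuple[float, float, float]:
    """Return (spare GB, spare as fraction of raw, spare as fraction of visible)."""
    spare = raw_gb - visible_gb
    return spare, spare / raw_gb, spare / visible_gb

spare_gb, of_raw, of_visible = over_provisioning(64, 50)
```

For the drive in the example, 14GB is held back, which is about 22% of the raw capacity or 28% of the capacity the OS sees.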

Let’s return to the erasing/writing data scenario. The first few pages are marked as unused but haven’t been erased yet and the second data write has stored data on the pages.

Now the third data write needs the remaining pages including the unused pages on the block. This triggers the cycle of copying the entire block to a cache, merging the new data into the cache, erasing the targeted block, and writing the new block from cache to the drive.

But now, we have some extra space that might be useful with this third data write.

Instead of having to erase the unused portion of the block to accommodate the third data write, the controller can use some of the spare space instead.

This means that the sequence of reading the entire block, merging the new data, erasing the block, and writing the entire new block, can be avoided.

The controller just maps spare space to be part of the drive capacity (so it is seen by the OS) and moves the unused pages to the spare capacity portion of the drive.

Then the write occurs using the “fresh” spare space, but at some point the unused pages will have to be erased forcing the erase/write sequence and hurting performance.

In an effort to save performance there are some controllers that have logic that tries to do the erase/write sequence in the background or when the drive is not being used.

While this can work in some cases, it may not help drives that are very heavily used since there isn’t much time when the drive is “quiet.”

In addition to helping performance, the spare capacity can also be used when severe hot spots or bad areas develop in the drive.

For example, if a certain set of pages or even blocks has far fewer erase/write cycles remaining than most of the drive, then the controller can just map spare pages or blocks to be used instead.

Moreover, the controller can watch for bad writes and use the spare capacity as a “backup” for bad spots (similar to extra blocks on hard drives).

The controller can check for bad writes by doing an immediate read after the write (recall that reads are 2-3 orders of magnitude faster than writes). If the read does not match the data then the write is considered bad.

The controller then remaps that part of the drive to some spare pages or blocks within the drive.
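The read-after-write check and remapping can be illustrated with a toy model. Everything here is invented for the sketch: the `Drive` class, the logical-to-physical map, and the way a bad page is simulated (it silently stores nothing). It is only meant to show the control flow described above.

```python
from typing import Dict, Iterable, List, Optional

class Drive:
    """Toy drive with visible pages, spare pages, and some bad pages."""

    def __init__(self, visible_pages: int, spare_pages: int,
                 bad: Iterable[int] = ()) -> None:
        self.pages: Dict[int, Optional[bytes]] = {}           # physical page -> data
        self.map = {i: i for i in range(visible_pages)}       # logical -> physical
        self.spares: List[int] = list(
            range(visible_pages, visible_pages + spare_pages))
        self.bad = set(bad)

    def _program(self, phys: int, data: bytes) -> None:
        # Simulated failure: a bad page does not retain what was written.
        self.pages[phys] = None if phys in self.bad else data

    def write(self, logical: int, data: bytes) -> None:
        phys = self.map[logical]
        self._program(phys, data)
        while self.pages[phys] != data:        # immediate read-back and compare
            if not self.spares:
                raise RuntimeError("out of spare pages")
            phys = self.spares.pop(0)          # remap to a spare page
            self.map[logical] = phys
            self._program(phys, data)

    def read(self, logical: int) -> Optional[bytes]:
        return self.pages[self.map[logical]]
```

In this model a write to a bad page is detected on the read-back, the logical page is pointed at a spare, and the OS never sees the failure. The spare pool shrinks by one page each time this happens.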

It’s fairly obvious that over-provisioning, while using capacity, can increase the performance and the data integrity of the drive.

Write Amplification Avoidance
One side effect of wear leveling is that it sometimes increases the number of writes that a controller must perform to spread the wear evenly across the cells. This extra writing is known as write amplification. But the number of writes (erase/write) is something that should be minimized since the cells have a limited number of them. SSD drive controllers go to great lengths to minimize the number of writes. With the inclusion of over-provisioning, write amplification can be reduced by using the spare cells. The buffer space can also be combined with logic to hold data in a buffer and wait for some period of time anticipating additional data changes before the data is actually written. This too can help reduce the number of writes.
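Write amplification is usually expressed as the ratio of the data actually written to the flash to the data the host asked to write. The sketch below computes that ratio for the worst case described earlier in this article, where changing a single 4KB page forces a whole 512KB block to be rewritten. The function name is illustrative.

```python
def write_amplification(flash_bytes_written: int, host_bytes_written: int) -> float:
    """Ratio of bytes written to flash to bytes the host requested."""
    return flash_bytes_written / host_bytes_written

PAGE_BYTES = 4 * 1024
BLOCK_BYTES = 512 * 1024

# Host changes one page, controller rewrites the whole block.
worst_case = write_amplification(BLOCK_BYTES, PAGE_BYTES)

# Host data lands on pristine pages, nothing else is rewritten.
best_case = write_amplification(PAGE_BYTES, PAGE_BYTES)
```

The worst case here is a factor of 128 and the best case is 1, which is why controllers try so hard to steer writes toward clean pages.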

Internal RAID
While it is not specifically called out in many drives, newer SSD drive controllers are capable of performing internal RAID.

This is RAID for performance, not necessarily RAID for data reliability. Rotating media has a single drive head that actually does the reading and writing from the disk, resulting in the actual IO path in the drive being a serial operation.

But SSD drives do not have a drive head. Consequently, the controllers and drives can be designed such that several data operations happen in parallel.

The benefits of internal RAID are fairly obvious. A drive could have a version of RAID-0 to split the data into multiple parts for writing or reading (don’t forget that SSD drives already have fantastic read performance).
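A RAID-0 style split inside the drive might look something like the following sketch. The notion of independent "channels", the round-robin layout, and the assumption that channels work fully in parallel with no overhead are all simplifications made for the example. The 250 μs default is the SLC program time from Table 1, assumed here to be per page.

```python
import math
from typing import List, Sequence

def stripe(pages: Sequence, channels: int) -> List[list]:
    """Distribute pages round-robin across independent channels."""
    lanes: List[list] = [[] for _ in range(channels)]
    for i, page in enumerate(pages):
        lanes[i % channels].append(page)
    return lanes

def parallel_time_us(num_pages: int, channels: int, per_page_us: int = 250) -> int:
    """Idealized time to program num_pages when channels work in parallel."""
    return math.ceil(num_pages / channels) * per_page_us
```

In this idealized model, programming 128 pages takes 32,000 μs on one channel and 8,000 μs across four channels.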

Data Coalescence
This is a fairly common technique where the controller holds the incoming write data in a buffer and reorders the operations to better suit the SSD drive.

For example, it may hold incoming data for some period of time anticipating that more neighboring data may be forthcoming.

This is especially important in the case of the block erase limitation. The controller will try to buffer data as long as possible, trying to reach one block in size in the buffer before committing the data to the drive.

This makes writing the data much more efficient because the entire block is full.
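The buffering behavior can be sketched as a small class that collects incoming pages and only commits them once a full block has accumulated. This is a toy model with invented names. It leaves out reordering and the time-out a real controller would need so that data does not sit in the buffer indefinitely.

```python
from typing import List

class CoalescingBuffer:
    """Hold incoming pages until a whole block can be written at once."""

    def __init__(self, block_pages: int = 128) -> None:
        self.block_pages = block_pages
        self.pending: List[str] = []          # pages waiting in the buffer
        self.flushed: List[List[str]] = []    # full blocks committed to flash

    def write(self, page: str) -> None:
        self.pending.append(page)
        if len(self.pending) == self.block_pages:
            self.flushed.append(self.pending)  # commit one full block
            self.pending = []
```

With a block size of 4 pages, ten incoming page writes result in two full-block commits and two pages still waiting in the buffer.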

TRIM command
The TRIM command is a great way for SSD drives to maintain good performance by wiping pages clean when they are deleted, prior to new data being written to them.

The TRIM command isn’t in the Linux kernel as far as I know and drives that support the command are only now appearing. But TRIM support is available in Windows 7 (ouch).

The TRIM command works by forcing an actual erase of the unused pages during the data delete step where performance may not be as important as during a write step.

In other words, when a page or more is deleted by the application, it is erased immediately.

If we return to the erase/write scenario, after the second data write, the data from the first write is erased (removed).

Normally the drive controller defers the actual erase step until it is absolutely necessary.

Unfortunately, it becomes necessary when the very next write needs that unused space, forcing the block to go through the whole process of erase/write, impacting the write performance of the drive.

The TRIM command forces the controller to do the actual erase of the unused page during the data delete step.

The additional overhead of copying the good data from the block to cache, erasing the entire block, and then copying the cache back to the block, all happens during the data delete where performance may not be as big an issue as during a data write.

SSD drives are an exciting technology for data storage and IO performance. The drives have been out for a while and the prices are starting to gradually fall making them more appealing for broader use.

But as with all new technologies there are benefits and limitations.

This article is just a brief overview of the SSDs starting with the basic technology, floating-gate transistors, so that we can understand the source of the limitations (and benefits) of SSD drives.

Using floating-gate transistors the storage is built into pages, then blocks, then planes, and finally into chips and drives.

The benefits of SSD drives themselves have been discussed fairly pervasively:
  • The seek times are extremely small because there are no mechanical parts. This gives SSD drives amazing IOPS performance.
  • The performance is asymmetrical in reads and writes with reads being amazingly fast and writes not so fast but still with very good performance.
  • While not discussed in this article, because there are no moving parts in the drive there is no danger of the drive head impacting the platters causing the loss of data.
This article focused a bit more on the limitations of SSD drives that are a result of the floating-gate transistors but are also a result of the design of the NAND Flash arrays.

These limitations are:
  • The performance is asymmetrical in reads and writes with reads being amazingly fast and writes not so fast but still with very good performance (this is both a feature and a limitation for SSDs).
  • Floating-gate transistors, and subsequently SSD drives, have a limited number of erase/program cycles after which they are incapable of storing any data. SLC cells have about a 100,000 cycle limit while MLC cells have about a 5,000-10,000 cycle limit.
  • Due to the construction of the NAND Flash chips, data can only be erased in block units (512KB) but can be written in page (4KB) units.
As pointed out, some of these limitations give rise to problems in SSD drives. But SSD drive manufacturers are moving to address these problems, as discussed in the article.

SSD drives are without a doubt very “cool” technology that can help solve some IO and storage challenges.

But before buying a fairly expensive drive, it is good to understand the limitations of the technology so you can make an informed decision.

Equally important is that understanding the limitations will help you understand any test results, either good or bad, for your workloads.
