PC Engines NAND flash technical background



Summary	NAND flash is an attractive and popular form of data storage, but is not without pitfalls. The following information was written with CompactFlash cards in mind, but you can expect the controllers in SD cards to have similar limitations. SATA controllers as used in mSATA and 2.5" SSDs have more memory, and don't suffer so much from slow random writes / write amplification.
NAND flash structure	Typical flash devices as used in cf2slc / cf4slc / cf8slc are structured as 2KB pages (with 64 bytes spare data for ECC + management data). 64 of these pages make up an erase block of 128 KB. Pages can be written individually, but an entire block must be erased at a time.
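The geometry above is easy to check with a few lines of arithmetic. A minimal sketch in Python; the constant and function names are made up for illustration, and the block count of 16384 is the figure quoted in the ECC section of this page for the 2 GB device.

```python
# Geometry of the flash device described above (names are illustrative only).
PAGE_DATA_BYTES = 2 * 1024    # 2 KB of user data per page
PAGE_SPARE_BYTES = 64         # spare area for ECC + management data
PAGES_PER_BLOCK = 64          # pages per erase block
BLOCKS_PER_DEVICE = 16384     # figure quoted for the 2 GB device


def erase_block_bytes():
    """User data held by one erase block."""
    return PAGE_DATA_BYTES * PAGES_PER_BLOCK


def device_bytes():
    """User data held by the whole device, before spares are set aside."""
    return erase_block_bytes() * BLOCKS_PER_DEVICE


def spare_overhead():
    """Spare area as a fraction of the user data in a page."""
    return PAGE_SPARE_BYTES / PAGE_DATA_BYTES


print(erase_block_bytes() // 1024, "KB per erase block")
print(device_bytes() // 2**30, "GB per device")
print(f"{spare_overhead():.1%} spare area per page")
```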
Write endurance	NAND flash must be erased before it can be rewritten. The cycle life depends on the flash technology, and becomes worse with each process generation. SLC flash (one bit per cell, used in our CF cards) can do about 30k to 100k cycles per erase block. MLC flash (two bits per cell, used in our SD cards and mSATA SSDs) should be good for about 1k to 10k cycles per block. TLC flash (three bits per cell) should support a few hundred cycles per block.
Why not use SLC flash for large cards ?	SLC flash is about 3.5x the price per bit for large capacities. MLC flash can be used in pseudo SLC mode (one bit per cell, half capacity) to get most of the benefits of SLC.
Write amplification	Erase blocks are 64 KB or more, so with a simple-minded controller a single byte change can require a full block erase and rewrite. This means that random or piecemeal writes can be quite expensive in terms of flash wear.
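To put a number on this, here is a rough sketch of the worst case. It assumes the simple-minded controller described above, which rewrites every erase block a write touches, and a write that starts on a block boundary. The function name and the 64 KB default are illustrative assumptions, not a description of any particular controller.

```python
def worst_case_amplification(bytes_changed, erase_block_bytes=64 * 1024):
    """Flash bytes rewritten per host byte changed, for a controller that
    rewrites whole erase blocks (assumption: write starts on a block boundary)."""
    if bytes_changed <= 0:
        raise ValueError("bytes_changed must be positive")
    # Round up to the number of whole erase blocks touched by the write.
    blocks_touched = -(-bytes_changed // erase_block_bytes)
    return blocks_touched * erase_block_bytes / bytes_changed


for size in (1, 512, 4096, 64 * 1024):
    factor = worst_case_amplification(size)
    print(f"{size:6d} bytes changed -> amplification {factor:g}x")
```

Real controllers do better than this by deferring the consolidation, but the trend is the same: the smaller the write, the worse the ratio.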
Wear leveling	CF controllers perform wear leveling to spread the erase cycles across multiple blocks, so frequently written blocks such as directories or file allocation tables don't wear out prematurely. Wear leveling algorithms are proprietary and undocumented - "secret sauce".
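Cycle life, write amplification and wear leveling together give a rough upper bound on how long a card can last. The sketch below assumes ideal wear leveling (every block wears evenly) and a known, constant write amplification factor. Real cards guarantee neither, so treat the result as an optimistic estimate. The function and parameter names are hypothetical.

```python
def lifetime_days(capacity_bytes, cycles_per_block, host_bytes_per_day,
                  write_amplification=1.0):
    """Optimistic lifetime estimate in days.

    Assumptions (not guaranteed by any real card): perfectly even wear
    across all blocks, and a constant write amplification factor.
    """
    if host_bytes_per_day <= 0 or write_amplification <= 0:
        raise ValueError("write volume and amplification must be positive")
    total_flash_bytes = capacity_bytes * cycles_per_block
    flash_bytes_per_day = host_bytes_per_day * write_amplification
    return total_flash_bytes / flash_bytes_per_day


GB = 2**30
# 4 GB SLC card at the low end of the quoted range (30k cycles), 1 GB written per day.
print(lifetime_days(4 * GB, 30_000, 1 * GB))                          # sequential writes
print(lifetime_days(4 * GB, 30_000, 1 * GB, write_amplification=32))  # small random writes
```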
Read disturb	Reading is normally considered not to wear out the flash device. On recent MLC devices this is no longer a safe assumption - read disturb may require occasional rewriting of data. So much for the "read only" file system...
ECC	NAND flash is NOT guaranteed to be error free. The controller must implement ECC to allow error recovery, and replace bad blocks with spares. For example, on the 2 GB flash device used in cf2slc / cf4slc / cf8slc, only 16064 out of 16384 erase blocks are guaranteed to be good at the end of the life cycle of the chip. The controller reserves additional spare sectors for mapping and internal operation - this is why you never get the full capacity.
How to corrupt CF cards ?	Easy... Start a write, and immediately (less than 1 second or so) do a system reset. Even if the sector gets written very quickly, the controller inside the CF card may still be busy with internal housekeeping. If you are unlucky some internal data structures will end up in a corrupted state. To avoid this, please ensure some delay (a few seconds) between sync and reboot. The flash controller may have some recovery procedure to clean up such an inconsistent state, but this takes time and may result in a BIOS time-out (error message: no boot device found).
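The "sync, then wait, then reboot" advice can be wrapped in a small helper. This is a minimal sketch for a Unix-like system; the function name and the 3 second default are assumptions chosen for illustration, and the right delay depends on the card.

```python
import os
import time


def sync_and_settle(settle_seconds=3.0, sync=os.sync, sleep=time.sleep):
    """Flush dirty buffers, then give the card controller time to finish.

    settle_seconds is an assumed grace period ("a few seconds"); the sync and
    sleep parameters exist only so the helper can be exercised without
    touching the disk.
    """
    sync()                 # ask the kernel to write out dirty buffers
    sleep(settle_seconds)  # leave time for the controller's internal housekeeping
    return settle_seconds
```

Call it right before triggering the reset, rather than resetting directly after the last write.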
CF card performance	Test results for our cards were measured using HD Tune Pro software. As you can see, read performance is excellent, over 3000 IOPS for 4 KB random reads on the 4 GB / 8 GB cards. Sequential write performance is adequate (over 10 MB/s for 1 MB transfers). On the other hand, random write performance is poor (on the order of 18 to 24 IOPS for 4 KB random writes). Interestingly, the 8 GB card is slower than the 4 GB card; I believe this is caused by larger management blocks, used to keep the allocation map size within reason. By the way, we measured between 2 and 5 random write IOPS on MLC based cards - which is why we no longer sell them.
Why are sequential writes fast ?	Most flash cards are optimized for this. Typical scenario - store a photo or video to flash. Besides, writing large amounts of data is easy - take a free block, write data to it, erase the block that was replaced.
Why are random writes so slow ?	The flash controller on a CF card has a very small RAM buffer, and thus cannot store a fine-grained block allocation map. Typically the management block equals one erase block, or even 2 or 4 blocks if multiple flash devices are interleaved for faster sequential performance. For a single sector write, the following may happen: The new data is written to a new block (often called the child block). The block holding the previous version is called the mother block. These blocks can coexist for a while, but have to be consolidated at some point by copying the unmodified pages from the mother block over to the child block. Then the mother block can be erased and reused. Please note that piecemeal writing (e.g. appending to a log file) in units of less than 512 bytes (or less than 2 KB, depending on the controller and flash) can get very time-consuming.
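The mother / child consolidation can be modelled in a few lines. This is a toy model only, assuming 64 pages per block and a controller that tracks data at erase block granularity; real controllers are proprietary and will differ. Blocks are plain Python lists, with None marking a page that has not been written yet.

```python
PAGES_PER_BLOCK = 64  # assumed geometry: 64 pages per erase block


def consolidate(mother, child):
    """Copy every page the child lacks from the mother block.

    Returns the merged block and the number of pages that had to be copied.
    Afterwards the mother block can be erased and reused.
    """
    merged = list(child)
    pages_copied = 0
    for index, page in enumerate(mother):
        if merged[index] is None and page is not None:
            merged[index] = page
            pages_copied += 1
    return merged, pages_copied


# The mother block holds the old data; the host rewrites a single page.
mother = [f"old{i}" for i in range(PAGES_PER_BLOCK)]
child = [None] * PAGES_PER_BLOCK
child[5] = "new5"

merged, pages_copied = consolidate(mother, child)
print(f"1 page written by the host, {pages_copied} pages copied, 1 block erased")
```

In this model one page of host data costs 64 page programs and one block erase, which is why small random writes are so much slower than sequential ones.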
Unaligned writes	Logical sectors are 512 bytes. Flash pages are 2 or 4 KB. With a typical geometry of 63 sectors per track, the first partition usually starts at sector 63 (byte offset 32256), which is not a multiple of the page size, so partitions may end up misaligned. This is not much of a penalty for read access, but on write access the performance hit may be substantial. Modern hard drives with 4 KB physical blocks run into the same issue with older operating systems such as Windows XP.
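A quick way to see the problem is to compute which flash pages a write touches. The sketch below assumes 512 byte sectors and a 4 KB flash page by default; the function names are made up for illustration.

```python
SECTOR_BYTES = 512


def is_aligned(start_sector, page_bytes=4096):
    """True if a partition starting at start_sector begins on a page boundary."""
    return (start_sector * SECTOR_BYTES) % page_bytes == 0


def pages_touched(offset_bytes, length_bytes, page_bytes=4096):
    """Number of flash pages covered by a write of length_bytes at offset_bytes."""
    first_page = offset_bytes // page_bytes
    last_page = (offset_bytes + length_bytes - 1) // page_bytes
    return last_page - first_page + 1


for start in (63, 64, 2048):
    offset = start * SECTOR_BYTES
    print(f"partition at sector {start}: aligned={is_aligned(start)}, "
          f"a 4 KB write touches {pages_touched(offset, 4096)} page(s)")
```

With a start at sector 63, every 4 KB file system block straddles two flash pages, so each write costs two page updates instead of one.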
How to work with flash, rather than against it ?	Make sure that partitions are properly aligned. Mount file systems as noatime. Updating the time of last access means a lot of unnecessary random writes. Live within your means (of DRAM), don't swap to disk. Preallocate files, avoid sparse files. This minimizes random writes for file system metadata. Store temporary files and logs on a RAM disk (e.g. tmpfs). If you need persistent log data, consider copying this data to the flash disk in larger chunks. Consider circular logs. Keep indexes in RAM rather than on disk. A single B-tree index update may result in multiple disk writes. Trade-off - startup will take a bit longer as you have to scan all records to build the index. Consider database structures that use a sequentially written journal instead of random writes. Avoid synchronous disk writes. With normal (asynchronous) writes the buffer cache can combine multiple writes to the same disk block, e.g. when appending data to a file.
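As an illustration of the "larger chunks" advice for persistent logs, here is a minimal sketch of a logger that collects lines in RAM and appends them to the flash disk only once a threshold is reached. The class name, the 128 KB default and the flush policy are assumptions for illustration. Anything still in the buffer is lost on a power failure, so this is a trade-off between flash wear and durability.

```python
class ChunkedLog:
    """Buffer log lines in RAM and append them to a file in large chunks."""

    def __init__(self, path, chunk_bytes=128 * 1024):
        self.path = path
        self.chunk_bytes = chunk_bytes  # assumed threshold, roughly one erase block
        self._pending = []
        self._pending_bytes = 0
        self.flush_count = 0

    def write(self, line):
        """Queue one log line; flush once enough data has accumulated."""
        data = (line.rstrip("\n") + "\n").encode("utf-8")
        self._pending.append(data)
        self._pending_bytes += len(data)
        if self._pending_bytes >= self.chunk_bytes:
            self.flush()

    def flush(self):
        """Append all queued lines to the file in a single write."""
        if not self._pending:
            return
        with open(self.path, "ab") as log_file:
            log_file.write(b"".join(self._pending))
        self._pending.clear()
        self._pending_bytes = 0
        self.flush_count += 1
```

Typical use would be to create one ChunkedLog per log file, call write() for every line, and call flush() once more before a clean shutdown.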
Are you in sync ?	Common lore says that after sync all updated buffers are written to disk. That may not be the case - if you look at the man page for sync you will find that the standard only requires sync to schedule the dirty blocks for writing; it may return before these writes complete. Recent Linux kernels also seem to have some problems in this area... Doing a simple sleep 5 after sync may not be enough to ensure proper writeback. Writing back 1 MB of dirty data (256 blocks of 4 KB) may require 10 seconds if random writes are required and the flash card can handle 25 IOPS. If the flash card is based on MLC flash, and your buffer cache is sufficiently big and dirty, sync could easily take an hour in pathological cases (all random accesses).
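The writeback times quoted above follow from simple arithmetic. A minimal sketch, assuming every dirty 4 KB block turns into one random write; the function name is hypothetical, and the MLC example uses an assumed 64 MB of dirty data at 5 IOPS, the top of the range we measured on MLC based cards.

```python
def writeback_seconds(dirty_bytes, iops, io_bytes=4096):
    """Time to write back dirty data if every io_bytes block is a random write."""
    if iops <= 0:
        raise ValueError("iops must be positive")
    writes = -(-dirty_bytes // io_bytes)  # round up to whole blocks
    return writes / iops


MB = 2**20
print(writeback_seconds(1 * MB, 25))        # SLC card: about 10 seconds
print(writeback_seconds(64 * MB, 5) / 60)   # MLC card, assumed 64 MB dirty: minutes
```

At 5 IOPS, 64 MB of scattered dirty blocks already takes close to an hour to write back.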
pdflush configuration	When disk blocks are updated, they are first written to the buffer cache in memory. They are then written back to disk by a kernel process (pdflush). This process scans the buffer cache and looks for "dirty" (modified) blocks that have been sitting around for a certain time. The advantage of this is that blocks can be combined for sequential writes, or might never make it to disk in the first place. A good description can be found here. /proc/sys/vm/dirty_expire_centisecs -> default = 3000, which means that dirty blocks expire after 30 seconds. I think 500 (5 seconds) would be more reasonable for a small system running on a flash disk. Do you really want your data to be exposed to power failures for 30 seconds ? /proc/sys/vm/dirty_bytes -> default = 0 (the limit is then taken from dirty_ratio instead). I don't think it is a good idea to let too much dirty data accumulate in the buffer cache. Something like 1000000 (1 MB) seems more reasonable to me. /proc/sys/vm/dirty_writeback_centisecs -> default = 500, which means that the pdflush process is kicked off every 5 seconds. I think 100 (1 second) is a bit more reasonable. If in doubt, consider adding a disk activity LED to your ALIX board (can be added quite easily, see page 6 of the board schematics).
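The three settings above can be inspected and changed from a short script. This is a sketch only: the values are the suggestions from the text, not universally correct ones, the function names are made up, and writing under /proc/sys/vm requires root. The base directory is a parameter so the functions can be tried against a scratch directory first.

```python
import os

# Values suggested in the text above (assumptions, tune for your own system).
SUGGESTED = {
    "dirty_expire_centisecs": 500,      # dirty blocks expire after 5 seconds
    "dirty_bytes": 1000000,             # allow at most about 1 MB of dirty data
    "dirty_writeback_centisecs": 100,   # run the writeback process every second
}


def read_settings(base="/proc/sys/vm"):
    """Return the current value of each setting listed in SUGGESTED."""
    values = {}
    for name in SUGGESTED:
        with open(os.path.join(base, name)) as setting:
            values[name] = int(setting.read().strip())
    return values


def apply_settings(base="/proc/sys/vm", settings=SUGGESTED):
    """Write the given settings; needs root when base is /proc/sys/vm."""
    for name, value in settings.items():
        with open(os.path.join(base, name), "w") as setting:
            setting.write(f"{value}\n")
```

Settings written this way do not survive a reboot, so they would have to be reapplied from a startup script.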
Inspiration for flash file systems	LogFS; Article on Microsoft FlashStore; USB SuperCharger; Self destructing flash drives (interesting article about forensics issues on SSDs); NVM Express - coming soon to a PCI slot near you.
Information sources	Data sheets - unfortunately the data sheets for most flash devices and controllers are only available under NDA. Issued patents and patent applications - do an assignee search, e.g. Silicon Motion. Test performance and guess what happens... If an operation takes a lot of time, chances are that erase cycles, and the wear that comes with them, are involved.
© 2002-2021 PC Engines GmbH. All rights reserved.