As solid state drives (SSDs) take on an increased role in data centers, one of the big question marks hanging over them is durability. Hard drive mean time between failures (MTBF) is fairly well established, since the hard disk drive is more than half a century old. The SSD, at least in its modern incarnation, is only about a decade old.
The concern is over wear. Each write slowly degrades the cells; eventually a cell dies and the bit of data stored in it is lost. Lose enough cells and your SSD is a paperweight. And since SSDs serve as the fast-access tier, often called "hot storage," knowing their lifespan is all the more urgent.
There are three types of SSD memory: single-level cell (SLC), which stores one bit per cell; multi-level cell (MLC), which stores two bits per cell; and triple-level cell (TLC), which holds three bits per cell. SLC is thought to be more durable than MLC, by as much as 10 times, because storing a single bit per cell leaves wider margins between voltage states and tolerates more program/erase cycles. Most if not all enterprise-oriented SSDs use NAND flash memory of SLC design, which is more expensive than MLC or TLC.
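The bits-per-cell distinction can be sketched in a few lines of Python (a minimal illustration; the mapping and comments are mine, not from any SSD specification):

```python
# Each extra bit per cell doubles the number of voltage states the
# controller must distinguish, shrinking the margin between states --
# one reason endurance tends to drop as bits per cell rise.
CELL_TYPES = {
    "SLC": 1,  # single-level cell: 1 bit per cell
    "MLC": 2,  # multi-level cell: 2 bits per cell
    "TLC": 3,  # triple-level cell: 3 bits per cell
}

for name, bits in CELL_TYPES.items():
    states = 2 ** bits
    print(f"{name}: {bits} bit(s)/cell -> {states} voltage states")
```

SLC's two states are simply "charged" and "empty," while TLC must reliably tell eight charge levels apart in the same physical cell.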
SLC-based SSDs, however, are not necessarily more reliable than MLC-based solid state drives. Other factors unrelated to wear affect their health.
Testing PCI Express-based SSDs, Google found that over four years of use -- and that's four years of non-stop work in Google's data centers -- anywhere from 20% to 63% of drives will develop at least one uncorrectable read error, which works out to two to six affected drives per 1,000 drive days (a drive day being one drive in service for one day). An uncorrectable read error means the data cannot be recovered.
As for write errors, depending on the model, anywhere from 1.5% to 2.5% of drives, or one to four out of every 10,000 drive days, experienced a final write error: a write operation that did not succeed even after retries. Write errors are far rarer because a failed write can usually be retried at a different location on the drive. So if a final write error does crop up, the underlying problem is already bad and widespread.
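To put those drive-day figures in perspective, here is a back-of-the-envelope conversion (my own arithmetic, not Google's) from per-drive-day rates to expected events for a single drive over a year of service:

```python
# A "drive day" is one drive in service for one day, so a single
# drive accumulates 365 drive days per year of deployment.
DAYS_PER_YEAR = 365

def events_per_drive_year(events: float, per_drive_days: float) -> float:
    """Expected error events for one drive over one year of service."""
    return events / per_drive_days * DAYS_PER_YEAR

# Uncorrectable read errors: 2 to 6 per 1,000 drive days
low = events_per_drive_year(2, 1_000)   # ~0.7 per drive-year
high = events_per_drive_year(6, 1_000)  # ~2.2 per drive-year
print(f"Read errors per drive-year: {low:.2f} to {high:.2f}")

# Final write errors: 1 to 4 per 10,000 drive days
wlow = events_per_drive_year(1, 10_000)   # ~0.04 per drive-year
whigh = events_per_drive_year(4, 10_000)  # ~0.15 per drive-year
print(f"Write errors per drive-year: {wlow:.4f} to {whigh:.4f}")
```

In other words, at the high end a drive could see roughly two uncorrectable read errors a year, while final write errors remain a once-in-several-years event even on the worst models.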
The report said that the age of a drive, the amount of time it has spent in the field, affects reliability more than wear, which is the main concern for most people. What's the difference? Deployment means NAND flash chips heating up from use and having electricity run through them inside a server. DRAM, motherboards, GPUs and even sound cards have no moving parts, yet they all eventually die from heat and the current running through them; I recently had a sound card die on me.
Google found that SLC drives do not perform better on the measures of reliability that matter most in practice, such as repair or replacement rates, and they don't typically have lower rates of non-transparent errors.
Instead, it found that performance and reliability improved with smaller lithographies. NAND has struggled to shrink below 19nm, but drives built on it are much more durable than older flash made with a 50nm lithography. That's a reflection of the age of this study: six years is an eternity in data center technology, and the differences in the NAND used were vast.
SSDs are prone to errors introduced during the chip manufacturing process, the study said, but it added that flash-based drives are in fact much more reliable than spinning disks: "Comparing with traditional hard disk drives, flash drives have a significantly lower replacement rate in the field, however, they have a higher rate of uncorrectable errors."
Now, before you freak out over your PC or laptop, don't. Four to six years in a data center equates to roughly 20 to 30 years for a home user, who isn't hammering the drive with non-stop reads and writes. But if you are concerned, grab a utility called SSD Life, which will estimate how many years your drive has left. My 3-year-old SSD is 99% healthy and projected to last until 2024.
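The 20-to-30-year figure follows from a simple duty-cycle assumption (mine, not the article's): a home machine is active perhaps five hours a day versus 24 in a data center.

```python
# A data-center drive runs 24 hours a day; assume a home machine is
# actively using its drive ~5 hours a day (an assumption made here
# purely for illustration).
DC_HOURS_PER_DAY = 24
HOME_HOURS_PER_DAY = 5
ratio = DC_HOURS_PER_DAY / HOME_HOURS_PER_DAY  # 4.8x duty cycle

for dc_years in (4, 6):
    home_years = dc_years * ratio
    print(f"{dc_years} years non-stop ~= {home_years:.0f} home-user years")
```

Four and six data-center years scale to roughly 19 and 29 home-user years under that assumption, which is where the "20 to 30 years" ballpark comes from.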