Archives

Categories

SSD Endurance

I previously wrote about the issue of swap potentially breaking SSD [1]. My conclusion was that swap wouldn’t be a problem as no normally operating systems that I run had swap using any significant fraction of total disk writes. In that post the most writes I could see was 128GB written per day on a 120G Intel SSD (writing the entire device once a day).

My post about swap and SSD was based on the assumption that you could get many thousands of writes to the entire device which was incorrect. Here’s a background on the terminology from WD [2]. So in the case of the 120G Intel SSD I was doing over 1 DWPD (Drive Writes Per Day) which is in the middle of the range of SSD capability, Intel doesn’t specify the DWPD or TBW (Tera Bytes Written) for that device.

The most expensive and high end NVMe device sold by my local computer store is the Samsung 980 Pro which has a warranty of 150TBW for the 250G device and 600TBW for the 1TB device [3]. That means that the system which used to have an Intel SSD would have exceeded the warranty in 3 years if it had a 250G device.

My current workstation has been up for just over 7 days and has averaged 110GB written per day. It has some light VM use and the occasional kernel compile, a fairly typical developer workstation. It’s storage is 2*Crucial 1TB NVMe devices in a BTRFS RAID-1, the NVMe devices are the old series of Crucial ones and are rated for 200TBW which means that they can be expected to last for 5 years under the current load. This isn’t a real problem for me as the performance of those devices is lower than I hoped for so I will buy faster ones before they are 5yo anyway.

My home server (and my wife’s workstation) is averaging 325GB per day on the SSDs used for the RAID-1 BTRFS filesystem for root and for most data that is written much (including VMs). The SSDs are 500G Samsung 850 EVOs [4] which are rated at 150TBW which means just over a year of expected lifetime. The SSDs are much more than a year old, I think Samsung stopped selling them more than a year ago. Between the 2 SSDs SMART reports 18 uncorrectable errors and “btrfs device stats” reports 55 errors on one of them. I’m not about to immediately replace them, but it appears that they are well past their prime.

The server which runs my blog (among many other things) is averaging over 1TB written per day. It currently has a RAID-1 of hard drives for all storage but it’s previous incarnation (which probably had about the same amount of writes) had a RAID-1 of “enterprise” SSDs for the most written data. After a few years of running like that (and some time running with someone else’s load before it) the SSDs became extremely slow (sustained writes of 15MB/s) and started getting errors. So that’s a pair of SSDs that were burned out.

Conclusion

The amounts of data being written are steadily increasing. Recent machines with more RAM can decrease storage usage in some situations but that doesn’t compare to the increased use of checksummed and logged filesystems, VMs, databases for local storage, and other things that multiply writes. The amount of writes allowed under warranty isn’t increasing much and there are new technologies for larger SSD storage that decrease the DWPD rating of the underlying hardware.

For the systems I own it seems that they are all going to exceed the rated TBW for the SSDs before I have other reasons to replace them, and they aren’t particularly high usage systems. A mail server for a large number of users would hit it much earlier.

RAID of SSDs is a really good thing. Replacement of SSDs is something that should be planned for and a way of swapping SSDs to less important uses is also good (my parents have some SSDs that are too small for my current use but which work well for them). Another thing to consider is that if you have a server with spare drive bays you could put some extra SSDs in to spread the wear among a larger RAID-10 array. Instead of having a 2*SSD BTRFS RAID-1 for a server you could have 6*SSD to get a 3* longer lifetime than a regular RAID-1 before the SSDs wear out (BTRFS supports this sort of thing).

Based on these calculations and the small number of errors I’ve seen on my home server I’ll add a 480G SSD I have lying around to the array to spread the load and keep it running for a while longer.

2 comments to SSD Endurance