Some Ideas About Storage Reliability

Hard Drive Brands

When people ask for advice about what storage to use they often get answers like “use brand X, it works well for me and brand Y had a heap of returns a few years ago”. I’m not convinced there is any difference between the small number of manufacturers that are still in business.

One problem we face with reliability of computer systems is that the rate of change is significant, so every year there will be new technological developments to improve things and every company will take advantage of them. Storage devices are unique among computer parts in their requirement for long-term reliability. For most other parts in a computer system a fault that involves total failure is usually easy to fix, and even a fault that causes unreliable operation usually won’t spread its damage too far before being noticed (except in corner cases like RAM corruption causing corrupted data on disk).

Every year each manufacturer will bring out newer disks that are bigger, cheaper, faster, or all three. Those disks will be expected to remain in service for 3 years in most cases, and for consumer disks often 5 years or more. The manufacturers can’t test the new storage technology for even 3 years before releasing it, so their ability to prove its reliability is limited. Maybe you could buy some 8TB disks now that were manufactured to the same design as was used 3 years ago, but if you buy 12TB consumer grade disks, 20TB+ data center disks, or any other device that is pushing the limits of new technology then you know that the manufacturer never tested it running for as long as you plan to run it. Generally the engineering is done well and they don’t have many problems in the field. Sometimes a new range of disks has a significant number of defects, but that doesn’t mean the next series of disks from the same manufacturer will have problems.

The issues with SSDs are similar to the issues with hard drives, but a little different. I’m not sure how much of the recent improvement in SSDs has been due to new technology and how much is due to new manufacturing processes. I had a bad experience with a nameless brand SSD a couple of years ago and now stick to the better known brands. So for SSDs I don’t expect a great quality difference between devices that have the names of major computer companies on them, but stuff that comes from China with the name of the discount web store stamped on it is always a risk.

Hard Drive vs SSD

A few years ago some people were still avoiding SSDs due to the perceived risk of new technology. The first problem with this is that hard drives have lots of new technology in them. The next issue is that hard drives often have some sort of flash storage built in; presumably an “SSHD” or “Hybrid Drive” gets all the potential failure modes of both hard drives and SSDs.

One theoretical issue with SSDs is that filesystems have been designed (in theory at least) to cope with hard drive failure modes rather than SSD failure modes. The problem with that theory is that most filesystems don’t cope with data corruption at all. If you want to avoid losing data when a disk returns bad data and claims it to be good then you need to use ZFS, BTRFS, the NetApp WAFL filesystem, Microsoft ReFS (with the optional file data checksum feature enabled), or Hammer2 (which wasn’t production ready last time I tested it).
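
To illustrate the general idea (a minimal conceptual sketch in Python, not the on-disk format of any of those filesystems), a checksumming filesystem stores a checksum for every block in the metadata that points to it and verifies it on every read, so a device returning bad data while claiming it to be good is detected instead of being silently passed to the application:

    # Minimal sketch of per-block checksum verification, the core idea behind
    # ZFS/BTRFS style corruption detection.  Purely illustrative.
    import hashlib

    def block_checksum(data: bytes) -> bytes:
        """Checksum that would be stored in the metadata pointing at this block."""
        return hashlib.sha256(data).digest()

    def verify_block(data: bytes, stored_checksum: bytes) -> bytes:
        """Verify a block that was read from disk before returning it."""
        if block_checksum(data) != stored_checksum:
            raise IOError("checksum mismatch: device returned bad data claiming it to be good")
        return data

    csum = block_checksum(b"important data")
    verify_block(b"important data", csum)        # passes
    try:
        verify_block(b"importent data", csum)    # simulate a corrupted block
    except IOError as e:
        print(e)

With RAID-1 the filesystem can then read the other copy and rewrite the bad block, which is how BTRFS handled the read errors I mention below.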

Some people are concerned that their filesystem won’t support “wear levelling” for SSD use. When a flash storage device is exposed to the OS via a block interface like SATA there isn’t much possibility of the OS doing wear levelling. If flash storage exposes that level of hardware detail to the OS then you need a filesystem like JFFS2 to use it. I believe that most SSDs have something like JFFS2 inside the firmware and use it to expose what looks like a regular block device.
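
As a rough illustration of why the drive can do this internally (a toy model in Python, nothing like real SSD firmware, which also has to handle garbage collection, bad blocks, and power loss), the firmware keeps a logical-to-physical mapping and directs every write of a logical block to a fresh, lightly worn flash page, so the OS can keep rewriting the same block address without wearing out one spot:

    # Toy flash translation layer (FTL) showing why wear levelling happens
    # inside the drive rather than in the filesystem.  Purely illustrative.
    class ToyFTL:
        def __init__(self, physical_pages: int):
            self.mapping = {}                       # logical block -> physical page
            self.pages = {}                         # physical page -> data
            self.erase_counts = [0] * physical_pages
            self.free_pages = list(range(physical_pages))

        def write(self, logical_block: int, data: bytes):
            # Pick the least-worn free page instead of overwriting in place.
            page = min(self.free_pages, key=lambda p: self.erase_counts[p])
            self.free_pages.remove(page)
            self.pages[page] = data
            old = self.mapping.get(logical_block)
            if old is not None:
                # The superseded page is erased and returned to the free pool.
                self.erase_counts[old] += 1
                self.free_pages.append(old)
            self.mapping[logical_block] = page

        def read(self, logical_block: int) -> bytes:
            return self.pages[self.mapping[logical_block]]

    ftl = ToyFTL(physical_pages=8)
    for i in range(100):
        ftl.write(0, b"hot data %d" % i)   # the OS rewrites the same block address
    print(ftl.erase_counts)                # the wear is spread across all the pages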

Another common concern about SSDs is that they will wear out from too many writes. Lots of people are using SSDs for the ZIL (ZFS Intent Log) on the ZFS filesystem, which means that those SSDs become the write bottleneck for the system and in some cases are run that way 24*7. If there were a problem with SSDs wearing out I would expect ZFS users to be complaining about it. Back in 2014 I wrote a blog post about whether swap would break SSD [1] (conclusion – it won’t). Apart from the nameless brand SSD I mentioned previously, all of my SSDs are still in service. I have recently had a single Samsung 500G SSD give me 25 read errors (which BTRFS recovered from the other Samsung SSD in the RAID-1); I have yet to determine if this is an ongoing issue with the SSD in question or a transient thing. I also had a 256G SSD in a Hetzner DC give 23 read errors a few months after it gave a SMART alert about “Wear_Leveling_Count” (old age).
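
As a back-of-the-envelope example of why write endurance is rarely the limiting factor (the numbers here are hypothetical, check the TBW rating of your actual device and measure your actual write volume):

    # Rough SSD endurance estimate.  The TBW rating and daily write volume
    # below are hypothetical examples, not figures for any specific drive.
    tbw_rating_tb = 300      # rated terabytes written, e.g. for a 500G consumer SSD
    daily_writes_gb = 50     # a fairly heavy desktop or light server workload

    years = (tbw_rating_tb * 1000) / daily_writes_gb / 365
    print(f"~{years:.0f} years to exhaust the rated write endurance")   # ~16 years

A dedicated ZIL device will see a lot more writes than this example, but the same arithmetic applies.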

Hard drives have moving parts and are therefore inherently more susceptible to vibration than SSDs, and they are also more likely to cause vibration-related problems in other disks. I will probably write a future blog post about disks that work in small arrays but not in big arrays.

My personal experience is that SSDs are at least as reliable as hard drives even when run in situations where vibration and heat aren’t issues. Vibration or a warm environment can cause data loss from hard drives in situations where SSDs will work reliably.

NVMe

I think that NVMe isn’t very different from other SSDs in terms of the actual storage. But the different interface gives some interesting possibilities for data loss. OS, filesystem, and motherboard bugs are all potential causes of data loss when using a newer technology.

Future Technology

The latest thing for high end servers is Optane Persistent Memory [2], also known as DCPMM. This is NVRAM that fits in a regular DDR4 DIMM socket and gives performance somewhere between NVMe and RAM with capacity similar to NVMe. One of the ways of using it is “Memory Mode”, where the DCPMM is seen by the OS as RAM and the actual RAM caches the DCPMM (essentially this is swap space at the hardware level); this could make multiple terabytes of “RAM” not ridiculously expensive. Another way of using it is “App Direct Mode”, where the DCPMM can either be a simulated block device for regular filesystems or a byte addressable device for application use. The final option is “Mixed Memory Mode”, which has some DCPMM in “Memory Mode” and some in “App Direct Mode”.
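
As a sketch of what byte addressable “App Direct Mode” use looks like from an application (the mount point and file name below are hypothetical, this assumes the DCPMM has been set up with a DAX capable filesystem, and a real application would use something like PMDK to get the flushing and crash consistency right):

    # Minimal sketch of byte-addressable access to persistent memory via mmap.
    # /mnt/pmem is a hypothetical mount point for a DAX capable filesystem.
    import mmap, os

    path = "/mnt/pmem/example.dat"
    size = 4096

    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    os.ftruncate(fd, size)
    with mmap.mmap(fd, size) as m:
        m[0:5] = b"hello"   # plain byte-level stores, no read()/write() system calls
        m.flush()           # ask for the stores to reach the persistent media
    os.close(fd)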

This has much potential for backup use, and to make things extra exciting “App Direct Mode” has RAID-0 but no other form of RAID.

Conclusion

I think that the best things to do for storage reliability are to have ECC RAM to avoid corruption before the data gets written, use reasonable quality hardware (buy stuff with a brand that someone will want to protect), and avoid new technology. New hardware and new software needed to talk to new hardware interfaces will have bugs and sometimes those bugs will lose data.

Filesystems like BTRFS and ZFS are needed to cope with storage devices returning bad data and claiming it to be good; this is a very common failure mode.

Backups are a good thing.

8 comments to Some Ideas About Storage Reliability

  • Gabriel

    Hi,

    besides ZFS and BTRFS there’s also dm-integrity which should detect (and repair, combined with md) data corruption. Haven’t used it myself and I have no idea if it’s production-ready, but it would allow one to protect the data with regular file systems like ext4.

  • Gabriel: dm-integrity is a good thing; it’s always good to have more choices about such things. The external integrity feature means you could have a couple of SSDs storing integrity data for a large number of hard drives for efficient use of hardware.

    But it’s not something I would want to use. ZFS and BTRFS have other features that provide real benefits.

    ZFS has RAID-Z, which at a minimum is RAID-5 with checksums, and also RAID-Z2 and RAID-Z3, the equivalents of RAID-6 and triple-parity RAID. ZFS also has extra copies of metadata and the option of extra copies of certain important data.

    Both ZFS and BTRFS support snapshots, which are an easy (or at least easier) feature to provide given the copy-on-write approach they use for data integrity.

    It seems that dm-integrity doesn’t cater for the case where a filesystem writes a data block and then writes metadata to point to it and doesn’t want the metadata to exist on disk without the data. I get the impression that if a data write and a write to the integrity data were lost at the same time then afterwards the system wouldn’t realise this. With ZFS and BTRFS every metadata block has checksums of the metadata and/or data blocks it points to, so the integrity of the entire tree can be verified all the way up to the superblock.
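
    To make that concrete, here is a conceptual sketch (in Python, not the real on-disk format of either filesystem) of why a tree of checksums rooted at the superblock catches a lost write: the parent block stores the checksum of the child it points to, so a child that never made it to disk makes verification fail when walking down from the superblock.

        # Conceptual sketch of a checksum tree, as used in spirit by ZFS and
        # BTRFS.  Not the real on-disk format of either filesystem.
        import hashlib

        def csum(data: bytes) -> bytes:
            return hashlib.sha256(data).digest()

        disk = {}                                    # block address -> contents on disk
        data = b"file contents"
        disk["data"] = data
        disk["metadata"] = b"points to data:" + csum(data)
        disk["superblock"] = b"points to metadata:" + csum(disk["metadata"])

        # Simulate a lost data write: the metadata reached the disk, the data didn't.
        disk["data"] = b"stale contents from before the write"

        # Verification walks down from the superblock and notices the mismatch.
        assert disk["superblock"].endswith(csum(disk["metadata"]))   # metadata verifies
        assert not disk["metadata"].endswith(csum(disk["data"]))     # lost write detected
        print("lost data write detected via the checksum tree")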

  • Gabriel

    My knowledge of dm-integrity is superficial at best (basically just reading a few blog posts online). The appeal of dm-integrity for home use is that it would allow one to add bit-rot detection/protection to RAID5/RAID6 setups. My NAS has disks of various sizes which I’ve partitioned in order to use all the available space. So the stack looks like this: physical disk -> partition -> md -> luks (for encryption) -> lvm -> filesystem. By adding dm-integrity, one should get all the protections provided by BTRFS/ZFS and keep the flexibility to use RAID5/6 (which is not production ready on BTRFS and a work in progress on ZFS).

    Regarding the data/metadata issue you mention, I have no idea. As I’ve said, I’m only vaguely familiar with dm-integrity, but thanks for pointing it out!

  • The ZFS version of RAID-5 is called RAID-Z and has been production ready for 10+ years. RAID-Z will give more integrity and performance than RAID-5 with dm-integrity, and RAID-Z2 is also better than RAID-6.

  • Gabriel

    Apologies, when I said “work in progress on ZFS”, I was thinking of RAIDz expansion.

    The reason I chose lvm+mdadm when I built my NAS was that 1) it allowed me to use disks of different sizes efficiently and 2) it allowed me to grow it easily. I knew the downside would be giving up integrity checks.

    Looking forward to seeing both ZFS and BTRFS evolve.

  • RAID-Z expansion by replacing disks with bigger disks and then growing the array has always been supported. Expansion by adding another RAID-Z to the zpool has also always been supported. Adding a single disk to a RAID-Z and making it a RAID-Z across more disks has never been supported.

    For using disks of different sizes BTRFS has better support. My home server has a BTRFS RAID-1 array of 3*4TB disks; it’s starting to run a bit low on space, so I’ll either replace a 4TB disk with an 8TB one or add an extra 4TB disk, and BTRFS makes both options easy.

  • Gabriel

    When I built my home server, I started with 3 disks: one 750GB (sda), one 1.5TB (sdb) and one 2TB (sdc). I partitioned the 2nd into two 750GB partitions (sdb1 and sdb2) and the 3rd one into 2x750GB (sdc1 and sdc2) and 1x500GB (sdc3). The first RAID was a RAID5 across sd[abc]1, the 2nd RAID was a RAID1 across sd[bc]2, and sdc3 was a RAID1 with just 1 disk. This way, all available space was used and I got as much redundancy as possible. Later, I added another 2TB disk (sdd) which was divided like the other 2TB disk. The RAID5 (sdX1) was expanded to 4 partitions, the RAID1 (sdX2) was converted to RAID5, and the RAID1 (sdX3) with just one disk got redundancy.

    Would it have been possible to do the same (or better?) with ZFS or BTRFS (keeping in mind the goal was to not have any “dead” space and to be able to grow later)?

  • Gabriel: ZFS doesn’t have good support for adding extra disks into a running system unless you add them in a group. I once had a system with 9 disks in a RAID-Z and then added another 9 disks in another RAID-Z, that worked well, but adding 1 at a time wouldn’t have worked. BTRFS has great support for adding a disk at a time, removing a disk, or resizing just one disk and balancing things around, but BTRFS RAID-5 is not usable at this time and there are no projections for when it will be.