I just read an interesting paper titled An Analysis of Data Corruption in the Storage Stack . It contains an analysis of the data from 1,530,000 disks running at NetApp customer sites. The amount of corruption is worrying, as is the amount of effort that is needed to detect them.
NetApp devices have regular “RAID scrubbing” which involves reading all data on all disks at some quiet time and making sure that the checksums match. They also store checksums of all written data. For “Enterprise” disks each sector stores 520 bytes, which means that a 4K data block is comprised of 8 sectors and has 64 bytes of storage for a checksum. For “Nearline” disks 9 sectors of 512 bytes are used to store a 4K data block and it’s checksum. These 64byte checksum includes the identity of the block in question, the NetApp WAFL filesystem writes a block in a different location every time, this allows the storage of snapshots of old versions and also means that when reading file data if the location that is read has data from a different file (or a different version of the same file) then it is known to be corrupt (sometimes writes don’t make it to disk). Page 3 of the document describes this.
Page 13 has an analysis of error location and the fact that some disks are more likely to have errors at certain locations. They suggest configuring RAID stripes to be staggered so that you don’t have an entire stripe covering the bad spots on all disks in the array.
One thing that was not directly stated in the article is the connection between the different layers. On a Unix system with software RAID you have a RAID device and a filesystem layer on top of that, and (in Linux at least) there is no way for a filesystem driver to say “you gave me a bad version of that block, please give me a different one”. Block checksum errors at the filesystem level are going to be often caused by corruption that leaves the rest of the RAID array intact, this means that the RAID stripe will have a mismatching checksum. But the RAID driver won’t know which disk has the error. If a filesystem did checksums on metadata (or data) blocks and the chunk size of the RAID was greater than the filesystem block size then when the filesystem detected an error a different version of the block could be generated from the parity.
NetApp produced an interesting guest-post on the StorageMojo blog . One point that they make is that Nearline disks try harder to re-read corrupt data from the disk. This means that a bad sector error will result in longer timeouts, but hopefully the data will be returned eventually. This is good if you only have a single disk, but if you have a RAID array it’s often better to just return an error and allow the data to be retrieved quickly from another disk. NetApp also claim that “Given the realities of today’s drives (plus all the trends indicating what we can expect from electro-mechanical storage devices in the near future) – protecting online data only via RAID 5 today verges on professional malpractice“, it’s a strong claim but they provide evidence to support it.
Another relevant issue is the size of the RAID device. Here is a post that describes the issue of the Unrecoverable Error Rate (UER) and how it can impact large RAID-5 arrays . The implication is that the larger the array (in GB/TB) the greater the need for RAID-6. It has been regarded for a long time that a larger number of disks in the array drove a greater need for RAID-6, but the idea that larger disks in a RAID array gives a greater need for RAID-6 is a new idea (to me at least).
Now I am strongly advising all my clients to use RAID-6. Currently the only servers that I run which don’t have RAID-6 are legacy servers (some of which can be upgraded to RAID-6 – HP hardware RAID is really good in this regard) and small servers with two disks in a RAID-1 array.