I just read an interesting paper titled An Analysis of Data Corruption in the Storage Stack [1]. It contains an analysis of data from 1,530,000 disks running at NetApp customer sites. The amount of corruption is worrying, as is the amount of effort needed to detect it.
NetApp devices do regular “RAID scrubbing”, which involves reading all data on all disks at some quiet time and making sure that the checksums match. They also store checksums of all written data. For “Enterprise” disks each sector stores 520 bytes, which means that a 4K data block comprises 8 sectors and has 64 bytes of storage for a checksum. For “Nearline” disks 9 sectors of 512 bytes are used to store a 4K data block and its checksum. The 64 byte checksum includes the identity of the block in question. The NetApp WAFL filesystem writes a block to a different location every time, which allows the storage of snapshots of old versions and also means that if the location that is read contains data from a different file (or a different version of the same file) then the block is known to be corrupt (sometimes writes don’t make it to disk). Page 3 of the document describes this.
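To make the sizes concrete, here is a quick arithmetic check of the two layouts described above (this is just my sketch of the sizes involved, not NetApp’s actual on-disk format):

BLOCK = 4096      # 4K data block
CHECKSUM = 64     # per-block checksum area

# "Enterprise" disks: 520-byte sectors, 8 per block
enterprise = 8 * 520
assert enterprise == BLOCK + CHECKSUM          # 4160 bytes, an exact fit

# "Nearline" disks: ordinary 512-byte sectors, 9 per block
nearline = 9 * 512
assert nearline >= BLOCK + CHECKSUM            # 4608 bytes, room to spare

print(enterprise - BLOCK, nearline - BLOCK)    # 64 and 512 bytes left over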
Page 13 has an analysis of error location and the fact that some disks are more likely to have errors at certain locations. They suggest configuring RAID stripes to be staggered so that you don’t have an entire stripe covering the bad spots on all disks in the array.
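The staggering idea can be shown with a trivial sketch (my own illustration of the concept, not the algorithm from the paper): offset each disk’s mapping so that a given stripe doesn’t land on the same physical region of every disk.

DISK_BLOCKS = 1000      # hypothetical blocks per disk
STAGGER = 100           # hypothetical per-disk offset, in blocks

def physical_block(disk, stripe):
    """Physical block on `disk` holding its chunk of `stripe`."""
    return (stripe + disk * STAGGER) % DISK_BLOCKS

# Without staggering, stripe 0 would sit at block 0 on every disk; with
# it, an error-prone region (say blocks 0-99) only overlaps one disk's
# chunk of any given stripe.
for disk in range(4):
    print(disk, physical_block(disk, stripe=0))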
One thing that was not directly stated in the article is the connection between the different layers. On a Unix system with software RAID you have a RAID device with a filesystem layer on top of it, and (in Linux at least) there is no way for a filesystem driver to say “you gave me a bad version of that block, please give me a different one”. Block checksum errors at the filesystem level will often be caused by corruption that leaves the rest of the RAID stripe intact, which means that the stripe’s parity won’t match, but the RAID driver won’t know which disk has the error. If a filesystem did checksums on metadata (or data) blocks and the chunk size of the RAID was greater than the filesystem block size, then when the filesystem detected an error a different version of the block could be generated from the parity.
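As a rough sketch of what such an interface could do (my own illustration, not an existing Linux md API): with RAID-5, if the filesystem knows which chunk is bad, an alternative copy of that chunk can always be rebuilt by XORing the remaining chunks with the parity chunk.

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def rebuild_chunk(chunks, parity, bad_index):
    """Reconstruct chunk `bad_index` from the other chunks plus parity."""
    others = [c for i, c in enumerate(chunks) if i != bad_index]
    return reduce(xor_bytes, others, parity)

# Three data chunks and their XOR parity
chunks = [b"\x01\x02", b"\x10\x20", b"\x0a\x0b"]
parity = reduce(xor_bytes, chunks)

# The filesystem's checksum says chunk 1 is bad; ask for the
# reconstructed version instead of the on-disk one.
assert rebuild_chunk(chunks, parity, bad_index=1) == chunks[1]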
NetApp produced an interesting guest-post on the StorageMojo blog [2]. One point that they make is that Nearline disks try harder to re-read corrupt data from the disk. This means that a bad sector error will result in longer timeouts, but hopefully the data will be returned eventually. This is good if you only have a single disk, but if you have a RAID array it’s often better to just return an error and allow the data to be retrieved quickly from another disk. NetApp also claim that “Given the realities of today’s drives (plus all the trends indicating what we can expect from electro-mechanical storage devices in the near future) – protecting online data only via RAID 5 today verges on professional malpractice”. It’s a strong claim, but they provide evidence to support it.
Another relevant issue is the size of the RAID device. Here is a post that describes the issue of the Unrecoverable Error Rate (UER) and how it can impact large RAID-5 arrays [3]. The implication is that the larger the array (in GB/TB) the greater the need for RAID-6. It has long been recognised that a larger number of disks in an array increases the need for RAID-6, but the idea that larger disks in a RAID array also increase the need for RAID-6 is new (to me at least).
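A rough back-of-the-envelope calculation (my own numbers, not taken from the linked post) shows why total capacity matters: the expected number of unrecoverable read errors during a RAID-5 rebuild scales with the number of bits that have to be read.

UER = 1e-14            # unrecoverable errors per bit read (a typical
                       # spec for nearline SATA disks)
disk_tb = 1.0          # size of each disk in TB
surviving_disks = 6    # disks that must be read in full to rebuild

bits_read = surviving_disks * disk_tb * 1e12 * 8
expected_errors = bits_read * UER
print(f"expected UREs during rebuild: {expected_errors:.2f}")
# ~0.48 with these numbers; with bigger disks or more of them it
# approaches (and passes) 1, which is what pushes you towards RAID-6.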
Now I am strongly advising all my clients to use RAID-6. Currently the only servers that I run which don’t have RAID-6 are legacy servers (some of which can be upgraded to RAID-6 – HP hardware RAID is really good in this regard) and small servers with two disks in a RAID-1 array.
See also ZFS, where the filesystem has knowledge of the RAID and can ask for another block if the checksum is wrong. You can buy SATA disks that have time-limited error recovery, e.g. the Western Digital RE series.
Yes, without RAID6 you are only protected against a full disk failure. One question though: If your small server has two disks and presumably can be re-installed automatically in a matter of minutes, how much do you really gain with (software) RAID1?
There is more overhead and the only scenario where it helps is when one disk completely dies. If a disk returns garbage it won’t help, since the other disk will (1) never be queried, and (2) even if it were, you would have two blocks which should be identical but aren’t. Toss a coin to decide which one is correct ;)
> If your small server has two disks and presumably can be re-installed automatically in a matter of minutes, how much do you really gain with (software) RAID1?
Uptime. For virtually no extra cost.
James: That’s a great feature. They are working on similar things for Linux; I expect that by the middle of next year there will be something ready to use.
Carsten: Big servers tend to be installed in big server rooms with staff available to do a reinstall. Small servers are often installed in out-of-the-way locations where a machine that goes down might stay down for a while. More overhead doesn’t matter when the machine is at 1% of capacity.
If a disk returns garbage and claims it to be good data then all the common RAID implementations will fail. ZFS and NetApp seem to be the most common implementations that deal with this issue (I am sure that there are others but I don’t know the details).
Also note that with RAID-6 if one disk returns a bad block and claims it to be good then it’s possible to read the other blocks, recognise the bad block, and fix it. But with RAID-5 you can’t do that. Of course this requires that you know via some other method that there is an error (or run a RAID scrubbing operation).
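Here is a minimal sketch of why that works (one byte per “disk” for brevity; this is the standard GF(2^8) P+Q maths, not NetApp’s or Linux md’s actual code): with a single silently corrupted block, the ratio of the Q and P syndrome differences identifies which disk is wrong, something plain XOR parity cannot do.

GF_POLY = 0x11d  # x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= GF_POLY
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a):
    return gf_pow(a, 254)  # a^254 = a^-1 since a^255 = 1 for non-zero a

def pq(data):
    """P (XOR) and Q (Reed-Solomon) syndromes over the data bytes."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

# Stripe of 4 data "disks" plus stored P and Q
data = [0x12, 0x34, 0x56, 0x78]
p_stored, q_stored = pq(data)

# Disk 2 silently returns garbage but claims success
corrupted = list(data)
corrupted[2] ^= 0x5a

p_now, q_now = pq(corrupted)
dp = p_now ^ p_stored   # with RAID-5 this is all you get: "something is wrong"
dq = q_now ^ q_stored

# RAID-6: dq/dp = g^z identifies the failed disk z, so it can be fixed
ratio = gf_mul(dq, gf_inv(dp))
z = next(i for i in range(len(data)) if gf_pow(2, i) == ratio)
corrected = corrupted[z] ^ dp

print(f"bad disk located: {z}, corrected byte: {corrected:#x}")
assert z == 2 and corrected == data[2]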
That’s the reason why we only run RAID6 (or raidz2) on our ~600 TByte storage servers. About RAID1 I still don’t agree (much): I think it is more important to have a simple way of reinstalling the server (even remotely, FAI is great for that) than to gain the extra bit of protection from RAID1. I’d rather go with IPMI or a console server to boot into a rescue system or PXE-boot an installation.
Even in remote places you need someone who can get there fast if a fire starts, so usually a server should never be truly remote, right?