Linux, politics, and other interesting things
Should you use software or hardware RAID? Many people claim that hardware RAID is needed for performance (which can be true) but then attribute the difference to the CPU cost of the RAID calculations.
Here is the data logged by the Linux kernel when the RAID-5 and RAID-6 drivers are loaded on a 1GHz Pentium-3 system:
raid5: automatically using best checksumming function: pIII_sse
pIII_sse : 2044.000 MB/sec
raid5: using function: pIII_sse (2044.000 MB/sec)
raid6: int32x1 269 MB/s
raid6: int32x2 316 MB/s
raid6: int32x4 308 MB/s
raid6: int32x8 281 MB/s
raid6: mmxx1 914 MB/s
raid6: mmxx2 1000 MB/s
raid6: sse1x1 800 MB/s
raid6: sse1x2 1046 MB/s
raid6: using algorithm sse1x2 (1046 MB/s)
There are few P3 systems that have enough IO capacity to support anywhere near 2000MB/s of disk IO and modern systems give even better CPU performance.
The fastest disks available can sustain about 80MB/s when performing contiguous disk IO (which incidentally is a fairly rare operation). So if you had ten fast disks performing contiguous IO then you might be using 800MB/s of disk IO bandwidth, but that would hardly stretch your CPU performance. It’s obvious that CPU performance of the XOR calculations for RAID-5 (and the slightly more complex calculations for RAID-6) is not a bottleneck.
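As a rough check on that arithmetic, here is a minimal sketch that compares aggregate disk bandwidth with the XOR rate from the kernel log above. The function name `parity_cpu_share` is made up for illustration, and the model is an assumption: it treats every byte of disk IO as passing through the checksum function exactly once, which makes it a crude upper bound and not a measurement.

```python
# Figures come from the kernel log and the text above; the model is an assumption.
XOR_MB_S = 2044.0   # raid5 pIII_sse checksumming rate on the 1GHz P3
DISK_MB_S = 80.0    # contiguous transfer rate of one fast disk


def parity_cpu_share(num_disks, disk_mb_s=DISK_MB_S, xor_mb_s=XOR_MB_S):
    """Crude upper bound on the fraction of CPU time spent on RAID-5 XOR.

    Assumes every byte of disk IO passes through the checksum function once.
    """
    return (num_disks * disk_mb_s) / xor_mb_s


for disks in (1, 4, 10):
    share = parity_cpu_share(disks)
    print(f"{disks:2d} disks: {disks * DISK_MB_S:5.0f} MB/s of IO, "
          f"about {share:.0%} of one CPU on XOR")
```

Under that assumption even ten disks running flat out would use well under half of the old P3, and a machine of that era could not move that much data to the disks in the first place.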
Hardware RAID-5 often significantly outperforms software RAID-5 (in fact it should always outperform software RAID-5) even though in almost every case the RAID processor has significantly less CPU power than the main CPU. The benefit of hardware RAID-5 is in caching. A standard feature in a hardware RAID controller is a write-back disk cache in non-volatile RAM (RAM that has a battery backup and can typically keep its data for more than 24 hours without power). All RAID levels benefit from this but RAID-5 and RAID-6 gain particular benefits. In RAID-5 a small write (less than the stripe size) requires either that all the blocks other than the ones to be written are read, or that the original content of the block to be written and the parity block are read. In either case writing less than a full stripe to a RAID-5 requires some reads. If the write-back cache can store the data for long enough that a second write is performed to the same stripe (e.g. to files being created in the same inode block) then performance may double.
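The two ways of handling a small write can be counted out in a short sketch. The function name and the returned keys are hypothetical, and it only counts block reads within a single stripe, ignoring caching and seek costs.

```python
def reads_for_partial_write(data_disks, blocks_written):
    """Count the block reads needed to update one RAID-5 stripe.

    data_disks     -- number of data blocks per stripe (parity excluded)
    blocks_written -- how many of those data blocks the write touches
    """
    if not 0 < blocks_written <= data_disks:
        raise ValueError("blocks_written must be between 1 and data_disks")

    # Read-modify-write: read the old contents of the blocks being
    # written plus the old parity block, then XOR the changes in.
    read_modify_write = blocks_written + 1

    # Reconstruct-write: read every data block that is NOT being
    # written and recompute parity from the full set.
    reconstruct_write = data_disks - blocks_written

    return {
        "read_modify_write": read_modify_write,
        "reconstruct_write": reconstruct_write,
        "best": min(read_modify_write, reconstruct_write),
    }


# A 5-disk RAID-5 (4 data blocks + 1 parity block per stripe).
for n in (1, 2, 3, 4):
    print(n, reads_for_partial_write(4, n))
```

Only the full-stripe case (4 of 4 blocks here) needs no reads at all, which is why a cache that holds writes long enough to merge them into fuller stripes helps so much.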
There is one situation where software RAID will give better performance (often significantly better performance): low-end hardware RAID devices. I suspect that some hardware RAID vendors deliberately cripple the performance of low-end RAID devices (by using an extremely under-powered CPU among other things) to drive sales of the more expensive devices. In 2001 I benchmarked one hardware RAID controller as only being able to sustain 10MB/s for contiguous read and write operations (software RAID on lesser hardware would deliver 100MB/s or more). But for random synchronous writes the performance was great and that’s what mattered for the application in question.
There are also reliability issues related to write-back caching. In a well-designed system an update of an entire RAID-5 stripe (one block to each disk including the parity block) will first be performed to the cache and then the cache will be written back. If the power fails while the write is in progress then it will be attempted again when power is restored, thus ensuring that all disks have the same data. With any RAID implementation without such an NVRAM cache a write to the entire stripe could be partially successful. This means that the parity block would not match the data! In such a situation the machine would probably work well (fsck would ensure that the filesystem state was consistent) until a disk failed. When the RAID-5 recovery procedure is used after a disk has failed it uses the parity block to re-generate the missing data, but if the parity doesn’t match then the re-generated data will be different. A disk failure may happen while the machine is online and this could potentially result in filesystem and/or database meta-data changing on a running system. This is a bad situation that most filesystems and databases will not handle well.
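The parity mismatch is easy to show with a toy stripe. This is a minimal sketch using plain XOR parity over three made-up data blocks. The helper `xor_blocks` and the block contents are illustrative only and have nothing to do with the kernel's actual code.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: three data blocks and their parity block.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# Normal recovery: if the disk holding d1 dies, the surviving
# blocks and the parity block re-generate it exactly.
rebuilt_ok = xor_blocks(d0, d2, parity)
print(rebuilt_ok == d1)   # True

# Interrupted write: the new d0 reached its disk but the power
# failed before the matching parity block was written.
d0_new = b"ZZZZ"

# Later the disk holding d1 dies. Recovery now mixes the new d0
# with the stale parity and re-generates something that was never
# written to d1.
rebuilt_bad = xor_blocks(d0_new, d2, parity)
print(rebuilt_bad == d1)  # False
```

Note that d1 was not even part of the interrupted write. It is the block that gets silently changed, which is how unrelated meta-data can end up altered after a disk failure.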
A further benefit of a well-designed NVRAM cache is that it can be moved between systems. For their servers HP makes some guarantees about which replacement machines will accept the NVRAM module. So if you have an HP server running RAID-5 with an NVRAM cache then you could have the entire motherboard die, have HP support provide a replacement server, and when the replacement machine is booted with the old hard drives and NVRAM module installed the data in the write-back cache will be written! This is a significant feature for improving reliability in bad corner cases. NB I’m not saying that HP is better than all other RAID vendors in this regard, merely that I know what HP equipment will do and don’t know about the rest.
It would be good if there was a commodity standard for NVRAM on a PC motherboard. Perhaps a standard socket design that Intel could specify and that every motherboard manufacturer would eventually support. Then to implement such things on a typical PC all that would be required would be the NVRAM module, which while still being expensive would be significantly cheaper than current prices due to the increase in volume. If there was a significant quantity of PCs with such NVRAM (or which could be upgraded to it without excessive cost) then there would be an incentive for people to modify the Linux software RAID code to use it and thus give benefits for performance and reliability. Then it could be possible to install an NVRAM module and drives in a replacement server with Linux software RAID and have the data integrity preserved. But unless/until such things happen, write-back caching that preserves the data integrity requires hardware RAID.
Another limitation of Linux software RAID is expanding RAID groups. An HP server that I work on had two disks in a RAID-1 array. One of my colleagues added an extra disk and made it a RAID-5; the hardware RAID device moved the data around as appropriate while the machine was running and the disk space was expanded without any down-time. Some similar things can be done with Linux, for example here is documentation on converting RAID-1 to RAID-5 with Linux software RAID. But that conversion operation requires some down-time and is not something that’s officially supported, while converting RAID-1 to RAID-5 with HP hardware RAID is a standard supported feature.