Software vs Hardware RAID

Should you use software or hardware RAID? Many people claim that hardware RAID is needed for performance (which can be true) but then claim that this is because of the CPU cost of the RAID calculations.

Here is the data logged by the Linux kernel when the RAID-5 and RAID-6 drivers are loaded on a 1GHz Pentium-3 system:

raid5: automatically using best checksumming function: pIII_sse
  pIII_sse  :  2044.000 MB/sec
raid5: using function: pIII_sse (2044.000 MB/sec)
raid6: int32x1    269 MB/s
raid6: int32x2    316 MB/s
raid6: int32x4    308 MB/s
raid6: int32x8    281 MB/s
raid6: mmxx1      914 MB/s
raid6: mmxx2    1000 MB/s
raid6: sse1x1    800 MB/s
raid6: sse1x2    1046 MB/s
raid6: using algorithm sse1x2 (1046 MB/s)

There are few P3 systems that have enough IO capacity to support anywhere near 2000MB/s of disk IO and modern systems give even better CPU performance.

The fastest disks available can sustain about 80MB/s when performing contiguous disk IO (which incidentally is a fairly rare operation). So if you had ten fast disks performing contiguous IO then you might be using 800MB/s of disk IO bandwidth, but that would hardly stretch your CPU performance. It’s obvious that CPU performance of the XOR calculations for RAID-5 (and the slightly more complex calculations for RAID-6) is not a bottleneck.
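As a rough check of that arithmetic, here is a minimal Python sketch using only the figures quoted above. The names are made up for illustration, and it assumes (pessimistically) that every byte of disk IO has to pass through the checksum function at the benchmarked rate.

```python
# Figures taken from the kernel log and the paragraph above.
DISK_MB_S = 80          # sustained contiguous rate of one fast disk
NUM_DISKS = 10          # the ten-disk example
RAID5_XOR_MB_S = 2044   # pIII_sse checksumming rate
RAID6_MB_S = 1046       # sse1x2 rate


def cpu_fraction(io_mb_s, checksum_mb_s):
    """Fraction of one CPU needed to checksum io_mb_s of data.

    Assumes the benchmark rate is what the CPU can sustain when it
    does nothing else, which is a simplification.
    """
    return io_mb_s / checksum_mb_s


total_io = DISK_MB_S * NUM_DISKS
raid5_share = cpu_fraction(total_io, RAID5_XOR_MB_S)
raid6_share = cpu_fraction(total_io, RAID6_MB_S)

print(f"aggregate disk IO: {total_io} MB/s")
print(f"RAID-5 checksum load: {raid5_share:.0%} of one CPU")
print(f"RAID-6 checksum load: {raid6_share:.0%} of one CPU")
```

Even in this extreme case of ten disks all doing contiguous IO, the aggregate disk rate stays below the measured checksum rate of a single 1GHz CPU for both RAID levels.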

Hardware RAID-5 often significantly outperforms software RAID-5 (in fact it should always outperform software RAID-5) even though in almost every case the RAID processor has significantly less CPU power than the main CPU. The benefit of hardware RAID-5 is in caching. A standard feature in a hardware RAID controller is a write-back disk cache in non-volatile RAM (RAM that has a battery backup and can typically keep its data for more than 24 hours without power). All RAID levels benefit from this but RAID-5 and RAID-6 gain particular benefits. In RAID-5 a small write (less than the stripe size) requires either that all the blocks other than the ones to be written are read, or that the original content of the block to be written and the parity block are read; in either case writing less than a full stripe to a RAID-5 requires some reads. If the write-back cache can store the data for long enough that a second write is performed to the same stripe (e.g. to files being created in the same inode block) then performance may double.
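The two ways of handling a small write described above can be sketched in a few lines of Python. This is only an illustration of the XOR arithmetic with made-up block contents and a hypothetical `xor_blocks` helper; it is not how any particular RAID driver is written.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR any number of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus one parity block (toy 2-byte blocks).
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
parity = xor_blocks(*data)

# A small write replaces only data block 0.
new_block = b"\xff\x00"

# Option A: read all the OTHER data blocks and recompute parity from scratch.
parity_a = xor_blocks(new_block, data[1], data[2])

# Option B: read the old content of the block being written plus the old
# parity, then XOR the old data out and the new data in.
parity_b = xor_blocks(parity, data[0], new_block)

# Both approaches give the same new parity, and both needed reads first.
assert parity_a == parity_b
```

Option B needs two reads regardless of how many disks are in the array, while option A needs one read per untouched data block, so which is cheaper depends on the array width and the size of the write.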

There is one situation where software RAID will give better performance (often significantly better performance): when the alternative is a low-end hardware RAID device. I suspect that some hardware RAID vendors deliberately cripple the performance of low-end RAID devices (by using an extremely under-powered CPU among other things) to drive sales of the more expensive devices. In 2001 I benchmarked one hardware RAID controller as being able to sustain only 10MB/s for contiguous read and write operations (software RAID on lesser hardware would deliver 100MB/s or more). But for random synchronous writes the performance was great and that’s what mattered for the application in question.

Also there are reliability issues related to write-back caching. In a well designed system an update of an entire RAID-5 stripe (one block to each disk including the parity block) will first be performed to the cache and then the cache will be written back. If the power fails while the write is in progress then it will be attempted again when power is restored, thus ensuring that all disks have the same data. With any RAID implementation without such an NVRAM cache a write to the entire stripe could be partially successful. This means that the parity block would not match the data! In such a situation the machine would probably work well (fsck would ensure that the filesystem state was consistent) until a disk failed. When the RAID-5 recovery procedure is used after a disk has failed it uses the parity block to re-generate the missing data, but if the parity doesn’t match then the re-generated data will be different. A disk failure may happen while the machine is online and this could potentially result in filesystem and/or database meta-data changing on a running system, which is a bad situation that most filesystems and databases will not handle well.
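The failure mode described above can be shown with a toy stripe in Python. The block contents and the `xor_blocks` helper are invented for illustration, and the scenario assumes the power failed after one data block was rewritten but before the parity block was updated.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR any number of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# A consistent stripe: three data blocks and their parity.
old_data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(*old_data)

# Power fails mid-write: data block 0 reached its disk, the parity did not.
on_disk = [b"ZZZZ", old_data[1], old_data[2]]
stale_parity = parity

# While all disks work, reads return the data blocks directly and the
# stale parity is never consulted, so nothing looks wrong.
assert on_disk[1] == b"BBBB"

# Later the disk holding block 1 fails and is rebuilt from the survivors.
rebuilt_block_1 = xor_blocks(on_disk[0], on_disk[2], stale_parity)

# The rebuilt block is NOT the data that was on the failed disk, even
# though block 1 was never part of the interrupted write.
assert rebuilt_block_1 != old_data[1]
```

Note that the block that ends up corrupted is one that was not being written when the power failed, which is why the damage can go unnoticed until a disk is replaced.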

A further benefit of a well designed NVRAM cache is that it can be used on multiple systems. For their servers HP makes some guarantees about which replacement machines will accept the NVRAM module. So if you have an HP server running RAID-5 with an NVRAM cache then you could have the entire motherboard die, have HP support provide a replacement server, and then when the replacement machine is booted with the old hard drives and NVRAM module installed the data in the write-back cache will be written! This is a significant feature for improving reliability in bad corner cases. NB: I’m not saying that HP is better than all other RAID vendors in this regard, merely that I know what HP equipment will do and don’t know about the rest.

It would be good if there was a commodity standard for NVRAM on a PC motherboard. Perhaps a standard socket design that Intel could specify and that every motherboard manufacturer would eventually support. Then to implement such things on a typical PC all that would be required would be the NVRAM module, which while still being expensive would be significantly cheaper than current prices due to the increase in volume. If there was a significant quantity of PCs with such NVRAM (or which could be upgraded to it without excessive cost) then there would be an incentive for people to modify the Linux software RAID code to use it and thus give benefits for performance and reliability. Then it could be possible to install an NVRAM module and drives in a replacement server with Linux software RAID and have the data integrity preserved. But unless/until such things happen, write-back caching that preserves the data integrity requires hardware RAID.

Another limitation of Linux software RAID is expanding RAID groups. An HP server that I work on had two disks in a RAID-1 array. One of my colleagues added an extra disk and made it a RAID-5; the hardware RAID device moved the data around as appropriate while the machine was running and the disk space was expanded without any down-time. Some similar things can be done with Linux, for example here is documentation on converting RAID-1 to RAID-5 with Linux software RAID [1]. But that conversion operation requires some down-time and is not something that’s officially supported, while converting RAID-1 to RAID-5 with HP hardware RAID is a standard supported feature.

17 comments to Software vs Hardware RAID

  • I wonder why you didn’t touch on system bus usage. Some thoughts from an amateur are on my blog at http://fortytwo.ch/blog/archives/2007/11/#e2007-11-16T10_54_22.txt , but you’re bound to have more experience here.

  • I notice that you don’t mention array scrubbing; it’s not a solution to the bad writes problem, but it does let you detect and fix it before a disk failure.

    In brief, array scrubbing reads all the disks, checks the parity, and rewrites parity when it’s wrong. This has the side-benefit of forcing sector reallocation or even disk failures if a rarely used part of a disk is unreadable.

    Note if you’re looking for this on a hardware RAID, it’s sometimes called checking rather than scrubbing.

  • Ole-Morten Duesund

    My main case for recommending sw-RAID over hw-RAID doesn’t really have anything to do with speed. It’s all about reliability.
    Some day your RAID-card will break. What do you do if it can’t be replaced? Either because you can’t afford a new card, or because the model is no longer produced? Perhaps it only comes in a new SuperDuperRAID variant that doesn’t fit in your server?
    With software RAID this won’t happen. If your disk controller/mainboard breaks, get a new (and improved) one. Plug everything back together and you’re back and running.

    I will however admit that this is mostly on a personal/small-business level. Large businesses/enterprises probably have somewhat different requirements.

  • Olaf van der Spek

    > With any RAID implementation without such a NVRAM cache a write to the entire stripe could be partially successful. This means that the parity block would not match the data! In such a situation the machine would probably work well (fsck would ensure that the filesystem state was consistent) until a disk failed.

    This part doesn’t make sense. Why would it work well until a disk fails?
    Part of the data is corrupt and that could show up directly. If a disk fails, the data is not corrupted any further.

    > If there was a significant quantity of PCs with such NVRAM (or which could be upgraded to it without excessive cost) then there would be an incentive for people to modify the Linux software RAID code to use it and thus give benefits for performance and reliability

    Why would the advantage be restricted to software RAID? IMO a huge write-back cache would also be an advantage for normal disks.

    > But unless/until such things happen write-back caching that preserves the data integrity requires hardware RAID.

    Even without write-back caching you need NVRAM to ensure data integrity to avoid partial writes. Even in single-disk volumes.

  • Olaf,

    The reason it would work well until the disk fails is performance. A “chunk” of a RAID 5 consists of data (spread out over n-1 disks) and parity (on the remaining disk). In the absence of failure, you read the data parts of the chunk, and ignore the parity (for performance reasons – there’s no point doing an extra IO when you don’t have to). Thus, if the parity is faulty, but the data is fine, the RAID appears to work normally.

    When one disk fails, the data is rebuilt from parity. At this point, the bad parity results in silently corrupted data. The only “solution” to this is array scrubbing (as I mentioned in my previous comment), which detects the bad parity and rebuilds it *before* the data is lost.

  • Olaf van der Spek

    > The reason it would work well until the disk fails is performance. A “chunk” of a RAID 5 consists of data (spread out over n-1 disks) and parity (on the remaining disk).

    No, that’s not exactly how RAID 5 works. The parity is distributed over all disks, it’s not on a dedicated parity disk. That’s RAID 3 IIRC.

    > In the absence of failure, you read the data parts of the chunk, and ignore the parity (for performance reasons – there’s no point doing an extra IO when you don’t have to).

    You could also read the data parts except one plus the parity.
    But even if you only read data, who guarantees the data is not corrupted?

    > Thus, if the parity is faulty, but the data is fine, the RAID appears to work normally.

    But why would the parity be corrupt and not the data itself?

  • Olaf van der Spek

    > Some day your RAID-card will break. What do you do if it can’t be replaced?

    Restore from backup? RAID is no backup. It’s more about high availability.

    Besides, unless the producer of the card went out of business, I doubt they don’t have a solution for you (if it’s a decent RAID card).

  • etbe

    http://etbe.coker.com.au/2007/11/21/raid-and-bus-bandwidth/
    cmot: Thanks for the suggestion, the above URL will have a response to your comment and post shortly.

    Simon: Scrubbing is good, but is typically only run from a cron job weekly (or some other infrequent interval). Of course if you do an entire RAID rebuild operation after every power failure (as Linux software RAID is prone to do) then you get some consistency at the cost of performance.

    Ole: Auction sites have HP machines with hardware RAID at quite reasonable prices. HP supports them for a minimum of 5 years (within which time they guarantee that they will provide replacement hardware to read the disks). Also it’s not impossible to recreate an unknown RAID format. Some time ago I attended a lecture on how to determine the RAID format when you get a set of disks from an unknown machine. You have to recognise some patterns in the data, a large file with a known format is good for this. Then as there isn’t much variation in RAID formats you just work out the stripe size and the order of the disks and it’s simple to write a program to dump all the data to a single block device or file.

    Olaf: When considering the mathematical issues we consider only a single stripe of a RAID-5 which has one block of data for parity and N-1 blocks of data. The fact that each successive stripe rotates the order by one is not relevant when considering single-stripe issues. Another lacking feature in Linux software RAID is the ability to read all disks and compare the result. Of course when you can’t actually do anything useful once you know the data is bad the deficiency isn’t so bad. It’s a pity that you can’t have a 3 disk mirror and take the majority vote or have RAID-6 read from all disks (with the ability to correct a single block error).

  • Olaf van der Spek

    > When considering the mathematical issues we consider only a single stripe of a RAID-5 which has one block of data for parity and N-1 blocks of data.

    Fair enough, but I don’t see how that relates to my comments.

  • Olaf,

    A RAID system should provide a certain minimum set of guarantees; an important one is that the only time when data may be lost is when power is lost unexpectedly. In particular, for models without NVRAM, any power loss may induce data loss. For models with NVRAM, data loss may occur if the NVRAM module loses power for more than a specified time (usually 30 days). If you go outside these conditions for any reason, you are advised to check your data, but once you have checked your data, no further corruption should occur.

    The trouble with the situation where the parity is corrupt but not the data is that my data check after power loss shows that I’m A-OK. Some time later (possibly long enough that I’ve forgotten ever losing power), I have a disc failure while online; I hot-swap the drive, which *should* maintain my data (thanks to the data loss guarantee the RAID provides). However, the silently damaged parity means that at *this* stage, when I’m working within the limits of the guarantees RAID provides, I lose data. This is unacceptable behaviour, as the guarantee is broken; note that it *would* be acceptable to corrupt data when the power goes, so long as a data check then shows that there’s been trouble.

  • Olaf van der Spek

    I agree with that text.

    > The trouble with the situation where the parity is corrupt but not the data is that my data check after power loss shows that I’m A-OK.

    What exactly does that check do? After power loss you should do a complete parity check.
    And how did the parity get corrupt (while the data remained intact)?

  • Olaf,

    When you’ve had an unexpected power fail event (whether we’re talking loss of system power without NVRAM cache, or NVRAM battery failure), the RAID is permitted to corrupt data; a *good* RAID implementation guarantees that it will only corrupt areas that have been written to in the last n seconds, for some (documented) value of n. For example, if my disks guarantee that a write has completed when the drive returns completed to a write command (and not that the write has been cached), the RAID may guarantee that only the last 8MB of writes are at-risk, assuming that the data has not been synced to disk (e.g. by fsync).

    Higher levels of software build on these guarantees to avoid data loss, and do things like journalling writes (or BSD-style soft updates), to ensure that once the filesystem claims something’s on disk, it’s safe. In turn, application software like databases or mail servers ensures that it stays within these guarantees, and can do things like replaying journals to ensure that the data is correct, or at least consistent with claims made to the outside world (e.g. a mail server doesn’t give a 200 OK response to incoming mail until it’s confident that the mail is safely stored – fsync is often used on UNIX-likes).

    Parity corruption just means that one disk failed to write all its blocks before power loss, and in this case it happened to be the disk writing parity, not the disk writing data. The trouble with insisting on a complete parity check before you bring the machine back up is that it’s slow; you want to do the parity check in the background while the machine is back in service in many cases (e.g. an outgoing SMTP server, where the RAID stores the outgoing queue), to reduce downtime to a minimum. If you don’t catch the faulty parity before a data drive fails, you risk losing data that you thought was safe, within the guarantees of the RAID system.

  • Really good and really interesting post. I expect (and other readers maybe :)) new useful posts from you!
    Good luck and successes in blogging!

  • Olaf van der Spek

    Isn’t there a huge difference between losing your write back cache (NVRAM failure) and a normal power loss? In the first case, you lose data that the software thinks has already been committed. In the second case, you merely corrupt data that is about to be overwritten.

    > you want to do the parity check in the background while the machine is back in service in many cases

    In that case, you have to verify every read you do until the background check is complete.
    But isn’t this supposed to be caught by replaying journals?

    > within the guarantees of the RAID system.

    I thought you just said a RAID was allowed to lose data on power loss?

  • Olaf,

    There is a difference in terms of the severity of data loss after a cache failure and a power loss, but the basic problem is still the same.

    The RAID is allowed to lose data on power loss. It is also allowed to ignore parity during reads, even if it’s still checking for consistency – thus reads do *not* get verified during the background check (this is beyond the guarantees made by a RAID system, which are all about protecting your data from a drive failure, not a system failure).

    The guarantee is that the only time you can read corrupt data is immediately after a power failure; you are supposed to replay journals, or otherwise validate your data, to ensure that the corruption hasn’t damaged your data. If parity is damaged, *and* the background check does not reach the damaged parity before a drive failure (assuming the drive with the damaged parity block is not the drive that fails), you read correct data (and thus don’t correct it from the journal), but later see it corrupted when the data block is rebuilt from parity.
