Reliability of RAID

ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6.
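The argument can be roughed out numerically. This is only a sketch: the error rate of one unrecoverable read error per 10^14 bits is an assumption (a figure often quoted for consumer drives, not something stated in this post), errors are assumed to be independent, and the function name is made up for the example.

```python
import math

def p_read_error(bytes_to_read, bit_error_rate=1e-14):
    """Chance of at least one unrecoverable read error while reading
    bytes_to_read bytes, assuming independent errors per bit."""
    bits = bytes_to_read * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p stays accurate.
    return -math.expm1(bits * math.log1p(-bit_error_rate))

TB = 10**12  # decimal terabyte, as used on drive labels

# How much data has to be read back cleanly to finish a rebuild.
for surviving_tb in (1, 2, 6, 12):
    chance = p_read_error(surviving_tb * TB)
    print(f"{surviving_tb:>2} TB to read: {chance:.1%} chance of a read error")
```

The point is the shape of the curve rather than the exact numbers: the more data the remaining drives must return without error, the less plausible a clean rebuild becomes.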

Of course most systems in the field aren’t even using RAID-6. All the most economical hosting options involve just RAID-1, and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct: you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it is theoretically possible to reconstruct the data in some situations by determining which disk should have its data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads).
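A toy model of one RAID-5 stripe shows why parity can detect this kind of error but not locate it. The block contents and helper name are invented for illustration, and real arrays rotate parity across disks, which this sketch ignores.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks, which is how RAID-5 parity is formed."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data disks
parity = xor_blocks(data)            # contents of the parity disk

# Disk 1 silently returns wrong data while reporting success.
stripe = [data[0], b"BxBB", data[2], parity]

# A consistent stripe XORs to all zero bytes, so the damage is detectable.
inconsistent = any(xor_blocks(stripe))

# Discarding any one disk and rebuilding it from the rest always gives a
# self-consistent stripe, so parity alone cannot say which disk was wrong.
candidates = []
for suspect in range(len(stripe)):
    others = stripe[:suspect] + stripe[suspect + 1:]
    candidates.append((suspect, xor_blocks(others)))
```

Each entry in candidates is an equally valid reconstruction as far as the RAID layer is concerned. Choosing between them needs a higher level check such as fsck or a checksum.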

For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn’t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers.

Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of “consumer” disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with “consumer” disks that it is very improbable that I’m not getting such errors. So the question is, how can I discover such errors and fix them?
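To put a number on "very improbable", here is a back-of-the-envelope sketch. It assumes that my disks behave like the study population and fail independently of each other, neither of which is guaranteed, and the disk counts are arbitrary examples.

```python
def p_any_disk_affected(n_disks, p_per_disk=0.085):
    """Chance that at least one of n_disks has had such an error, given the
    8.5% per-disk figure from the study and assuming independence."""
    return 1 - (1 - p_per_disk) ** n_disks

for n in (2, 10, 20, 40):
    print(f"{n:>2} disks: {p_any_disk_affected(n):.0%} chance of at least one affected disk")
```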

In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn’t send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn’t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn’t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won’t know which disk is failing.
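Until something like that patch is in place, the count can be polled by hand. The md driver exposes a mismatch_cnt file under sysfs for each array, and the sketch below just reads it. The function name is my own, and the sys_block parameter exists only so the code can be pointed at a test directory.

```python
from pathlib import Path

def md_mismatch_counts(sys_block="/sys/block"):
    """Return {array name: mismatch_cnt} for every md array under sys_block."""
    counts = {}
    for path in sorted(Path(sys_block).glob("md*/md/mismatch_cnt")):
        # path is <sys_block>/<array>/md/mismatch_cnt
        array_name = path.parent.parent.name
        counts[array_name] = int(path.read_text().strip())
    return counts

def mismatch_warnings(counts):
    """Build one warning line for each array with a non-zero count."""
    return [f"{name}: {count} mismatched sectors in the last check"
            for name, count in counts.items() if count]
```

Running mismatch_warnings(md_mismatch_counts()) from cron and mailing any output would give a crude alert. It reports only a per-array count, so it does nothing for the problem of not knowing which disk is at fault.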

Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used; in one case the kernel will check 2.7TB of RAID even when there’s only 470GB in use!

The BTRFS Filesystem

The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct!
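The idea can be sketched in a few lines. This is a simplified model and not how BTRFS is implemented: plain CRC-32 from zlib stands in for whatever checksum the filesystem really uses, and the function name and block contents are invented.

```python
import zlib

def read_mirrored(copies, stored_checksum):
    """Return (index, data) for the first mirror copy whose checksum matches
    the checksum recorded when the block was written."""
    for index, data in enumerate(copies):
        if zlib.crc32(data) == stored_checksum:
            return index, data
    raise IOError("no copy matches the stored checksum")

block = b"important data"
stored = zlib.crc32(block)          # recorded at write time

# The first mirror has been corrupted; the second is intact.
good_index, good_data = read_mirrored([b"important dat\x00", block], stored)
```

Plain RAID-1 sees two copies that disagree and has to guess. With a stored checksum the disagreement resolves itself, and the copy that failed the check identifies the disk that needs attention.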

I’ve just done a quick test of this. I created a filesystem with the command “mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?” and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem. That worked correctly, although I was disappointed that it didn’t report any errors; I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command “btrfs scrub start .” and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors were. I was a little disappointed that the command “btrfs scrub status .” just gave me a count of the corrected errors and didn’t mention which device had the errors.

It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration.

BTRFS is rumored to have performance problems. I will test this but don’t have time to do so right now. Anyway I’m not always particularly concerned about performance; I have some systems where reliability is important enough to justify a performance loss.

BTRFS and Xen

The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4.

This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.

BTRFS and DRBD

The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead.

It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!

Comparing Risks

I don’t want to use BTRFS in production now because of the risk of bugs. While it’s unlikely to have really serious bugs it’s theoretically possible that a bug could deny access to data until kernel code is fixed and it’s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it’s almost certain that I will lose small amounts of data and it’s most probable that I have silently lost data on many occasions without realising.

Jonathan Angliss

February 6, 2012 at 03:03

Interesting post. I had this weird recollection that I had read something about end of life of RAID before. Interestingly by the same author, predicting the end of RAID5. Matt Simmons covered it a while back…

http://www.standalone-sysadmin.com/blog/2008/10/is-raid-5-a-risk-with-higher-drive-capacities/

Steven Chamberlain

February 7, 2012 at 04:34

I recently lost a server to btrfs. Hetzner migrated a Debian Squeeze Xen guest between host systems, and upon completion all filesystem accesses hung. After having to do a forced reboot, the kernel (or a newer 3.x kernel) would hang / flood printk trying to mount the filesystem. Then after booting from a rescue system the btrfs tools would segfault trying to read it. The new fsck tool crashed a lot but managed to recover *some* data by running the tool multiple times. I’m reluctant to trust it again anytime soon.

For most purposes I’ve gone back to the ever-reliable reiserfs (v3) which has never, ever let me down. Even when I once screwed up a RAID-5 reconstruction from the wrong disk, which had held an older copy of that RAID set. The resulting filesystem was recoverable with the standard reiserfs tools (in-place, without reformat), except for data that had been added/modified since that older copy.

I really don’t care for performance if it may sacrifice the robustness of the filesystem and ability to recover from media errors. Claimed increases in performance seem to come at a price now, such as the ext4 ‘delalloc’ fiasco and having to modify userland tools like dpkg to be able to write data safely again.

Things sound much happier in ZFS land, with RAID handled within the filesystem itself, checksumming of all data (bad reads will instead return a good copy from another volume), and reconstruction that only has to process the blocks that are allocated/corrupted. Its Wikipedia page mentions the issue of silent media errors when storage reaches higher capacities. I’ve been meaning to try it someday.

Nick J

February 7, 2012 at 14:13

There’s no email reporting by mdadm of inconsistencies? That’s disappointing … makes me wonder then why most of the install steps you see online get you to set up and test email reporting, if it’s not going to actually do anything useful. Come to think of it, I’ve never had mdadm send me any mail, even when a raid-1 array became degraded. If you have set up email reporting, then you’ve already self-selected as a person that wants to get emails whenever there’s a problem with the raid array.

Adam Skutt

February 7, 2012 at 22:15

The first article has a problem: BER should normally be better than stated, not worse, most of the time. To achieve the published BER at a reasonable confidence level across all of your drives, you have to produce drives that are better than that level most of the time, and slightly worse only occasionally. Presumably, really bad drives are caught by QA and tossed and almost never enter the product stream. More importantly, there have been rumors that the published BER level, especially for consumer drives, is not the factory targeted rate. The factory targets a rate better than what’s published.

When looking at empirical survey data of these things, one has to be very, very careful to control for environmental factors. Run the disks outside their stated temperature range, for example, and errors do go up. Studies have to be careful to control for these events. Also, I’m not sure there’s been a comprehensive study into the long term effects of environmental problems, like how much shorter life is if you drop a disk once, or if it overheats for just one hour.

The LSE definition used by the NetApp study is very problematic. UREs cannot be lumped together with other types of errors. Some bad areas on a disk are expected, and drives have the necessary redundancy to handle the issue.

What’s always amazed me, at the consumer level, is that many people don’t use all of their hard disk space (or frequently even half). Duplicating everything, even on the same disk, with checksums, would be a cheap and easy way to cover many failure cases.

etbe

February 7, 2012 at 23:15

Steven: A hostile user who can control file content (IE write any files) can put in place data on a ReiserFS filesystem that will result in SUID files appearing after a fsck rebuild-tree operation. In my tests I couldn’t get an exploit via this approach, but I’m not good at developing exploits. BTRFS is not regarded as ready for production at the moment. But I will probably try it on one of my less important servers soon.

Nick: Well my bug report was well received by the DD, so in Debian at least we should get these things solved soon.

Adam: I wouldn’t count on the BER being a lot better than what is quoted. I know of one manufacturer that had a routine practice of refurbishing drives that were returned under warranty and then giving them to other people. You have to expect the worst from a company that does that sort of thing.

Also in terms of reliability it seems reasonable to assume that NetApp devices are better supported than most systems in that regard. No-one is going to pay for an expensive NetApp device and then mistreat it the way some of my servers are mistreated. For some of my servers I can’t even monitor the temperature!

Running BTRFS RAID-1 on the same disk is something I am considering, I just suggested that on a sysadmin mailing list shortly before your comment and there’s currently a debate going about it.

February 8, 2012 at 00:08

I don’t see why I should believe that refurbished drives mean the BER must be as stated. Hard drives are complicated devices, and plenty fail for reasons that can be repaired by the factory. Assuming the statistics are properly computed, they account for such things anyway. Most companies that make goods with a warranty are likely to send you a refurbished device when the original fails. Refurbished drives are still covered by the balance of the warranty, so you’re basically saying you don’t trust HDD manufacturers at all. Plainly, I’d believe the BER on consumer-grade drives to be stated lower than reality simply to make them look worse versus their higher-end siblings. It’s not like there’s substantial manufacturing differences between the two to begin with.

And I don’t see why I should believe that people protect their NetApp devices simply because they’re “expensive”. Plenty of companies end up with cooling, power, or other failures simply because they didn’t properly plan their server rooms and capacity needs. NetApps aren’t so expensive as to make people build a brand new data center for them; you need IBM mainframes or the like to get that to happen. Besides, I don’t really see what that has to do with my objection to the error metric they used. There are problems with focusing on just UREs, but there are arguably more problems with their metric. With UREs, at least we know data loss has occurred. Plus, it’s pretty hard to believe that they didn’t intentionally choose that metric to tout the features of their products.

Nick, I have seen email in the event of actual MD RAID-1 failure. That particular drive had a controller failure of some kind and refused to talk to the system anymore whatsoever after the failure.

February 8, 2012 at 00:22

Hard drives are fragile. If I have one die during the warranty period then I would like to get a new one as a replacement not one that has possibly been mistreated by someone else. If a drive can be repaired to an as-new condition then they should repair it and send it back to the person who originally owned it instead of sending one user’s problems to another user.

NetApps are expensive enough that almost all of them will end up in proper server rooms, as opposed to all the 1RU and 2RU servers that end up in closets and under desks. Not to mention the desktop systems installed as servers.

February 8, 2012 at 00:32

Either you trust them to properly recertify the drive or you don’t. If you don’t trust their ability to QA a failed drive then there’s no rational reason to trust their ability to QA a brand new drive, either, which is the problem I have with what you’re saying. I presume they’re not sending people back their original drive since they’re likely being sent overseas for repair and evaluation.

As far as the server room thing, I know plenty of companies who’ve thought they had a proper server room until a failure happened, or someone ignored the plan and forced more stuff into the room anyway, or they were forced to move, etc. Plus, when there are problems, they’re not likely to be discussed externally for a host of reasons.

February 8, 2012 at 14:03

> I know of one manufacturer that had a routine practice of refurbishing
> drives that were returned under warranty and then giving them to other
> people.

IBM used to do it, before they sold the business to Hitachi (returned disks had a “certified repaired part” sticker on them) – I asked about this and they said by law they had to put the sticker on (heaven forbid they just sent out a new disk). Seagate still does it – I have 2 disks here that have a green disk sticker with “Certified Repaired HDD” on them, and then one of them died just out of the replacement warranty period.

I think the better question is: which disk manufacturers do not send out repaired disks? Are Western Digital or Samsung any better?

> Nick, I have seen email in the event of actual MD RAID-1 failure.

Doh! Just tested, and this was my fault. I changed ISPs and the outgoing smarthost was still set to the old ISP’s, so outgoing mail was backlogged.

So yes, you should in fact get a mail with a subject like so: “DegradedArray event on /dev/md0”, and mailing can be tested with: sudo mdadm --monitor --test --oneshot /dev/md0

> I presume they’re not sending people back their original drive

No, they’re not – they make a big song and dance about how you won’t get back the same drive, and for a seagate one, I just checked the replacement drive’s details against the original’s, and it’s not the same serial number, and not even the same model number (it’s a similar model number, and the disk has the same interface and capacity). Looks like the returned ones go into a big pool and are reused as spare parts for repaired disks if possible.

> If you don’t trust their ability to QA a failed drive then there’s
> no rational reason to trust their ability to QA a brand new drive, either

But they might have lower standards, and it’s probably in their financial interest to do so. If the warranty on a new drive is 5 years, and then you have a drive fail after 4 years (which has happened to me), then they have no incentive to return you a repaired drive that will last more than 1 year, since beyond that period you’re out of warranty. Furthermore, even if the replacement subsequently fails in the warranty period, since the manufacturer is sending out repaired parts, and you’ll know this after the first replacement, your incentive to get it replaced is reduced, since you’ll suspect that the 2nd replacement is more likely to fail too. Because of this, I’ve come to feel the warranties on drives are relatively useless. It’d be different if I knew at the time of purchase that I’d get a new drive in the event of a failure.

February 8, 2012 at 22:58

I’m not sure why people think it’s trivial to make a hard drive that lasts for a shorter time with any sort of regularity or consistency, or that it’s trivial to figure out when a particular drive will fail. Even if the QA on failed drives is somehow lower, it’s not years lower. Manufacturers also have a huge incentive to properly refurbish their drives: repeated failures would open them up to considerable legal action. Heck, sufficiently widespread failure of the original equipment is enough to bring legal action; you don’t want to be routinely selling product that fails within the warranty period. You definitely don’t want to replace a failed product with something else that fails in the warranty period.

February 8, 2012 at 23:45

Legal action is really expensive. I recently got legal advice about a dispute, the lawyer owed me a favor so he wrote a letter for free but he advised me that if the other party to the dispute didn’t just give in then I should forget about it. Amounts less than $10,000 aren’t worth legal action due to the costs – which aren’t always paid by the loser. The amount in question was equal to the cost of three SATA disks, fortunately the other guy just sent me a cheque and didn’t dispute it.

When the amount of computer gear is worth legal action it still doesn’t happen. Any time a company buys a $1,000,000 computer there are a variety of pressures on management not to escalate the dispute. My experience is that the Trade Practices Act is routinely violated by vendors of expensive gear and none of their customers complain.

One of my friends bought an IBM drive when they were going through a bad period. He had it replaced with refurbished disks three times before he requested a smaller disk (from a different series and therefore without the same defects).

The only reason for getting into a legal battle over a small amount of money is when there’s a political battle and the money doesn’t matter. EG the Linux users who sue for a refund of a MS license fee.

Really I can’t understand why drive manufacturers don’t just use returned drives as scrap metal and supply new disks. The costs of postage both ways would be a significant portion of the manufacture price of a new disk. So it should be better for them economically to just send a new disk and get the matter resolved. Posting drives back and forth three times is just bad for business. The economics of corporations is to not do special cases, they have an assembly line and run it as fast as possible. Refurbished gear is a special case and only makes sense if the gear is really expensive, and drives just aren’t that expensive.

February 9, 2012 at 00:45

I think you misunderstood me when I said “considerable”. Systematic failure to honor warranties is the sort of thing that subjects you to class action lawsuits (which did happen to IBM), investigations by the US FTC (and similar agencies in other jurisdictions), and the like. Past behavior by certain industries in this regard is why many jurisdictions have “lemon laws” and other explicit statutes about warranties. It’s probably reasonable to assume a refurbished drive is of lower quality. It’s not the least bit reasonable to assume that means that QA is being entirely disregarded on the refurbished drives. Until recently, it was the standard policy of all the manufacturers to send you a brand new drive if your drive failed within a month (and it’s still policy for enterprise drives, I believe). That’s not the policy of a company that’s out to get you.

You’re also fundamentally wrong about the economics in these situations. Regardless of the quality of a product, all manufacturers must be prepared to accept returns for products sold at retail. As such, the cost of bearing returned and failed products ends up being spread out over all the devices: if you get 1/1000 returned, and it costs X to refurbish it, then you add X/1000 to cost. It may be a special case, but it’s also not a special case companies can ignore. Unless your product is really cheap (and hard drives are not really cheap), it makes no sense to throw away returns if you can resell them or offer them as warranty replacements, even if it costs you some money to do so. As long as the cost is lower than the resale point, or the cost of a new device (for warranty replacements), it’s economically sensible to do so.

Postage costs don’t change the situation either. They only have to pay postage one way (in the US, at least) and they ship the drives using the cheapest method available. They negotiate pricing deals with the shipping companies. The cost of shipping a drive bought from Newegg to my house is $6.25 (guaranteed three day), so the manufacturers are doing as well, if not better, on warranty returns. As a point of reference, Newegg sells refurbished drives for around $30-50 each, while the cheapest new drive is about $70 (which is admittedly price distorted due to the flooding). That’s probably workable for the HDD manufacturers.

Nikos

February 9, 2012 at 00:54

Unrecoverable read errors are known to the disk and to the OS, so your worry that random data will be treated as good is misplaced. Of course, cosmic rays or little green men can change your bits in flight, but the probability of that is much lower. Of course checksumming would help with that, but it’s not as gloomy as you make it, I think.

Alex Chekholko

February 9, 2012 at 03:45

When you talk about the problems with RAID at the beginning of your post, you seem to be conflating two very different error modes!

When we say “read error”, we mean “error: could not read data”, and we don’t mean “data returned is silently corrupted”. It is the former case that RAID is good for, since then you just read the data from the other disk (in RAID-1) or reconstruct from parity (in RAID-5). Your NetApp reference is talking about the former kind of error, but you interpret it as the latter kind of error.

The latter case is exceedingly rare, and is usually not reported as an error, it’s just silent data corruption. And if you think that your disk is returning randomly corrupted data, you’ve got bigger problems, and you shouldn’t trust any single disk anywhere.

The more expensive enterprise solutions are starting to offer “verify on read” for RAID arrays with parity, or in their filesystem (e.g. Isilon), but there’s a performance hit.

February 9, 2012 at 11:09

Write holes in RAID implementations always create the possibility of incorrect data being written to disk. When a failure occurs, you run the risk of the incorrect data being used in the rebuild, causing silent corruption. This could happen even with theoretically perfect disks that never have read errors.

Additionally, controller problems that cause a RAID array to be marked as failed (even when the disks are OK) can cause rebuild problems and data loss.

March 13, 2012 at 17:28

http://neil.brown.name/blog/20100211050355

Neil Brown’s article about smart or simple RAID recovery is worth reading.

April 27, 2012 at 17:11

http://etbe.coker.com.au/2012/04/27/btrfs-zfs-layering-violations/

I wrote the above post in response to a LWN comment that referenced this post.

etbe – Russell Coker
