Shelf-life of Hardware

Recently I’ve been having some problems with hardware dying. Having one item mysteriously fail is something that happens periodically, but having multiple items fail in a short space of time is a concern.

One problem I’ve had is with CD-ROM drives. I keep a pile of known-good CD-ROM drives because their moving parts mean they break periodically, and I often buy second-hand PCs with broken drives. On each of the last two occasions when I needed a CD-ROM drive I had to try several drives before I found one that worked. It appears that over the course of about a year of sitting on a shelf, four of my CD-ROM drives have spontaneously died. I expect drives to die from mechanical wear if they are used a lot, and I also expect them to die over time as the system cooling fans suck air through them and dust gets caught. I don’t expect them to stop working when stored in a nice dry room. I wonder whether I would find more dead drives if I tested all my CD and DVD drives, or whether my practice of using the oldest drives for machines that I’m going to give away caused me to select the drives that were most likely to die.

Today I had a problem with hard drives. I needed to test a Xen configuration for a client so I took two 20G disks from my pile of spare disks (which were only added to the pile after being tested). Normally I wouldn’t use a RAID-1 configuration for a test machine unless I was actually testing the RAID functionality; it was only the possibility that the client might want to borrow the machine that made me do it. But it was fortunate, as one of the disks died a couple of hours later (just long enough to load all the data on the machine). Yay! RAID saved me from losing my work!

Then I made a mistake that I wouldn’t make on a real server (I only got lazy because it was a test machine and I didn’t have much at risk). I had decided to make it a RAID-1 of 30G disks instead, and to save some inconvenience I transferred the LVM from the degraded RAID on the old drive to a degraded RAID on a new disk. I was using a desktop machine that wasn’t designed for three hard disks, so it was easier to transfer the data in a way that didn’t need more than two disks in the machine at any time. Then the new disk died as soon as I had finished moving the LVM data. I could probably have recovered that from the LVM backup data, and even if that hadn’t worked I had only created a few LVs and they were contiguous, so I could have worked out where the data was.

Instead, however, I decided to cut my losses and reinstall it all. The ironic thing is that I had planned to make a backup of the data in question (so I would have copies of it on two disks in the RAID-1 and another separate disk), but I had a disk die before I got a chance to make a backup.

Having two disks out of the four I selected die today is quite a bad result. I’m sure that some people would suggest simply buying newer parts. But I’m not convinced that a disk manufactured in 2007 would survive being kept on a shelf for a year any better than a disk manufactured in 2001. In fact there is some evidence that the failure rates are highest when a disk is new.
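
As a rough sanity check on how bad that result is, here is a minimal Python sketch of the binomial arithmetic. It assumes each disk fails independently with the same probability over its time on the shelf, which may well not hold for disks from one pile stored in one room. The failure probabilities in the loop are purely hypothetical values chosen for illustration, not measured rates, and `prob_at_least` is a name made up for this sketch.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Probability that at least k of n disks fail.

    Assumes independent failures, each with the same probability p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical per-disk failure probabilities for a year on the shelf.
    for p in (0.02, 0.05, 0.10):
        chance = prob_at_least(2, 4, p)
        print(f"p = {p:.2f}: P(at least 2 of 4 fail) = {chance:.4f}")
```

Under these assumptions, even a 5% chance of any one disk failing would make two or more failures out of four a roughly 1.4% event, so either the shelf failure rate is much higher than that or the failures are not independent.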

Apart from stiction I wouldn’t expect drives to cease working from not being used; I would expect drives to last longer if not used. Yet my rate of losing disks in running machines is minute. Does anyone know of any research into disks dying while on the shelf?

6 comments to Shelf-life of Hardware

  • Paul Archer

    “In fact there is some evidence that the failure rates are highest when a disk is new.”

    If you’re looking for more details on failures, search Google for “bathtub curve”; there is a Wikipedia article.

    As for your question of whether shelf life degrades hardware, there really is a lack of stats on this stuff. There are only standards available for parts that are in use, as this is the most useful data.

  • D. Joe

    “As paper and electrolytic capacitors age their capacitance values drift, they dry out and they become leaky.”

    I’ve had someone suggest that capacitor seals which are fine when warmed under operating conditions will slowly go bad when the equipment is left off.

    The “counterfeit” Nichicon capacitor episode, affecting for instance the Dell GX280 line, has gotten a lot of play, but I expect that this represents accelerated failure due to substandard materials and manufacturing, rather than an entirely different failure mode from what normally happens to capacitor-containing equipment as it ages.

  • Paul Archer

    “As paper and electrolytic capacitors age their capacitance values drift, they dry out and they become leaky.”

    As mentioned, this is a greater concern with electrolytic capacitors. I have not opened up a hard drive, but having electrolytic capacitors in a hard drive would be a bad design due to the confined space and increased heat. More likely, ceramic or plastic capacitors will be used, which drift less and also do not dry out.

    I do agree however that all types of capacitors will age and may cause devices to fail.

  • etbe

    Paul: In most cases it’s most useful to know how long parts last when used. However, when hard disks are used for backups you want the data to remain on them when they are turned off.

    D. Joe: That might explain some of the problems with CD-ROM drives; in my brief tests they appeared to not work at all.

    However, the hard disks worked to a large extent; they merely started losing data after a while.

  • My personal experience with disk reliability has been so erratic it’s hard for me to say. However as I pointed out on (and you contested), there are reports that say hard drives have a limited number of start-ups and shutdowns. I’ve no personal experience to back that up; I’m only repeating what manufacturers put in their MTBF (mean time between failures) ratings.

    I did a little research on the subject and found two links from the Wikipedia article on hard drives you might appreciate:

  • etbe

    Albert: Thanks for the storagemojo link, that is really interesting. I’m convinced, now I want RAID-6 (or NetApp RAID-DP) for everything that’s really important.

    My experience in this case seems to contradict the startup/shutdown theory. The drives had extensive use in a desktop environment before I got them and therefore had a reasonable number of start/stop cycles. But after a year of inactivity they died, while other disks which had been in use during that period didn’t.