Recently I’ve been having some problems with hardware dying. Having one item mysteriously fail is something that happens periodically, but having multiple items fail in a short space of time is a concern.
One problem I’ve had is with CD-ROM drives. I keep a pile of known good CD-ROM drives because they have moving parts and so break periodically, and because I often buy second-hand PCs with broken drives. On each of the last two occasions when I needed a CD-ROM drive I had to try several drives before I found one that worked. It appears that over the course of about a year of sitting on a shelf I have had four CD-ROM drives spontaneously die. I expect drives to die from mechanical wear if they are used a lot, and I also expect them to die over time as the system cooling fans suck air through them and dust gets caught. I don’t expect them to stop working when stored in a nice dry room. I wonder whether I would find more dead drives if I tested all my CD and DVD drives, or whether my practice of using the oldest drives for machines that I’m going to give away caused me to select the drives that were most likely to die.
Today I had a problem with hard drives. I needed to test a Xen configuration for a client so I took two 20G disks from my pile of spare disks (which were only added to the pile after being tested). Normally I wouldn’t use a RAID-1 configuration for a test machine unless I was actually testing the RAID functionality; it was only the possibility that the client might want to borrow the machine that made me do it. But it was fortunate, as one of the disks died a couple of hours later (just long enough to load all the data on the machine). Yay! RAID saved me from losing my work!
Then I made a mistake that I wouldn’t make on a real server (I only got lazy because it was a test machine and there wasn’t much at risk). I had decided to make it a RAID-1 of 30G disks instead, and to save some inconvenience I transferred the LVM from the degraded RAID on the old drive to a degraded RAID on a new disk. I was using a desktop machine that wasn’t designed for three hard disks, so it was easier to transfer the data in a way that never needed more than two disks in the machine at any time. Then the new disk died as soon as I had finished moving the LVM data. I could probably have recovered that from the LVM backup data, and even if that hadn’t worked I had only created a few LVs and they were contiguous, so I could have worked out where the data was.
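For anyone wanting to try the same two-disk migration, here is a minimal sketch of one common way to do it with pvmove. The commands actually used for the move described above are not recorded here, so this is an assumption about the approach, and the volume group and device names (vg0, /dev/md0, /dev/md1, /dev/sdb1) are hypothetical. The script only builds and prints the command list; it does not run anything.

```python
def degraded_raid_migration_plan(vg, old_md, new_md, new_partition):
    """Return the commands, in order, as argument lists. Nothing is executed.

    All names passed in are placeholders chosen by the caller.
    """
    return [
        # Create a RAID-1 with one real member and one deliberately absent slot.
        ["mdadm", "--create", new_md, "--level=1", "--raid-devices=2",
         new_partition, "missing"],
        # Make the new degraded array an LVM physical volume.
        ["pvcreate", new_md],
        # Add it to the existing volume group.
        ["vgextend", vg, new_md],
        # Move all allocated extents off the old array onto the new one.
        ["pvmove", old_md, new_md],
        # Drop the now-empty old array from the volume group.
        ["vgreduce", vg, old_md],
    ]


if __name__ == "__main__":
    plan = degraded_raid_migration_plan("vg0", "/dev/md0", "/dev/md1", "/dev/sdb1")
    for cmd in plan:
        print(" ".join(cmd))
```

Note that between the pvmove and the point where a second disk is added to the new array, the data lives on a single disk with no redundancy, which is exactly the window in which the new disk died.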
Instead I decided to cut my losses and reinstall it all. The ironic thing is that I had planned to make a backup of the data in question (so I would have copies of it on two disks in the RAID-1 and another separate disk), but I had a disk die before I got a chance to make a backup.
Having two disks out of the four I selected die today is quite a bad result. I’m sure that some people would suggest simply buying newer parts. But I’m not convinced that a disk manufactured in 2007 would survive being kept on a shelf for a year any better than a disk manufactured in 2001. In fact there is some evidence that the failure rates are highest when a disk is new.
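To put a rough number on how bad two out of four is, here is a small sketch using the binomial distribution. It assumes each disk fails independently with the same probability, which may well be wrong for disks of similar age stored together, and the 5% per-disk figure is a made-up illustration, not a measured rate.

```python
from math import comb


def prob_at_least(k, n, p):
    """Probability that at least k of n disks fail, each independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    assumed_p = 0.05  # hypothetical per-disk failure probability, not a measurement
    print(f"P(at least 2 of 4 fail) = {prob_at_least(2, 4, assumed_p):.4f}")
```

With the assumed 5% figure this comes out at about 1.4%, so either the real per-disk failure probability for shelved disks is much higher than that, or the failures are not independent.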
Apart from stiction I wouldn’t expect drives to cease working from not being used; if anything I would expect drives to last longer if not used. Yet my rate of losing disks in running machines is minute by comparison. Does anyone know of any research into disks dying while on the shelf?