Last night I was in the middle of checking my email when I found that clicking on a link wouldn’t work. It turned out that my web browser had become unavailable due to a read error on the partition for my root filesystem (the usual IDE uncorrectable error thing). My main machine is a Thinkpad T41p. It is apparently possible to replace the CD-ROM drive with a second hard drive to allow RAID-1, but I haven’t felt inclined to spend the money on that, so any hard drive error is a big problem.
Fortunately I had made a backup of /home only a few days ago. I use offline IMAP for my email so that my recent email (the most variable data that matters to me) is stored on a server with RAID-1 as well as on my laptop and my netbook. The amount of other stuff I’ve been working on in my home directory is fairly small, and the amount of that which isn’t on other systems is even smaller (I usually build packages on servers and then scp the relevant files to my laptop for Debian uploads, bug reports, etc.).
The first thing I did was to ssh to one of my servers and paste a bunch of text from various open programs into a file there. That covered the contents of all open programs, the URLs of web pages I was reading, and the contents of an OpenOffice spreadsheet which I couldn’t save directly (it seems that a read-only /tmp will prevent OpenOffice from saving anything). Then I used scp to copy 600M of ted.com videos that I hadn’t backed up. I don’t usually back up such things, but I don’t want to download them twice if I can avoid it (I only have a quota of 25G per month).
After that I made new backups of all filesystems, starting with /home. I then used tar to back up the root filesystem.
The hard drive in the laptop only had a single bad sector, so I could have re-written it so that it would be remapped (as I have done before with that disk), but I think that with a five-year-old disk it’s probably best to replace it. I had been thinking of installing a larger disk anyway.
For the restore, I put the root filesystem back from a month-old backup and then used “diff -r” to discover what had changed. It took me less than an hour to merge the changes from the corrupted root filesystem into the restored one.
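For anyone who wants the same kind of comparison in a script, here is a minimal sketch in Python using the standard filecmp module. It is not what I ran (I used “diff -r”), the function name compare_trees and the mount points in the usage note are made up for illustration, and it only reports which paths differ, not the line-by-line changes that diff shows.

```python
import filecmp
import os


def compare_trees(restored, damaged, prefix=""):
    """Recursively compare two directory trees.

    Returns three sorted lists of relative paths: entries only in
    `restored`, entries only in `damaged`, and files present in both
    whose contents differ or could not be compared.
    """
    # ignore=[] stops dircmp from silently skipping names such as CVS or RCS.
    cmp = filecmp.dircmp(restored, damaged, ignore=[])
    only_restored = [os.path.join(prefix, name) for name in cmp.left_only]
    only_damaged = [os.path.join(prefix, name) for name in cmp.right_only]

    # shallow=False forces a byte-by-byte comparison instead of trusting
    # size and mtime. Files that cannot be read (for example because of a
    # bad sector) land in `errors`, so they are reported as changed too.
    _, mismatch, errors = filecmp.cmpfiles(
        restored, damaged, cmp.common_files, shallow=False
    )
    changed = [os.path.join(prefix, name) for name in mismatch + errors]

    # Descend into directories that exist on both sides.
    for name in cmp.common_dirs:
        sub = compare_trees(
            os.path.join(restored, name),
            os.path.join(damaged, name),
            os.path.join(prefix, name),
        )
        only_restored += sub[0]
        only_damaged += sub[1]
        changed += sub[2]

    return sorted(only_restored), sorted(only_damaged), sorted(changed)
```

With the restored filesystem mounted at a hypothetical /mnt/restored and the copy of the damaged one at /mnt/damaged, calling compare_trees("/mnt/restored", "/mnt/damaged") gives the list of files to look at by hand. Note that it follows symlinks to directories, so a real root filesystem may need extra care.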
Now I have lots of free disk space and no data loss!
I am now considering making an automated backup system for /home. My backup method is to make an LVM snapshot of the LV which is used and then copy that. This copies the encrypted data, so I can safely store it on USB devices while traveling. I could easily write a cron job that uses scp to transfer a backup to one of my servers at some strange time of the night.
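A rough sketch of what such a cron job could run is below, written in Python. Everything specific in it is an assumption for illustration: the volume group and LV names, the 1G snapshot size, the staging file, and the server address are all placeholders, and copying the snapshot with dd into a staging file before the scp is just one way of doing the copy. It needs root, and the staging location needs enough space for a full image of the LV.

```python
import subprocess


def backup_commands(vg, lv, staging_file, remote, snap_size="1G"):
    """Build the command list for one snapshot-and-copy backup run.

    All arguments are placeholders chosen by the caller; nothing here
    is specific to any particular machine.
    """
    snap = f"{lv}-backup-snap"
    return [
        # Create a snapshot so the copy is consistent while /home is in use.
        ["lvcreate", "--snapshot", "--size", snap_size,
         "--name", snap, f"/dev/{vg}/{lv}"],
        # Copy the raw block device, so encrypted data stays encrypted.
        ["dd", f"if=/dev/{vg}/{snap}", f"of={staging_file}", "bs=4M"],
        # Transfer the image to the server.
        ["scp", staging_file, remote],
        # Remove the snapshot so it does not fill up and slow writes.
        ["lvremove", "--force", f"/dev/{vg}/{snap}"],
    ]


def run_backup(commands):
    """Run the steps in order, removing the snapshot even if a copy fails."""
    create, *copy_steps, cleanup = commands
    subprocess.run(create, check=True)
    try:
        for cmd in copy_steps:
            subprocess.run(cmd, check=True)
    finally:
        subprocess.run(cleanup, check=False)


if __name__ == "__main__":
    # Hypothetical names; substitute your own VG, LV, paths and server.
    run_backup(backup_commands(
        vg="vg0",
        lv="home",
        staging_file="/var/tmp/home.img",
        remote="backup.example.com:backups/home.img",
    ))
```

Saved as, say, /usr/local/sbin/home-backup (again a made-up path), a crontab entry such as “30 3 * * * /usr/local/sbin/home-backup” would run it at 3:30 every morning.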
The next issue is how many other disks I will lose this summer. I have installed many small mail server and Internet gateway systems running RAID-1, and it seems most likely that some of them will have dead disks given the record temperatures expected this summer.