Everyone who does any significant amount of sysadmin work will break a server. Most people with any significant experience have broken several. Anyone who has never broken one should be treated with suspicion by the rest of the sysadmin team: they probably haven’t learned the caution that most of us learn from stuffing something up badly.
When you break a server there are some things you can do to greatly mitigate the scale of the disaster. Firstly, carefully watch what happens. For example, don’t type “rm -rf *” and then walk away; watch the command, and if it takes unusually long then press ^C and double-check that you are removing the right files. Fixing a half-broken server is often easier than fixing one that has been broken properly and completely!
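As a concrete habit (my addition, not from the post), you can preview what a wildcard will match before handing it to rm, and run rm verbosely so a mistake is visible while it is still only half done:

# print the exact argument list the glob expands to, deleting nothing
echo rm -rf *
# then run it verbosely so you can press ^C as soon as something looks wrong
rm -rfv -- *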
If you don’t know what to do then do nothing! Doing the wrong thing can make things worse. Seek advice from the most experienced sysadmin available. If backups are inadequate then leave the server in its broken state while seeking advice, particularly if you were working at the end of the day and it doesn’t need to be up for a while. There is often time to seek advice by email before doing anything.
Do not reboot the system! Even if your backups are perfect there is probably some important data that has been recently modified and can be salvaged. Certain types of corruption (such as filesystem metadata corruption) will leave good data in memory where it can be recovered.
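One concrete case of good data surviving in memory (a sketch of my own, not from the post): on Linux, a file that has been deleted on disk but is still held open by a running process can be copied back out through /proc. The PID and descriptor number below are made up:

# list open files whose on-disk link count is zero (deleted but still held open)
lsof +L1
# if PID 1234 holds the deleted file on descriptor 5, copy it back out
cp /proc/1234/fd/5 /root/recovered-file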
Do not logout! If you do, you may not be able to log in again, and that may destroy your chances of fixing the system or recovering data.
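A related precaution worth adding (my suggestion): before starting risky work, open a spare root session in another terminal, and consider switching it to a statically linked shell so it keeps working even if shared libraries get damaged:

# in a second terminal, before doing anything risky
sudo -i
# optionally run a static shell that survives libc damage (if busybox is installed)
busybox sh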
Do not terminate any programs that you have running. There have been more than a few instances where the first step towards recovering from a disaster involved using an open editor session (such as vi or emacs). If the damage prevents an editor from being started (e.g. by removing the editor’s program or a shared object it relies on) then having one already running is very important.
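If the damage extends to the basic utilities as well, an already-running shell can stand in for them using only builtins. A sketch of two common substitutions, assuming a bash session survives (the file names and the nameserver address are just examples):

# read a file without /bin/cat
while IFS= read -r line; do printf '%s\n' "$line"; done < /etc/fstab
# write a small file without an editor
printf 'nameserver 192.0.2.1\n' > /etc/resolv.conf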
These procedures are particularly important if you are unable to visit the server. For example, when using a hosted server at an ISP, the more cost-effective plans give you no option of ever gaining physical access, so plugging the disks into another machine for recovery is not an option.
Does anyone have any other suggestions as to what to do in such a catastrophe?
PS This post is related to the fact that I had to recover the last couple of weeks of blog comments and posts from Google’s cache…
Update: I got my data back; now I have to copy one day of blog stuff from one database to another.
Michael Janke wrote:
I’m not sure if it’s a suggestion, but I can relate a story on how we recovered from an rm -rf of /devices on an E10k, using Legato from an idle SSH session.
I was surprised it worked, and scared enough of the next reboot that we simply left it running for the six months or so it took for our new 25Ks to get installed.
–Mike
Career switching.
You are right. The first thing to do is “DO NOT PANIC”. Once I typed “rm -rf /etc” instead of “rm -rf etc”. I recovered the machine by restoring /etc from the latest backup onto another machine and then using scp to get the files back to the damaged machine (sketched below the list).
I’ve learned 3 things:
* When you do something bad, don’t panic. In my case even logging out could have led to a worse situation.
* Don’t use a root account. Use sudo when needed.
* Do backups.
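A minimal sketch of the recovery path described in that comment, with hypothetical hostnames and backup paths; it relies on the still-open root session on the damaged machine, since new logins may fail without /etc:

# on a working machine ("rescue-host"), restore /etc from the latest backup
mkdir /tmp/etc-restore
tar xf /backups/latest/etc.tar -C /tmp/etc-restore
# from the surviving root session on the damaged machine, pull the tree back
scp -rp rescue-host:/tmp/etc-restore/etc /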
This is such a great post. The first paragraph is especially right. Everyone breaks servers. How you deal with it separates the sysadmins from the…uhh…lesser sysadmins.
This is actually quite relevant to those pesky real-life situations at work when servers go fubar by themselves. One such event where panic would have destroyed everything was when a totally forgotten server (no backups, of course) lost its hard disk in such a way that everything was readable, but only once. Every command, every file, everything was destroyed as it was read.
Of course this was not apparent at first, so I wasted a bunch of good programs while trying to figure out what was happening, and lost a lot of the libraries needed by the software normally used for emergency backups. Thankfully Linux is full of programs capable of transferring files. In the end I managed, with great effort, to export most of the data off that piece of junk. I had no idea what programs were actually installed, so I had to draw up half a dozen different plans, with complete command syntax and a map of dependencies, literally on paper, before I could actually try to extract anything useful.
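For pulling data off a machine in that state, tar piped over ssh is one of the few approaches that needs almost nothing beyond tar and ssh themselves and writes nothing locally. A sketch with a made-up destination host, on the assumption that each read may be the only one that ever succeeds:

# stream directory trees straight to another machine without touching the local disk
tar cf - /home /var/www | ssh rescue-host 'cat > /srv/salvage/dump.tar'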
Hey, I did a similar bad thing recently when a makefile went awry!
Whoops. I documented what I did, if it interests you:
http://dazzle.cs.mcgill.ca/wordpress/?p=18
I’m always suspicious of scripts that contain rm, particularly if it is followed by -f, and more so if it is followed by a variable. I always try to ask myself: what if this variable isn’t set?
Running scripts with set -u mitigates this kind of mistake:
rm -Rf /${SOMEUNSETVARIABLE}
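A short demonstration of the difference set -u makes (the variable name is just an example):

#!/bin/bash
set -u    # treat any reference to an unset variable as a fatal error
# without set -u, ${SOMEUNSETVARIABLE} would expand to an empty string and
# the command below would become "rm -Rf /"; with set -u the shell aborts
# with "SOMEUNSETVARIABLE: unbound variable" before rm ever runs
rm -Rf "/${SOMEUNSETVARIABLE}"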