Everyone who does any significant amount of sysadmin work will break a server. Most people who have any significant experience will have broken several. Anyone who has never broken one should be treated with suspicion by other members of the sysadmin team; they probably haven’t learned the caution that most of us learn from stuffing something up badly.
When you break a server there are some things that you can do to greatly mitigate the scale of the disaster. Firstly, carefully watch what happens. For example, don’t type “rm -rf *” and then walk away. Watch the command, and if it takes unusually long then press ^C and double-check that you are removing the right files. Fixing a half-broken server is often easier than fixing one that has been broken properly and completely!
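One way to make the double-check a habit is to look at what the glob expands to before handing it to rm. The following is only a minimal sketch; preview_rm is a hypothetical helper name of my own, not a standard tool, and it deletes nothing.

```shell
# Hypothetical helper (not a standard command): show what a delete
# would touch, without deleting anything.
preview_rm() {
    # Print the directory we are really in; a wrong cwd is a classic cause
    # of removing the wrong files.
    printf 'cwd: %s\n' "$(pwd)"
    # "$#" is the number of arguments the shell expanded the glob into.
    printf 'would remove %d top-level entries:\n' "$#"
    for entry in "$@"; do
        printf '  %s\n' "$entry"
    done
}
```

Run it as “preview_rm *” in the directory concerned, read the list, and only then type “rm -rf -- *”. If the count or the directory is not what you expected, you found out before anything was lost.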
If you don’t know what to do then do nothing! Doing the wrong thing can make things worse. Seek advice from the most experienced sysadmin who is available. If backups are inadequate then leave the server in a broken state while seeking advice, particularly if you were working at the end of the day and it doesn’t need to be up for a while. There is often time to seek advice by email before doing something.
Do not reboot the system! Even if your backups are perfect there is probably some important data that has been recently modified and can be salvaged. Certain types of corruption (such as filesystem metadata corruption) will leave good data in memory where it can be recovered.
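As one illustration of data that survives only while the system stays up: on Linux, a file that has been deleted but is still held open by a running process can usually be copied back out through /proc. This is a sketch under that assumption (a Linux system with /proc mounted); recover_fd is a hypothetical helper name, and the PID and file descriptor number have to be found by hand, for example by looking through /proc/PID/fd.

```shell
# Hypothetical helper, assuming Linux with /proc mounted.
# Usage: recover_fd PID FD DESTINATION
recover_fd() {
    # /proc/PID/fd/FD still refers to the open file even after its
    # directory entry has been removed, so a plain copy retrieves the data.
    cp "/proc/$1/fd/$2" "$3"
}
```

A reboot, or killing the process that holds the file open, closes the descriptor and the data is gone for good, which is one more reason to leave everything running.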
Do not log out! If you do then you may not be able to log in again, and this may destroy your chances of fixing the system or recovering data.
Do not terminate any programs that you have running. There have been more than a few instances where the first step towards recovering from disaster involved using an open editor session (such as vi or emacs). If the damage prevents an editor from being used (e.g. by removing the editor’s program or a shared object it relies on) then having one running is very important.
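The same reasoning applies to a shell that is still open. If programs such as ls and cat have been removed, a running shell can still look around using only its builtins. The sketch below assumes a POSIX-style shell such as bash or dash, where printf, read and [ are builtins; list_dir and show_file are hypothetical names chosen for illustration.

```shell
# List a directory without ls, using only globbing and builtins.
list_dir() {
    for entry in "$1"/* "$1"/.[!.]*; do
        # A glob that matches nothing is left as literal text; skip it.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s\n' "$entry"
    done
}

# Print a text file without cat, one line at a time.
show_file() {
    # The extra test keeps a final line that lacks a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

This only works for text files and it is slow, but it is enough to read a config file or find out what is left in a directory when nothing else will run.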
These procedures are particularly important if you are unable to visit the server. For example, when using a hosted server at an ISP, the more cost-effective plans give you no option to ever gain physical access, so plugging the disks into another machine for recovery is not an option.
Does anyone have any other suggestions as to what to do in such a catastrophe?
PS This post is related to the fact that I had to recover the last couple of weeks of blog comments and posts from Google’s cache…
Update: I got my data back, now I have to copy one day of blog stuff from one database to another.