I often get reports such as “the server was dead so I rebooted it”. This really doesn’t help me fix the problem, so if the person who uses the server wants reliability (and doesn’t want to be rebooting it and losing data all the time) then they need to give me more information. Here is a quick list of tests to perform before a reboot if you would like your server not to crash in future:
- Does pressing the CAPS-LOCK key on the keyboard make the CAPS LED light up? If so then the OS isn’t entirely dead.
- What is on the screen of the server (you may have to press a key to get the screen to un-blank)? If it’s a strange set of numbers then please photograph them if possible, I might understand what they mean. If you don’t have a camera with high enough resolution to capture them then please make a note of some of the messages. Don’t write down numbers – they are not useful enough to be worth the effort. Write down words, including special words such as OOM and pairs of words separated by a “_” character. If the “server” is a Xen virtual machine then save the contents of the console (as described in my previous post [1]).
- Can you ping the machine (usually by running ping servername from another machine)? If so then networking is basically operational.
- Are the hard drive access lights indicating heavy use? If so then it might be thrashing due to excessive memory use (maybe a DoS attack).
- Can you log in at the console? If so please capture the output of free, ps auxf, and netstat -tn.
- If the machine offers TCP services (almost all servers do) then use the telnet command to connect to the service port and make a note of what happens. For example, to test a mail server type “telnet server 25”. If all goes well you can expect to see “220 some message from the mail server”; note how long it takes for that message to be displayed. Some protocols don’t send a message on connect, for example with HTTP (the protocol used by web servers) you have to enter a few characters and press ENTER to get a response (usually some sort of HTTP error message).
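If you can log in at the console, the three commands in the list above can be captured in one pass instead of being copied by hand. The following is only a minimal sketch: the script name pre-reboot-report.sh, the OUT_DIR default of /var/tmp/crash-reports, and the function names capture and collect are all assumptions made for illustration, not something I ship.

```shell
#!/bin/sh
# Sketch: save the diagnostics requested above to one file before a reboot.
# OUT_DIR, the file naming and the function names are assumptions of this sketch.

OUT_DIR="${OUT_DIR:-/var/tmp/crash-reports}"

# Append a header line and the output (stdout and stderr) of one command
# to the report file given as the first argument. A failing or missing
# command is noted in the report instead of aborting the collection.
capture() {
    cap_file="$1"
    shift
    {
        echo "=== $* ==="
        "$@" 2>&1 || echo "(exit status $?)"
        echo
    } >> "$cap_file"
}

# Run the commands from the list and print the name of the report file.
collect() {
    mkdir -p "$OUT_DIR" || return 1
    report="$OUT_DIR/report-$(date +%Y%m%d-%H%M%S).txt"
    capture "$report" date
    capture "$report" free
    capture "$report" ps auxf
    capture "$report" netstat -tn
    echo "$report"
}

# Only collect when called as "pre-reboot-report.sh run", so the functions
# can also be sourced into a shell without side effects.
if [ "${1:-}" = "run" ]; then
    collect
fi
```

Running it as “sh pre-reboot-report.sh run” prints the name of the report file, which can then be sent along with the notes about the screen, the ping test and the telnet test. If the machine is thrashing badly then even this may take a long time, which is itself worth reporting.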
Finally please don’t tell me that the server is too important and that the users couldn’t wait for you to perform any tests before rebooting it. If the server is important then it is important that it doesn’t crash repeatedly. A crash may even be caused by something that could cause data loss (EG hardware that is failing) or something that could incur extra expense if not fixed quickly (EG failing hardware that will be out of warranty soon). You have to tell users that the choice is to wait for an extra few minutes or risk having another crash tomorrow with further data loss.
If the server is important enough for it to be worth my time to try and fix it then it’s important enough to have these tests performed before the reboot.
I used to have a script that did most of that and also checked the machine’s network connectivity. I had it run from /etc/inittab as the ctrlaltdel script.
We had more than a few reboots done because our ISP’s uplink was down or slow, and “the internet was down, so I rebooted it” was a common result.
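A wrapper of the kind described in this comment might look roughly like the sketch below. This is a guess at the general shape, not the commenter’s actual script: the path /usr/local/sbin/ctrlaltdel-check, the CHECK_HOST variable, the function names, and the policy of refusing to reboot when the uplink does not answer are all assumptions. It assumes a sysvinit system, where the ctrlaltdel action in /etc/inittab names the process that init runs when Ctrl-Alt-Del is pressed, and a Linux ping that accepts -c and -w.

```shell
#!/bin/sh
# Hypothetical /usr/local/sbin/ctrlaltdel-check, wired up in /etc/inittab
# by replacing the usual shutdown entry with something like:
#   ca:12345:ctrlaltdel:/usr/local/sbin/ctrlaltdel-check run
# CHECK_HOST is a placeholder (192.0.2.1 is a documentation-only address);
# point it at something on the far side of the uplink.

CHECK_HOST="${CHECK_HOST:-192.0.2.1}"

# Succeeds if the host answers at least one of three pings within 10 seconds.
network_up() {
    ping -c 3 -w 10 "$1" >/dev/null 2>&1
}

# Print "reboot" if the uplink answers, "skip" if it does not.
# Assumed policy: when the uplink is down a reboot will not bring
# "the internet" back, so the key press is ignored.
decide() {
    if network_up "$1"; then
        echo reboot
    else
        echo skip
    fi
}

# Act only when started with the argument "run", as in the inittab line above.
if [ "${1:-}" = "run" ]; then
    case "$(decide "$CHECK_HOST")" in
        reboot) /sbin/shutdown -r now ;;
        skip)   echo "Uplink check to $CHECK_HOST failed, not rebooting." > /dev/console ;;
    esac
fi
```

A real version would probably also save the diagnostics from the list above before calling shutdown, and would need some way to force a reboot when the network check itself is what has broken.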