Early this morning the server that stores my email (which had 93 days of uptime) had a filesystem-related problem. The root filesystem became read-only and then the kernel message log filled with unrelated messages, so there was no record of the problem. I’m now considering setting up rsyslogd to log the kernel messages to a tmpfs filesystem to cover such problems in future. As RAM is so cheap it wouldn’t matter if a few megs of RAM were wasted by that in normal operation if it allowed me to extract useful data when something goes really wrong. It’s really annoying to have a system in a state where I can log in as root but not find out what went wrong.
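A minimal, untested sketch of what that rsyslogd setup could look like. It assumes that /run is a tmpfs mount, that rsyslogd's kernel log input (imklog) is loaded, and that rsyslogd can create files in /run. Both file names are hypothetical.

```
# /etc/rsyslog.d/kern-tmpfs.conf (hypothetical file name)
# Send a copy of every kernel message to a file on tmpfs, so it can
# still be written after the root filesystem has gone read-only.
# Assumption: /run is tmpfs; the log file name is made up for this sketch.
kern.*    /run/kern-tmpfs.log
```

A file on tmpfs is lost on reboot, so it would have to be copied somewhere else before restarting. Nothing in this fragment limits the size of the file.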
After that I tried 2 kernels in the 3.14 series, both of which had kernel BUG assertions related to Xen networking and failed to network correctly. I filed Debian Bug #756714 about that. Fortunately they at least had enough uptime for me to run a filesystem scrub, which reported no errors.
Then I reverted to kernel 3.13.10 but the reboot to apply that kernel change failed. Systemd was unable to umount the root filesystem (maybe because of a problem with Xen) and then hung the system instead of rebooting. I filed Debian Bug #756725 about that. I believe that if asked to reboot a system there is no benefit in hanging the system with no user space processes accessible. Here are some useful things that systemd could have done:
1. Just reboot without umounting (like “reboot -nf” does).
2. Pause for some reasonable amount of time to give the sysadmin a chance to see the error and then reboot.
3. Go back to a regular runlevel, starting daemons like sshd.
4. Offer a login prompt to allow the sysadmin to log in as root and diagnose the problem.
Options 1, 2, and 3 would have saved me a bit of driving. Option 4 would have allowed me to at least diagnose the problem (which might be worth the drive).
Having a system on the other side of the city which has no remote console access just hang after a reboot command is not useful. It would be near the top of the list of things I don’t want to happen in that situation. The best thing I can say about systemd’s operation in this regard is that it didn’t make the server catch fire.
Now all I really know is that 3.14 kernels won’t work for my server and that 3.13 will cause problems that no-one can diagnose due to lack of data, so I’m now going to wait for it to fail again. As an aside the server has ECC RAM and its hardware is known to be good, so I’m sure that BTRFS is at fault.