On Sunday night I started the process of upgrading the LUV server to Debian/Jessie from Debian/Wheezy. My initial plan was to just upgrade Apache first but dependencies required upgrading systemd too.
One problem I’ve encountered in the past is that the Wheezy version of systemd will often hang on an upgrade to a newer version. Generally the solution to this is to run “systemctl daemon-reexec” from another terminal. The problem in this case was that not all the libraries needed for systemd had been installed, so systemd could re-exec itself but immediately aborted. The kernel really doesn’t like it when process 1 aborts repeatedly and apparently immediately hanging is the result. At the time I didn’t know this, all I knew was that my session died and the server stopped responding to pings immediately after I requested a reexec.
The LUV server is hosted at VPAC for free. As their staff have actual work to do they couldn’t spend a lot of time working on the LUV server. They told me that the screen was flickering and suspected a VGA cable. I got to the VPAC server room with the spare LUV server (LUV had been given 3 almost identical Sun servers from Barwon Water) at 16:30. By 17:30 I had fixed the core problem (boot with “init=/bin/bash“, mount the root filesystem rw, finish the upgrade of systemd and it’s dependencies, and then reboot normally). That got it into a stage where the Xen server for Wikimedia Au was working but most LUV functionality wasn’t working.
By 23:00 on Monday I had the full list server functionality working for users, this is the main feature that users want when it’s not near a meeting time. I can’t remember whether it was Monday night or Tuesday morning when I got the Drupal site going (the main LUV web site). Last night at midnight I got the last of the Mailman administrative interface going, I admit I could have got it going a bit earlier by putting SE Linux in permissive mode, but I don’t think that the members would have benefited from that (I’ll upload a SE Linux policy package that gets Mailman working on Jessie soon).
Now it’s Wednesday and I’m still fixing some cron jobs. Along the way I noticed some problems with excessive disk space use that I’m fixing now and I’ve also removed some Wikimedia related configuration files that were obsolete and would have prevented anyone from using a wikimedia.org.au address to subscribe to the LUV mailing lists.
Now I believe that everything is working correctly and generally working better than before.
While Sunday night wasn’t a bad time to start the upgrade it wasn’t the best. If I had started the upgrade on Monday morning there would have been less down-time. Another possibility might be to do the upgrade while near the VPAC office during business hours, I could have started the upgrade while at a nearby cafe and then visited the server room immediately if something went wrong.
Doing an upgrade on a day when there’s no meeting within a week was a good choice. It wasn’t really a conscious choice as I’m usually doing other LUV work near the meeting day which precludes doing other LUV work that doesn’t need to be done soon. But in future it would be best to consciously plan upgrades for a date when users aren’t going to need the service much.
While the Wheezy systemd bug is unlikely to ever be fixed there are work-arounds that shouldn’t result in a broken server. At the moment it seems that the best option would be to kill -9 the systemctl processes that hang until the packages that systemd depends on are installed. The problem is that the upgrade hangs while the new systemctl tries to tell the old systemd to restart daemons. If we can get past that to the stage where the shared objects are installed then it should be ok.
The Apache upgrade from 2.2.x to 2.4.x changed the operation of some access control directives and it took me some time to work out how to fix that. Doing a Google search on the differences between those would have led me to the Apache document about upgrading from 2.2 to 2.4 . That wouldn’t have prevented some down-time of the web sites but would have allowed me to prepare for it and to more quickly fix the problems when they became apparent. Also the rather confusing configuration of the LUV server (supporting many web sites that are no longer used) didn’t help things. I think that removing cruft from an installation before an upgrade would be better than waiting until after things break.
Next time I do an upgrade of such a server I’ll write notes about it while I go. That will give a better blog post about it if it becomes newsworthy enough to be blogged about and also more opportunities to learn better ways of doing it.
Sorry for the inconvenience.