For the best system security you want to apply kernel security patches as soon as possible. For an attacker, gaining root access to a machine is often a two-step process: the first step is to exploit a weakness in a non-root daemon or take over a user account, and the second step is to compromise the kernel to gain root access. So even if a machine is not used for providing public shell access, or for any other task which involves giving user access to potentially hostile people, having a secure kernel is an important part of system security.
One thing that gets little consideration is the effect of applying security updates on overall uptime. Over the last year there have been 14 security-related updates (I count a silent data loss bug along with the security issues) to the main Debian Etch kernel package. Of those 14, it seems that if you don’t use DCCP, NAT for CIFS or SNMP, IA64, or the dialout group, then you will only need to patch for issues 2, 3 (for SMP machines), 4, 5, 7 (sound drivers get loaded on all machines by default), 9, 10, 11, 12, 13, and 14.
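The count above can be sketched in a few lines of Python. This is only an illustration of my own filtering, assuming the issues are numbered 1 to 14 in the order of the changelog list at the end of this post. The names `SKIPPABLE`, `SMP_ONLY` and `required_updates` are made up for the example and are not any official classification.

```python
# Issue numbers follow the order of the changelog list in this post (1..14).
# The reasons are shorthand for the post's own reasoning (an assumption of
# this sketch), not an official Debian classification.
SKIPPABLE = {
    1: "DCCP and ASN.1 (NAT for CIFS and SNMP) only",
    6: "silent data loss on IA64 only",
    8: "needs membership of a group such as dialout",
}
SMP_ONLY = {3}  # the SMP race only matters on multi-processor machines


def required_updates(smp, total=14):
    """Return the issue numbers that still force a reboot."""
    return [
        n
        for n in range(1, total + 1)
        if n not in SKIPPABLE and (smp or n not in SMP_ONLY)
    ]


if __name__ == "__main__":
    print("SMP:", len(required_updates(smp=True)), "reboots")
    print("Uni-processor:", len(required_updates(smp=False)), "reboots")
```

Adding more entries to `SKIPPABLE` (for example the ALSA issue on a server with sound modules removed) shows how much a trimmed-down configuration reduces the number of reboots.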
This means 11 reboots a year for SMP machines and 10 a year for uni-processor machines. If a reboot takes three minutes (which is an optimistic assumption) then that would be 30 or 33 minutes of downtime a year due to kernel upgrades. Uptime is usually described by the number of “nines”, where the ideal is generally regarded as “five nines” or 99.999% uptime. 33 minutes of downtime a year for kernel upgrades means that you get 99.993% uptime (which is “four nines”). If a reboot takes six minutes (which is not uncommon for servers) then it’s 99.987% uptime (“three nines”).
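For anyone who wants to redo this arithmetic with their own numbers, here is a minimal Python sketch. It assumes a 365-day year and that reboots for kernel upgrades are the only source of downtime. The function names are invented for the example.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def uptime_percent(reboots_per_year, minutes_per_reboot):
    """Uptime as a percentage when reboots are the only downtime."""
    downtime = reboots_per_year * minutes_per_reboot
    return 100.0 * (1 - downtime / MINUTES_PER_YEAR)


def count_nines(reboots_per_year, minutes_per_reboot):
    """Number of leading nines in the uptime figure (99.99x% gives 4)."""
    downtime = reboots_per_year * minutes_per_reboot
    if downtime == 0:
        return math.inf  # no downtime at all
    unavailability = downtime / MINUTES_PER_YEAR
    return math.floor(-math.log10(unavailability))


if __name__ == "__main__":
    for minutes in (3, 6):
        pct = uptime_percent(11, minutes)
        nines = count_nines(11, minutes)
        print(f"11 reboots of {minutes} min: {pct:.4f}% uptime, {nines} nines")
```

With 11 reboots of three minutes this gives roughly 99.9937%, which I quote above as 99.993% by cutting off the extra digits rather than rounding up.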
While it doesn’t seem likely to affect the number of “nines” you get, not using SMP has the potential to avoid future security issues. So it seems that when using Xen (or another virtualisation technology), assigning only one CPU to the DomUs that don’t need any more could improve uptime for them.
For Xen Dom0s which don’t have local users or daemons and don’t use DCCP, NAT for CIFS or SNMP, wireless, CIFS, JFFS2, PPPoE, bluetooth, H.323 or SCTP connection tracking, only issue 11 applies. However for “five nines” you need to have about 5 minutes of downtime a year or less. It seems unlikely that a busy Xen server can be rebooted in 5 minutes. Either all the DomUs need to have their memory saved to disk (writing out the data to disk and reading it back in after a reboot will probably take at least a couple of minutes), or they need to be shut down and booted again after the Dom0 is rebooted (which is a good procedure if the security fix affects both Dom0 and DomU use), and such shutdowns and reboots of DomUs will take a lot of time.
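The downtime budget can also be worked out the other way around: given a target number of nines, how long may each reboot take? The Python sketch below assumes a 365-day year and that every reboot takes the same time. The function names are made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, leap years ignored


def downtime_budget_minutes(nines):
    """Allowed downtime per year for a given number of nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


def max_reboot_minutes(nines, reboots_per_year):
    """Longest each reboot may take if reboots are the only downtime."""
    return downtime_budget_minutes(nines) / reboots_per_year


if __name__ == "__main__":
    for nines in (3, 4, 5):
        budget = downtime_budget_minutes(nines)
        print(f"{nines} nines: {budget:.2f} minutes of downtime a year")
    # A Dom0 with a single applicable issue a year
    print(f"Dom0, 1 reboot, five nines: {max_reboot_minutes(5, 1):.2f} min")
```

Five nines works out to a little over five minutes a year, so a Dom0 with even one kernel update has to complete its whole reboot, including saving and restoring or restarting the DomUs, inside that window.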
Based on the past year, it seems that a system running as a basic server might get “four nines” if configured for a fast boot (it’s surprising that no-one seems to be talking about recent improvements to the speed of booting as high-availability features), and if the boot is slower then you are looking at “three nines”. For a Xen server without some sort of cluster, it seems that “five nines” is unattainable due to reboot times if there is even one issue a year, but “four nines” should be easy to get.
Now while the 14 issues over the last year for the kernel seem likely to be a pattern that will continue, the one issue which affects Xen may not be representative (small numbers are not statistically significant). I feel confident in predicting a need for between 5 and 20 kernel updates next year due to kernel security issues, but I would not be prepared to bet on whether the number of issues affecting Xen will be 0, 1, or 4 (it seems unlikely that there would be 5 or more).
I will write a future post about some strategies for mitigating these issues.
Here is my summary of the Debian kernel linux-image-2.6.18-6-686 (Etch kernel) security updates according to its changelog. They are not in chronological order; they are in the order of the changelog file:
- 05 Jun 2008: CVE-2008-2358 for DCCP and CVE-2008-1673 for ASN.1 (NAT for CIFS and SNMP).
- 23 May 2008: CVE-2008-2136 memory leak in IPv6 over IPv4 tunnels, CVE-2007-6712 timer related bugs, CVE-2008-1615 ptrace on AMD64 architecture, and CVE-2008-2137 “Validate address ranges regardless of MAP_FIXED”.
- 07 May 2008: CVE-2008-1669 SMP race
- 11 Apr 2008: CVE-2007-6694 PPC only, CVE-2008-0007 Add VM_DONTEXPAND to vm_flags in drivers that register a fault handler but do not bounds check the offset argument, CVE-2008-1294 prevent user escape from RLIMIT_CPU, and CVE-2008-1375 fix dnotify race.
- 10 Feb 2008: CVE-2008-0010 and CVE-2008-0600 Fix missing access check in vmsplice.
- 25 Jan 2008: Not a security issue, but silent data loss on IA64.
- 22 Jan 2008: CVE-2007-6151 ISDN memory overrun, CVE-2008-0001 something related to checking the access to a directory, CVE-2007-2878 FAT filesystem related, CVE-2007-4571 ALSA bug that allows user to read kernel memory.
- 17 Sep 2007: Fix minor DOS attack for slightly privileged users (EG members of dialout group).
- 18 Dec 2007: CVE-2007-6063 overflows in ISDN subsystem, CVE-2007-6206 core dumping over an existing file can get the wrong ownership (should be possible to use kernel.core_pattern to work around this [1]), CVE-2007-5966 timer issue, CVE-2006-6058 Minix fs DOS attack via corrupted fs, and CVE-2007-6417 tmpfs memory leak.
- 29 Nov 2007: CVE-2007-3104 local kernel DOS attack (Oops), CVE-2007-4997 malicious frame on wireless interface crashes system, CVE-2007-5500 potential system hang, and CVE-2007-5904 CIFS overflows from server sending corrupt data.
- 02 Oct 2007: CVE-2007-4573 Xen 64bit with 32bit DomU, CVE-2006-5755 Xen, CVE-2007-4133 memory management DOS, and CVE-2007-5093 DOS when unplugging a webcam that is in use.
- 25 Sep 2007: CVE-2007-3731 ptrace causing Oops, CVE-2007-3739 memory management Oops, CVE-2007-3740 CIFS not honoring umask, CVE-2007-4573 ptrace of 32bit process on AMD64 bug, and CVE-2007-4849 JFFS2 (flash media) filesystem bug.
- 27 Aug 2007: CVE-2007-2172 IPv4 memory related issue (local DOS or compromise?), CVE-2007-2875 local user can read kernel memory if cpuset filesystem is mounted, CVE-2007-3105 buffer overflow in random number generator, CVE-2007-3843 CIFS, and CVE-2007-4308 AAC RAID.
- 11 Aug 2007: CVE-2007-1353 bluetooth, CVE-2007-3513 usblcd, CVE-2007-2525 PPPoE, CVE-2007-3642 H.323 connection tracking, CVE-2007-2172 IPv4 local exploit, CVE-2007-2453 slightly less random numbers, CVE-2007-2876 SCTP connection tracking, CVE-2007-3851 i965 batch buffer usage, and CVE-2007-3848 potential privilege escalation.
Isn’t the five nines thing normally done only counting “unscheduled” downtime? So if you can give your users 24 hours notice, and do the reboot at 6am, then you’re in the clear!
Do you have any statistics on how many of these patches would be able to be applied using ksplice? The user space tools for that seem to be headed into lenny, and I’m looking forward to giving them a go when I get the chance. It would be nice to have the patches that ksplice requires available from their security repository, in order to streamline the process.
For a lot of applications simply having multiple redundant, preferably load balanced, machines is a reasonable proposition – especially if you’re dealing with a heavily virtualised environment.
Where *BSD leads Linux, from what I can see, is dealing with redundant firewalls and routers. I haven’t yet seen a Linux based implementation that is able to synchronise firewall states, let alone SA’s for IPsec, in the manner that *BSD can. I’d be very happy to be proven wrong on that one.
If you are conflating “service” uptime with a given server’s uptime, there’s your problem right there :)
Perhaps the xen solution would be to have a couple of separate Dom0 servers? You migrate domU from one to another (which does have around 150ms downtime, quite acceptable :P), perhaps while usage is lower in the receiving server, then you upgrade the original Dom0, then you migrate back every domU from the other machine to the upgraded, then you upgrade that one, then you rebalance. Does not seem so hard (but it is time consuming indeed!)
Rob: OK, I’d like to schedule the next discovery of a Linux kernel security flaw for no earlier than the 10th of July (the new financial year starts on the 1st of July in Australia so down-time before then is less convenient). ;)
We could of course make it scheduled downtime every night at 11PM until 7AM (as I’ve seen done) but that only applies to some industry sectors.
David: No idea how many of them could be applied with ksplice. If I was to research them in detail I could probably reduce the number that I really want to apply (some of them may turn out to not apply to the exact situation of my servers). Also I could solve the ALSA one by removing sound modules (not needed on servers) and probably use SE Linux to stop a few of the others.
I agree that having redundant machines is a good thing. Then you can take one down for an upgrade while leaving the others running.
I have heard many good things about CARP in OpenBSD and AFAIK Linux can not compete with it. How do FreeBSD and NetBSD compare? I had been led to believe that it was just OpenBSD doing well in that regard.
Jon: If you don’t have redundant servers, then that is the case. I’ll be writing more about this in future posts.
Jisakiel: Migrating DomU’s is one potential solution. I haven’t had a chance to test it yet (but it’s on my todo list).
Of course migrating the DomU would rely on dual-attached storage that doesn’t need an upgrade (IE not a Linux NFS server). Maybe a NetApp filer or a FC RAID array.
One other method for saving time when reboots are required for new kernels is using kexec. This will save you time as the system doesn’t need to do a full BIOS reboot.
etbe: OpenBSD is the only one I know that can do both pfsync (firewall state sync) and sasync (IPsec security association sync). FreeBSD can currently only do the former, I believe. NetBSD I’ve never really touched, so don’t know.
One way to mitigate these types of problems is by applying updates at run-time. There’s a nice paper on implementing this in the Linux kernel from EuroSys 2007:
Dynamic and Adaptive Updates of Non-Quiescent Subsystems in Commodity Operating System Kernels
Kristis Makris, Kyung Dong Ryu
http://citeseerx.ist.psu.edu/viewdoc/summary;jsessionid=01EF71F4576C25F6615F21E3D8A99DC6?cid=5563105
You say: If you don’t have redundant servers, then that is the case. I’ll be writing more about this in future posts.
Well… If you don’t have redundant servers, then you should not be buying marketspeak-grade unrealistic measures such as the many-nineness.
[…] Russell Coker (Debian) notes that reboots due to kernel security patches result in a non-trivial amount of downtime. […]
[…] Russel Coker have done some math about kernel security related reboots and some common uptime marketing gags. […]