Efficiency of Cooling Servers

One thing I had wondered about is why home air-conditioning systems are more efficient than the air-conditioning systems used for server rooms. I received some advice on this matter from the manager of a small server room (which houses about 30 racks of very powerful and power-hungry servers).

The first issue is terminology: the efficiency of a “chiller” is regarded as the number of watts of heat energy removed divided by the number of watts of electricity consumed by the chiller. For example, with a 200% efficient air cooling plant a 100W light bulb is rated as a 150W heat source: 100W to run the bulb and 50W for the cooling plant to remove the heat.
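To make the arithmetic concrete, here is that example as a quick calculation (a sketch using awk, with the efficiency expressed as a fraction):

awk 'BEGIN { power = 100; eff = 2.0; printf "%.0fW total\n", power + power / eff }'

This prints “150W total”; at 100% efficiency the same bulb would count as a 200W heat source.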

For domestic cooling I believe that 300% is fairly common for modern “split systems” (that’s the specification of the air-conditioner on my house, and the other air-conditioners on display had similar ratings). For high-density server rooms with free air cooling I have been told that a typical efficiency range is between 80% and 110%! So it’s possible to use MORE electricity on cooling than on running the servers!

One difficulty in cooling a server room is that the air often can’t flow freely (unlike a big open space such as the lounge room of your house). Another is the range of temperatures and the density of heat production in some parts (a 2RU server can dissipate 1000W of heat in a small space). These factors can be minimised by extracting hot air at the top and/or rear of racks, forcing cold air in at the bottom and/or front, and being very careful when planning where to place equipment. HP offers some services related to designing a server room to increase cooling efficiency, one of which uses computational fluid dynamics to simulate the air-flow in the server room [1]! CFD is difficult and expensive (the complete package from HP for a small server room costs more than some new cars); I believe that the fact that it is necessary for the correct operation of some server rooms is an indication of the difficulty of the problem.

The most effective ways of cooling servers involve tight coupling of chillers and servers. This often means using chilled water or another liquid to extract the heat. Chilled water refrigeration systems largely remove the problem of being unable to extract the heat from the right places, but instead you have some inefficiency in pumping the water, and the servers are fixed in place. I have not seen or heard of chilled water being used for 2RU servers (I’m not saying that it doesn’t get used or that it wouldn’t make sense – merely that I haven’t seen it). When installing smaller servers (2RU and below) there is often a desire to move them, and attaching a chilled-water cooling system would make such a move more difficult and expensive. When a server weighs a ton or more you aren’t going to move it in a hurry (big servers have to be mostly disassembled before the shell can be moved, and the shell might require the efforts of four men to move it).

Another issue related to water cooling is the weight. Managing a moderate amount of water involves a lot of heavy pipes (a leak would be really bad) and the water itself can weigh a lot. A server room that is based around 20Kg servers might have some issues with the extra weight of water cooling (particularly the older rooms), but a server room designed for a single rack that weighs a ton can probably cope.

I have been told that the cooling systems for low density server rooms are typically as efficient as those used for houses, and may even be more efficient. I expect that the engineering trade-offs when designing an air-conditioner for home use favor a low purchase price, but someone who approves the purchase of an industrial cooling system will be more concerned about the overall cost of operations and will be prepared to spend some extra money up-front and recover it over the course of a few years. The fact that server rooms run 24*7 also gives more opportunity to recover the money spent on the purchase (my home A-C system runs for about 3 months a year, for considerably less than 24 hours a day).

So it seems that the way to cool servers efficiently is to have low density server rooms (to the largest extent possible). One step towards this goal would be to have servers nearer the end users, for example having workgroup servers near the workgroup instead of in the server room. Of course physical security of those servers would be more challenging – but if all the users have regular desktop PCs that can be easily 0wned then having the server for them in the same room probably doesn’t make things any worse. Modern tower servers are more powerful than the rack mounted servers that were available a few years ago while also being very quiet. A typical rack-mounted server is not something you would want near your desk, but one of the quiet tower servers works quite well.

Improving Blog Latency to Benefit Readers

I just read an interesting post about latency and how it affects web sites [1]. The post has some good ideas, but unfortunately it mixes information on esoteric technologies that are not generally applicable (such as InfiniBand) with material that is of wide use (such as ping times).

The post starts by describing the latency requirements of Amazon and stock broking companies. It’s obvious that stock brokers have a great desire to reduce latency, and it’s not surprising that Google and Amazon analyse the statistics of their operations and make changes to improve their results by a few percent. But it seems to be a widely held belief that personal web sites are exempt from such requirements. The purpose of creating content on a web site is to have people read it; if you can get an increase in traffic of a few percent by having a faster site, and if those readers refer others, then it seems likely to have the potential to significantly improve the result. Note that an increase in readership through a better experience is likely to compound, and a compounding increase of a few percent a year will eventually add up (an increase of 4% a year will double the traffic in 18 years).
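The 18 year figure is easy to verify, as a growth rate of r doubles traffic in log(2)/log(1+r) years:

awk 'BEGIN { printf "%.1f years\n", log(2) / log(1.04) }'

This prints “17.7 years”.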

I have been considering hosting my blog somewhere else for a while. My blog is currently doing about 3G of traffic a month, which averages out to just over 1KB/s; peaks will of course be a lot greater than that, and the 512Kb/s of the Internet connection would probably be a limit even if it wasn’t for the other sites on the same link. The link in question is being used for serving about 8G of web data per month and there is some mail server use which also takes bandwidth. So performance is often unpleasantly slow.

For a small site such as mine the most relevant issues seem to be available bandwidth, swap space use (or the lack thereof), disk IO (for when things don’t fit in cache), and whether the available CPU power exceeds the requirements.

For hosting in Australia (as I do right now) bandwidth is a problem. Internet connectivity is not cheap in any way and bandwidth is always limited. Also the latency of connections from Australia to other parts of the world is often not as good as desired (especially when using cheap hosting, as I currently do).

According to Webalizer only 3.14% of the people who access my blog are from Australia; they will get better access to my site if it is hosted in Australia, and maybe the 0.15% of people who access my blog from New Zealand will also benefit from the locality of sites hosted in Australia. But the 37% of readers who are described as “US Commercial” (presumably .com) and the 6% described as “United States” (presumably .us) will benefit from US hosting, as will most of the 30% who are described as “Network” (.net I guess).

For getting good network bandwidth the best option is to choose the best ISP in the US that I can afford, where determining what is “best” is largely based on rumour.

One of the comments on my post about virtual servers and swap space [2] suggested just not using swap, and referenced the Amazon EC2 (Elastic Compute Cloud) service and the Gandi.net hosting (which is in limited beta and not generally available).

The Amazon EC2 cloud service [3] has a minimum offering of 1.7G of RAM, 1 EC2 Compute Unit (equivalent to a 1.0-1.2GHz 2007 Opteron or 2007 Xeon processor), and 160G of “instance storage” (local disk for an instance), running 32bit software. Currently my server is using 12% of a Celeron 2.4GHz CPU on average (which includes a mail server with lots of anti-spam measures, Venus, and other things). Running just the web sites on 1 EC2 Compute Unit should use significantly less than 25% of a 1.0GHz Opteron. I’m currently using 400M of RAM for my DomU (although the MySQL server is in a different DomU). 1.7G of RAM for my web sites is heaps even when including a MySQL server. Currently a MySQL dump of my blog is just under 10M of data, so with 1.7G of RAM the database should stay entirely in RAM, which will avoid the disk IO issues. I could probably use about 1/3 of that much RAM and still not swap.

The cost of EC2 is $US0.10 per hour of uptime (for a small server), so that’s $US74.40 per month. The cost for data transfer is 17 cents per GB for sending and 10 cents per GB for receiving (bulk discounts are available for multiple terabytes per month).
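The monthly figure is just the hourly rate multiplied by the hours in a 31 day month:

awk 'BEGIN { printf "$%.2f\n", 0.10 * 24 * 31 }'

This prints “$74.40”.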

I am not going to pay $74 per month to host my blog. But sharing that cost with other people might be a viable option. An EC2 instance provides up to 5 “Elastic IP addresses” (public addresses that can be mapped to instances) which are free when they are being used (there is a cost of one cent per hour for unused addresses – not a problem for me as I want 24*7 uptime). So it should be relatively easy to divide the costs of an EC2 instance among five people by accounting for data transfer per IP address. Hosting five web sites that use the same software (MySQL and Apache for example) should reduce memory use and allow more effective caching. A small server on EC2 costs about five times more than one of the cheap DomU systems that I have previously investigated [4] but provides ten times the RAM.

While the RAM is impressive, I have to wonder about CPU scheduling and disk IO performance. I guess I can avoid disk IO on the critical paths by relying on caching and not doing synchronous writes to log files. That just leaves CPU scheduling as a potential area where it could fall down.

Here is an interesting post describing how to use EC2 [5].

Another thing to consider is changing blog software. I currently use WordPress which is more CPU intensive than some other options (due to being written in PHP), is slightly memory hungry (PHP and MySQL), and doesn’t have the best security history. It seems that an ideal blog design would use a language such as Java or PHP for comments and use static pages for the main article (with the comments in a frame or loaded by JavaScript). Then the main article would load quickly and comments (which probably aren’t read by most users) would get loaded later.

Killing Servers with Virtualisation and Swap

The Problem:

A problem with virtual machines is the fact that one rogue DomU can destroy the performance of all the others through inappropriate resource use. CPU scheduling is designed to allow reasonable sharing of computational resources, but it is unfortunately not well documented – the XenSource wiki currently doesn’t document the “credit” scheduler which is used in Debian/Etch and CentOS 5 [1]. One interesting fact is that CPU scheduling in Xen can have a significant effect on IO performance, as demonstrated in the paper by Ludmila Cherkasova, Diwaker Gupta and Amin Vahdat [2]. But they only showed a factor of two performance difference (which while bad is not THAT bad).

A more significant problem is managing virtual memory: when there is excessive paging, performance can drop by a factor of 100 and even the most basic tasks become impossible.

The design of Xen is that every DomU is allocated some physical RAM and has its own swap space. I have previously written about my experiments to optimise swap usage on Xen systems by using a tmpfs in the Dom0 [3]. The aim was to have every Xen DomU swap data out to a tmpfs, so that if one DomU was paging heavily and the other DomUs were not then the paging might take place in the Dom0’s RAM and not hit disk. The experiments were not particularly successful, but I would be interested in seeing further research in this area as there might be some potential to do some good.
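For reference, a minimal sketch of that sort of setup (assuming a Dom0 with RAM to spare, and using Xen’s file-backed block device syntax with made-up volume and domain names):

# in the Dom0: create a swap file on a tmpfs
mount -t tmpfs -o size=520m tmpfs /mnt/tmpswap
dd if=/dev/zero of=/mnt/tmpswap/demo.swap bs=1M count=512
# export the file to the DomU with a config entry such as:
#   disk = [ 'phy:/dev/vg0/demo,xvda,w', 'file:/mnt/tmpswap/demo.swap,xvdb,w' ]
# then run mkswap and swapon on /dev/xvdb inside the DomU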

I have previously written about the issues related to swap space sizing on Linux [4]. My conclusion is that following the “twice RAM” myth will lead to systems becoming unusable due to excessive swapping in situations where they might otherwise be usable if the kernel killed some processes instead (naturally there are exceptions to my general rule due to different application memory use patterns – but I think that my general rule is a lot better than the “twice RAM” one).

One thing that I didn’t consider at the time is the implications of this problem for Xen servers. If you have 10 physical machines and one starts paging excessively then you have one machine to reboot. If you have 10 Xen DomUs on a single host and one starts paging heavily then you end up with one machine that is unusable due to thrashing and nine machines that deliver very poor disk read performance – which might make them unusable too. Read performance can particularly suffer when one process or VM is writing heavily to disk, due to the way that disk queuing works. It’s not uncommon for an application to read dozens or hundreds of blocks from disk to satisfy a single trivial request from a user, and if each of these block read requests has to wait for a large amount of data to be written out from the write-back cache then performance will suck badly (I have seen this in experiments on single disks and on Linux software RAID – but have not had the opportunity to do good tests on a hardware RAID array).

Currently for Xen DomUs I am allocating swap spaces no larger than 512M, as anything larger than that is likely to cause excessive performance loss to the rest of the server if it is actually used.
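With LVM backed storage that just means creating a small LV per DomU, for example (assuming a volume group named vg0 and a DomU named demo):

lvcreate -n demo_swap -L 512M vg0
# referenced from the DomU config as:
#   disk = [ 'phy:/dev/vg0/demo_root,xvda,w', 'phy:/dev/vg0/demo_swap,xvdb,w' ]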

A Solution for Similar Problems:

A well known optimisation technique for desktop systems is to use a separate disk for swap; in desktop machines people often use the old disk as swap after buying a new larger disk for main storage. The benefit of this is that swap use will not interfere with other disk use, for example the disk reads needed to run the “ps” and “kill” programs won’t be blocked by the memory hog that you want to kill. I believe that similar techniques can be applied to Xen servers and give even greater benefits. When a desktop machine starts paging excessively the user will eventually take a coffee break and let the machine recover, but when an Internet server such as a web server starts paging excessively the requests keep coming in and the number of active processes increases. So it seems likely that using a different device for the swap will allow some processes to satisfy requests by reading data from disk while other processes are waiting to be paged in.

Applying it to Xen Servers:

The first thing that needs to be considered for such a design is the importance of reliable swap. When it comes to low-end servers there is ongoing discussion about the relative merits of RAID-0 and RAID-1 for swap. The benefit of RAID-0 is performance (at least in perception – I can imagine some OS swapping algorithms that could potentially give better performance on RAID-1, and I am not aware of any research in this area). The benefit of RAID-1 is reliability. There are two issues in regard to reliability: one is continuity of service (EG being able to hot-swap a failed disk while the server is running), and the other is the absence of data loss. For some systems it may be acceptable to have a process SEGV (which I presume is the result if a page-in request fails) due to a dead disk, reserving the data loss protection of RAID for files. One issue related to this is the ability to regain control of a server after a problem. For example if the host OS of a machine had non-RAID swap then a disk failure could prevent a page-in of data related to sshd or some similar process and thus make it impossible to recover the machine without hardware access. But if the swap for a virtual machine was on a non-RAID disk and the host had RAID for its swap then the sysadmin could login to the host and reboot the DomU after creating a new swap space on a working disk.

Now if you have a server with 8 or 12 disks (both of which seem to be reasonably common configurations for modern 2RU servers) and if you decide that RAID is not required for the swap space of DomUs, then it would be possible to assign single disks as swap space for groups of virtual machines. So if one client had several virtual machines they could have them share the same single disk for swap, and a thrashing server would only affect the performance of other VMs from the same client. One possible configuration would be a 12 disk server that has a four disk RAID-5 array for main storage and 8 single disks for swap. 8 CPU cores is common for a modern 2RU server, so it would be possible to lock 8 groups of DomUs so that they share CPUs and swap spaces. Another possibility would be to have four groups of DomUs where each group had a RAID-1 array for swap and two CPU cores.
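As an illustration of the idea (with hypothetical device and domain names), each DomU in one client’s group could share a swap disk and a pair of cores via entries in its domain config:

disk  = [ 'phy:/dev/vg0/clientA_web,xvda,w', 'phy:/dev/sdc1,xvdb,w' ]
vcpus = 2
cpus  = "2,3"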

I am not sure of the aggregate performance impact of such a configuration. I suspect that a group of single disks would give better performance for swap than a single RAID array, and that RAID-1 would outperform RAID-5. For a single DomU it seems most likely that using part of a large RAID array for swap space would give better performance. But the benefit in partitioning the server seems clear. An architecture where each DomU had its own dedicated disk for swap space is something that I would consider a significant benefit if renting a Xen DomU. I would rather have the risk of down-time (which should be short with hot-swap disks and hardware monitoring) in the rare case of a disk failure than have bad performance regularly in the common situation of someone else’s DomU being overloaded.

Failing that, having a separate RAID array for swap would be a significant benefit. If every process that isn’t being paged out could deliver full performance while one DomU was thrashing then it would be a significant improvement over the situation where any DomU can thrash and kill the file access performance of all other DomUs. A single RAID-1 array should handle all the swap space requirements for a small or medium size Xen server.

One thing that I have not tested is the operation of LVM when one disk goes bad. In the case of a disk with bad sectors it’s possible to move the LVs that are not affected to other disks and to remove the LV that was affected and re-create it after removing the bad disk. The case of a disk that is totally dead (IE the PV header can’t be read or written) might cause some additional complications.

Update Nov 2012: This post was discussed on the Linode forum, where comments include “The whole etbe blog is pretty interesting” and “Russell Coker is a long-time Debian maintainer and all-round smart guy” [5]. Thanks for that!

Ownership of the Local SE Linux Policy

A large part of the disagreement about the way to manage the policy seems to be based on who will be the primary “owner” of the policy on the machine. This isn’t a problem that only applies to SE Linux; the same issue applies to various types of configuration files and scripts throughout the process of distribution development. Having a range of modules which can be considered configuration data, all coming from a single source, seems to make SE Linux policy unusual among packages. By comparison, the reasons for packaging all Apache modules in the main package seem a lot clearer.

One idea that keeps cropping up is that as the policy is modular it should be included in daemon packages and maintained by the person maintaining the distribution package of the daemon. The reason for this request usually seems to be the idea that the person who packages a daemon for a distribution knows more about how it works than anyone else; I believe that this is false in most cases. When I started working on SE Linux I had a reasonable amount of experience in maintaining Debian packages of daemons and server processes, but I had to learn a lot about how things REALLY work to be able to write good policy. Also if we were to have policy modules included in the daemon packages, then those packages would need to be updated whenever there were serious changes to the SE Linux policy. For example Debian/Unstable flip-flopped on MCS support recently; changing the policy packages to re-enable MCS was enough pain, and getting 50 daemon packages updated would have been unreasonably painful. Then of course there is the case where two daemons need to communicate: if the interface which is provided with one policy module has to be updated before another module can be updated, and they are in separate packages, then synchronised updates to two separate packages might be required for a single change to the upstream policy. I believe that the idea of having policy modules owned by the maintainers of the various daemon packages is not viable. I also believe that most people who package daemons would violently oppose the idea of having to package SE Linux policy if they realised what would be required of them.

Caleb Case seems to believe that ownership of policy can be based either on the distribution developer or on the local sysadmin, with apparently little middle-ground [1]. In the section titled “The Evils of Single Policy Packages” he suggests that if an application is upgraded for a security fix, and that upgrade requires a new policy, then it requires a new policy for the entire system if all the policy is in the same package. However the way things currently work is that upgrading a Debian SE Linux policy package does not blindly install all the new modules. They are stored under /usr/share/selinux/default but the active modules are under /etc/selinux/default/modules/active. An example of just such an upgrade is the Debian Security Advisory DSA-1617-1 for the SE Linux policy for Etch, to address the recent BIND issue [2]. In summary the new version of BIND didn’t work well with the SE Linux policy, so an update was released to fix it. When the updated SE Linux policy package is installed it will upgrade the bind.pp module only if the previous version of the package was known to have the version of bind.pp that didn’t allow named to bind() to most UDP ports – the other policy modules are not touched. I think that this is great evidence that the way things currently work in Debian works well. For the hypothetical case where a user had made local modifications to the bind.pp policy module, they could simply put the policy package on hold – I think it’s safe to assume that anyone who cares about security will read the changelogs for all updates to released versions of Debian, so they would realise the need to do this.
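On Debian holding a package is a single command run as root:

echo selinux-policy-default hold | dpkg --set-selections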

Part of Caleb’s argument rests on the supposed need for end users to modify policy packages (IE to build their own packages from modified source). I run many SE Linux machines, and since the release of the “modular” policy (which first appeared in Fedora Core 5, Debian/Etch, and Red Hat Enterprise Linux 5) I have never needed to make such a modification. I modify policy regularly for the benefit of Debian users and have a number of test machines to try it out. But for the machines where I am a sysadmin I just create a local module that permits the access that is needed (see the example below). The only reason why someone would need to modify an existing module is to remove privileges or to change automatic domain transition rules. Changing automatic domain transitions is a serious change to the policy which is not something that a typical user would want to do – if they were to do such things then they would probably grab the policy source and rebuild all the policy packages. Removing privileges is not something that a typical sysadmin desires; the reference policy is reasonably strict and users generally don’t look for ways to tighten up the policy. In almost all cases it seems best to consider that the policy modules which are shipped by the distribution are owned by the distribution, not the sysadmin. The sysadmin will decide which policy modules to load, what roles and levels to assign to users with the semanage tool, and what local additions to add to the policy. For the CentOS systems I run I use the Red Hat policy; I don’t believe that there is a benefit for me in changing the policy that Red Hat ships, and I think that for people who have less knowledge of SE Linux policy than me there are more reasons not to change such policy and fewer reasons to do so.
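The usual workflow for creating such a local module from denials logged by the kernel looks like this (assuming auditd is logging to /var/log/audit/audit.log):

grep avc /var/log/audit/audit.log | audit2allow -M local
semodule -i local.pp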

Finally Caleb provides a suggestion for managing policy modules by having sym-links to the modules that you desire. Of course there is nothing preventing the existence of a postfix.pp file on the system provided by a package while there is a local postfix.pp file which is the target of the sym-link (so the sym-link idea does not support the idea of having multiple policy packages). With the way that policy modules can be loaded from any location, the only need for sym-links is if you want to have an automatic upgrade script that can be overridden for some modules. I have no objection to adding such a feature to the Debian policy packages if someone sends me a patch.

Caleb also failed to discuss how policy would be initially loaded if packaged on a per-module basis. If for example I had a package selinux-policy-default-postfix which contains the file postfix.pp, how would this package get installed? I am not aware of the Debian package dependencies (or those of any other distribution) being able to represent that the postfix package depends on selinux-policy-default-postfix if and only if the selinux-policy-default package is installed. Please note that I am not suggesting that we add support for such things; a package management system that can solve Sudoku based on package dependency rules is not something that I think would be useful or worth having. As I noted in my previous post about how to package SE Linux policy for distributions [3], the current Debian policy packages have code in the postinst (which I believe originated with Erich Schubert) to load policy modules that match the Debian packages on the system. This means that initially setting up the policy merely requires installing the selinux-policy-default package and rebooting. I am inclined to reject any proposed change which makes the initial install of the policy more difficult than this.

After Debian/Lenny is released I plan to make some changes to the policy. One thing that I want to do is to have a Debconf option to allow users to choose to automatically upgrade their running policy whenever they upgrade the Debian policy package, this would probably only apply to changes within one release (IE it wouldn’t cause an automatic upgrade from Lenny+1 policy to Lenny+2). Another thing I would like to do is to have the policy modules which are currently copied to /etc/selinux/default/modules/active instead be hard linked when the source is a system directory. That would save about 12M of disk space on some of my systems.
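A hard link takes no extra space as long as both paths are on the same filesystem; a sketch of the idea (the exact path of the active module store may differ):

ln /usr/share/selinux/default/postfix.pp /etc/selinux/default/modules/active/modules/postfix.pp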

I’ve taken the unusual step of writing two blog posts in response to Caleb’s post not because I want to criticise him (he has done a lot of good work), but because he is important in the SE Linux community and his post deserves the two hours I have spent writing responses to it. While writing these posts I have noticed a number of issues that can be improved, I invite suggestions from Caleb and others on how to make such improvements.

SE Linux Policy Packaging for a Distribution

Caleb Case (Ubuntu contributor and Tresys employee) has written about the benefits of using separate packages for SE Linux policy modules [1].

Firstly I think it’s useful to consider some other large packages that could be split into multiple packages. The first example that springs to mind is coreutils, which used to be textutils, shellutils, and fileutils. Each of those packages contained many programs and could conceivably have been split. Some of the utilities in that package have been superseded for most uses; for example no-one uses the cksum utility, as md5sum and sha1sum (which are in the same package) are generally used instead. Also the pinky command probably isn’t even known to most users, who use finger instead (apart from newer Unix users who don’t even know what finger is). So in spite of the potential benefit of splitting the package (or maintaining the previous split) it was decided that it would be easier for everyone to have a single package. The merge of the three packages was performed upstream, but there was nothing preventing the Debian package maintainer from splitting the package – apart from the inconvenience to everyone. The coreutils package in Etch takes 10M of disk space when installed; as it’s almost impossible to buy a new hard drive smaller than 80G, that doesn’t seem to be a problem for most users.

The second example is the X server, which has separate packages for each video card. One thing to keep in mind about the X server is that the video drivers don’t change often. While it is quite possible to remove a hard drive from one machine and install it in another, or to duplicate a hard drive to save the effort of a re-install (I have done both many times), they are not common operations in the life of a system. Of course when you do require such an update you need to first install the correct package (out of about 60 choices), which can be a challenge. I suspect that most Debian systems have all the video driver packages installed (along with drivers for Wacom tablets and other hardware devices that might be used) as that appears to be the default. So it seems likely that a significant portion of the users have all the packages installed and therefore get no benefit from the split package.

Now let’s consider the disk space use of the selinux-policy-default package – it’s 24M when installed. Of that, 4.9M is in the base.pp file (the core part of the policy which is required). Then there’s 848K for the X server policy (which is going to be loaded on all Debian systems that have X clients installed – due to an issue with /tmp/.ICE-unix labelling [2]), 784K for the Postfix policy (which is larger than it needs to be – I’ve been planning to fix this for the past four years or so), and 696K for the SSH policy (used by almost everyone). The next largest is 592K for the Unconfined policy; the number of people who choose not to use this will be small, and as it’s enabled by default it seems impractical to provide a way of removing it.

One possibility for splitting the policy is to create a separate package of modules for the less common daemons and services: if modules for INN, Cyrus, distcc, ipsec, kerberos, ktalk, nis, PCMCIA, pcscd, RADIUS, rshd, SASL, and UUCP were in a separate package then that would reduce the installed size of the main package by 1.9M while providing no change in functionality for the majority of users.

One thing to keep in mind is that each package at a minimum will have a changelog and a copyright file (residing in a separate directory under /usr/share/doc) and three files as part of the dpkg data store, each of which takes up at least one allocation unit on disk (usually 4K). So adding one extra package will add at least 24K of disk space to every system that installs it (or 32K if the package has postinst and postrm scripts). This is actually a highly optimal case, the current policy packages (selinux-policy-default and selinux-policy-mls) each take 72K of disk space for their doc directory.

One of my SE Linux server systems (randomly selected) has 23 policy modules installed. If they were in separate packages there would be a minimum of 552K of disk space used by packaging, 736K if there were postinst and postrm scripts, and as much as 2M if the doc directory for each package was similar to the current doc directories. As the system in question needs 5796K of policy modules, the 2M of overhead would make it approach 8M of disk space. So it would only be a saving of 16M over the current situation. While saving that amount of disk space is a good thing, I think that when balanced against the usability issues it’s not worth-while.

Currently the SE Linux policy packages determine what applications are installed and automatically load policy modules to match. I don’t believe that it’s possible to have a package postinst script install other packages (and if it is possible I don’t think it’s desirable). Therefore having separate packages would make a significant difference to the ease of use; it seems that the best way to manage it would be to have the core policy package include a script to install the other packages.

Finally there’s the issue of when you recognise the need for a policy module. It’s not uncommon for me to do some work for a client while on a train, bus, or plane journey. I will grab the packages needed to simulate a configuration that the client desires and then work out how to get it going correctly while on the journey. While it would not be a problem for me (I always have the SE Linux policy source and all packages on hand) I expect that many people who have similar needs might find themselves a long way from net access without the policy package that they need to do their work. Sure, such people could do their work in permissive mode, but that would encourage them to deploy in permissive mode too and thus defeat the goals of the SE Linux project (in terms of having wide-spread adoption).

My next post on this topic will cover the issue of custom policy.

Updated to note that Caleb is a contributor to Ubuntu not a developer.

SpamAssassin During SMTP

For some time people have been telling me about the benefits of SpamAssassin (SA). I have installed it once for a client (at their demand and against my recommendation) but was not satisfied with the result (managing the spam folder was too complex for their users).

The typical configuration of SA has it run after mail has been accepted by the server. Messages that it regards as spam are put into a spam folder. This means that when someone phones you about some important message you didn’t receive, you have to check that folder. Someone who sends mail to a user who has such a SA configuration cannot expect that the message will either be received or rejected (a rejection would give them a bounce message).

Even worse, it seems to be quite common for technical users to train the Bayesian part of SA on messages from the spam folder – without reviewing them! Submitting a folder of spam that has been carefully reviewed for Bayesian training can increase the accuracy of classification (including taking account of locality and language differences in spam). Submitting a folder which is not reviewed means that when a false-positive gets into that folder (which will eventually happen) it is used as training for spam recognition, thus increasing the incidence of false-positives!

Spam has been becoming more of a problem for me recently: on a typical day between 20 and 40 spam messages would get past the array of DNSBL services I use and be re-sent to pass the grey-listing. Also I have been receiving complaints from people who want to send email to me about some of the DNSBL and RHSBL services I use (the rfc-ignorant.org service gets a lot of complaints – there are a huge number of ignorant and lazy people running mail servers).

So now I have installed spamass-milter to have SA run during the SMTP protocol. Then if the SA checks indicate that the message is spam my mail server can just reject the message with a 55x code, which will cause the sending mail server to generate a local bounce (if it’s a legitimate message) or to just discard it in the case of a spam server. Here is how to set it up on Debian/Lenny and CentOS 5:

Install the package with “yum install spamass-milter” or “apt-get install spamass-milter spamassassin spamc” (spamassassin seems to be installed by default on CentOS). On a Debian system the milter will then be set up and running. On CentOS you have to run the following commands:
useradd -m -c "Spamassassin Milter" -s /bin/false spamass-milter
mkdir /var/run/spamass-milter
chown spamass-milter /var/run/spamass-milter
chmod 711 /var/run/spamass-milter
echo SOCKET="/var/run/spamass-milter/spamass.sock" >> /etc/sysconfig/spamass-milter

On CentOS edit /etc/init.d/spamass-milter and change the daemon start line to ‘runuser - spamass-milter -s /bin/bash -c "/usr/sbin/spamass-milter -p $SOCKET -f $EXTRA_FLAGS"’. Then add the following lines below it:
chown postfix:postfix /var/run/spamass-milter/spamass.sock
chmod 660 /var/run/spamass-milter/spamass.sock

The spamass-milter program talks to the SpamAssassin daemon spamd.

On both Debian and CentOS run the command “useradd -c Spamassassin -m -s /bin/false spamassassin” to create an account for SA. The Debian bug #486914 [1] has a request to have SA not run as root by default.

On CentOS it seems that SA wants to use a directory under the spamass-milter home directory; the following commands allow this. It would be good to have it not do that, or maybe it would be better to have the one Unix account used for both SA and the milter.
chmod 711 ~spamass-milter
mkdir ~spamassassin/.spamassassin
chown spamassassin ~spamassassin/.spamassassin

On Debian edit the file /etc/default/spamassassin and add “-u spamassassin -g spamassassin” to the OPTIONS line. On CentOS edit the file /etc/sysconfig/spamassassin and add “-u spamassassin -g spamassassin” to the SPAMDOPTIONS line.

To enable the daemons, on CentOS you need to run “chkconfig spamass-milter on ; chkconfig spamassassin on”, and on Debian edit the file /etc/default/spamassassin and set ENABLED=1.

Now start the daemons: on CentOS use the command “service spamassassin start ; service spamass-milter start”, on Debian use the command “/etc/init.d/spamassassin start”.

Now you have to edit the mail server configuration, for Postfix on CentOS the command “postconf -e smtpd_milters=unix:/var/run/spamass-milter/spamass.sock” will do it, for Postfix on Debian the command “postconf -e smtpd_milters=unix:/var/spool/postfix/spamass/spamass.sock” will do it.

Now restart Postfix and it should be working.

For correct operation you need to ensure that the score needed for a rejection is specified as the same number in both the spamass-milter and SA configuration. If you have a lower number in the spamass-milter configuration (as is the default in Debian) then bounces can be generated – and you should never generate a bounce for a spam. The config file /etc/default/spamass-milter allows you to specify the score for rejecting mail; I am currently using a score of 5. Any changes to the score need matching changes to /etc/mail/spamassassin/local.cf (which has a default required_score of 5 in Debian).
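As a sketch, that means keeping values like the following in sync (the option letters and variable names may vary between versions, so check the documentation on your system):

# /etc/default/spamass-milter
OPTIONS="-u spamass-milter -r 5"
# /etc/mail/spamassassin/local.cf
required_score 5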

You can grep for “spamd..result..Y” in your mail log to see entries for messages that were rejected.

One problem that I have with this configuration on Debian (not on CentOS) is that spamd is logging messages such as “spamd: handle_user unable to find user: ‘russell’“. I don’t want it to look for ~russell when processing mail for russell@coker.com.au because I have a virtual domain set up and the delivery mailbox has a different name. Ideally I could configure it to know the mapping between users and mailboxes (maybe by parsing the /etc/postfix/virtual.db file). But having it simply not attempt to access per-user configuration would be good too. Any suggestions would be appreciated.

Now that I have SpamAssassin running I am getting about 5 spams a day; the difference is significant. The next thing I will do is make some of the DNSBL checks that are prone to false-positives contribute to SpamAssassin scores instead.

When I started writing this post I was not planning to compare the sys-admin experiences of CentOS and Debian. But it does seem that there is less work involved in the task of installing Debian packages.

Executable Stacks in Lenny

One thing that I would like to get fixed for Lenny is the shared objects which needlessly reduce the security of a system by requesting an executable stack. Almost a year ago I blogged about the libsmpeg0 library, which is listed as requiring an executable stack [1]. I submitted a two-line patch which fixes the problem while making no code changes (the patch gives the same result as running “execstack -c” on the resulting shared object).

My previous post documents the results of the problem when running SE Linux (a process is not permitted to run and an AVC message is logged). Some people might incorrectly think that this is merely a SE Linux functionality issue.

The program paxtest (which is in Debian but is i386 only) tests for a variety of kernel security features in terms of memory management. To demonstrate the problem that is caused by this issue I ran the commands “paxtest kiddie” and “LD_PRELOAD=/usr/lib/libsmpeg-0.4.so.0 paxtest kiddie”. The difference is that the test named “Executable stack” returns a result of Vulnerable when the object is loaded.

This means for example that attacks which rely on an executable stack will be permitted if the libsmpeg-0.4.so.0 shared object is loaded. So for example a program that loads the library and which takes data from the Internet (EG FreeCiv in network mode) will become vulnerable to attacks which rely on an executable stack because of this bug!

My Etch SE Linux repository has had a libsmpeg0 package which fixes this bug on i386 for almost a year [2] (the AMD64 packages are more recent). I have now added packages to fix this bug to my Lenny SE Linux repository [3]. I have also volunteered to NMU the package for Lenny. It seems that it would be rather embarrassing for everyone concerned if systems were vulnerable to attack because a two-line patch was not applied for almost a year.

I expect that the Release Team will be very accepting of package updates for Lenny which have patches to address this issue. A patch that has one line per assembler file (in the worst-case) to mark the object code is very easy to review. The results of the patch can be tested easily, and failure to have such a patch opens potential security holes. Package maintainers who can’t fix the assembly code can always run “execstack -c” in the build scripts to give the same result.
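The execstack utility makes both checking and fixing a shared object easy:

execstack -q /usr/lib/libsmpeg-0.4.so.0
# prints “X /usr/lib/libsmpeg-0.4.so.0” if an executable stack is requested
execstack -c /usr/lib/libsmpeg-0.4.so.0
# clears the executable stack flag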

Lintian performs checks for executable stacks and the results are archived here [4]. There are currently 36 packages which contain binaries listed as needing executable stacks; I would be surprised if more than 6 of them actually contain shared objects that need an executable stack. If you use a package that is on that list then please test whether an executable stack is required by running “execstack -c” on the shared object and seeing if it still works. If a test of most of the high-level operations of the program in question can be completed successfully without an executable stack then it’s a strong indication that it’s not needed. Note that execstack is in the prelink package. I am happy to help with writing the patches to the packages and using my repositories to distribute the packages, but am not going to do so unless I can work with someone who uses the program in question and can test its functions. As an example of such testing I played a game of Frozen Bubble to test out the libsmpeg0 patch.

Xen CPU use per Domain

The command “xm list” displays the number of seconds of CPU time used by each Xen domain. This makes it easy to compare the CPU use of the various domains if they were all started at the same time (usually system boot), but it is not very helpful if they were started at different times.

I wrote a little Perl program to display the percentage of one CPU that has been used by each domain; here is a sample of the output:

Domain-0 uses 7.70% of one CPU
demo uses 0.06% of one CPU
lenny32 uses 2.07% of one CPU
unstable uses 0.30% of one CPU

Now the command “xm top” can give you the amount of CPU time used at any moment (which is very useful). But it’s also good to be able to see how much is being used over the course of some days of operation. For example if a domain is using the equivalent of 34% of one CPU over the course of a week (as one domain that I run is doing) then it makes sense to allocate more than one VCPU to it so that things don’t slow down at peak times or when cron jobs are running.
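The calculation in the script is trivial once you have the CPU seconds from “xm list” and the domain’s uptime; for example, with hypothetical values of 205000 CPU seconds over a week of uptime:

awk 'BEGIN { printf "%.2f%% of one CPU\n", 100 * 205000 / (7 * 86400) }'

This prints “33.90% of one CPU”, roughly the case of the domain mentioned above.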

I believe that it’s best to limit the number of CPUs allocated to Xen domains. For example I am running a Xen server with 8 CPU cores. I could grant each domain access to 8 VCPUs, but then any domain could use all that CPU power if it ran wild. If instead I give no domain more than 2 VCPUs then one domain can use all the CPU resources allocated to it without the other domains being impacted. I realise that there are scheduling algorithms in the Xen kernel that are designed to deal with such situations, but I believe that simply denying access to excessive resource use is more effective and reliable.

I have not filed a bug report requesting that my script be added to one of the Xen packages as I’m not sure which one it would belong in (and also it’s a bit of a hack). It’s licensed under the GPL so anyone who wants to use it can do what they want. Any distribution package maintainer who wants to include it in a Xen utilities package is welcome to do so.

A New Strategy for Xen MAC Allocation

When installing Xen servers one issue that arises is how to assign MAC addresses. The Wikipedia page about MAC addresses [1] shows that all addresses which have the second least significant bit of the most significant byte set to 1 are “locally administered”. In practice people just use addresses starting with 02: for this purpose, although any first octet congruent to 2 mod 4 would give the same result. I prefer to use 02: because it’s the best known prefix, and therefore casual observers will be more likely to realise what is happening.

Now if you have a Xen bridge that is private to one Dom0 (for communication between Xen DomUs on the same host) or on a private network (a switch that connects servers owned by one organisation and not connected to machines owned by others) then it’s easy to just pick MAC addresses starting with 02: or 00:16:3e: (the range assigned to the Xen project). But if Xen servers run by other people are likely to be on the same network then there is a problem.

Currently I’m setting up some Xen servers that have public and private networks. The private network will either be a local bridge (that doesn’t permit sending data out any Ethernet ports) or a bridge to an Ethernet port that is connected to a private switch, for that I am using MAC addresses starting with 02:. As far as I am aware there is no issue with machine A having a particular MAC address on one VLAN while machine B has the same MAC address on another VLAN.

My strategy for dealing with the MAC addresses for the public network at the moment is to copy MAC addresses from machines that will never be in the same network. For example if I use the MAC addresses from Ethernet cards in a P3 desktop system running as a router in a small company in Australia then I can safely use them in a Xen server in a co-location center in the US (there’s no chance of someone taking the PCI ethernet cards from the machine in Australia and sending them to the US – and no-one sells servers that can use such cards anyway). Note that I only do this when I have root on the machine in question and where there is no doubt about who runs the machine, so there should not be any risk.

Of course if someone from the ISP analyses the MAC addresses on their network it will look like they have some very old machines in their server room. ;)

I wonder if there are any protocols that do anything nasty with MAC addresses. I know that IPv6 addresses can be based on the MAC address, but as long as the separate networks have separate IPv6 ranges that shouldn’t be a problem. I’m certainly not going to try bridging networks between Australia and the US!

Another possible way of solving this issue would be to have the people who run a server room assign and manage MAC addresses. One way of doing this would be to specify a mapping of IP addresses to MAC addresses, EG you could have the first two bytes be 02:00: and the next four be the same as the IPv4 address assigned to the DomU in question. In the vast majority of server rooms I’ve encountered the number of public IP addresses has been greater than or equal to the number of MAC addresses with the only exception being corporate server rooms where everything runs on private IP address space (but there’s nothing wrong with 02:00:0a: as the prefix for a MAC address).
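Deriving such an address from an IPv4 address is a one-liner; for example for 10.1.2.3:

printf '02:00:%02x:%02x:%02x:%02x\n' 10 1 2 3

This prints “02:00:0a:01:02:03”.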

I also wonder if anyone else is thinking about the potential for MAC collisions. I’ve got Xen servers in a couple of server rooms; I told the relevant people in writing of my precise plans (and was assigned extra IP addresses for all the DomUs) but never had anyone mention any scheme for assigning MAC addresses.

Lenny SE Linux on the Desktop

I have been asked about the current status of Lenny SE Linux on the Desktop.

The first thing to consider is the combinations of policies and configurations. I will number them if only for the purpose of this post; if the numbering is considered generally helpful it could be more widely adopted to describe configurations.

  1. Default configuration. This has the default policy and is configured with all users having the domain unconfined_t and daemons such as POP servers are allowed to access home directories of type unconfined_home_dir_t. This allows such daemons to attack privileged user accounts.
  2. Some restricted users. This is the same as above but with some users restricted. Daemons such as POP servers are only allowed to access the home directories of restricted users. This means that if a user is to have an unconfined account and receive email they must have two Unix accounts or receive their mail under /var/spool/mail. This is one setsebool command and one (or maybe a few) “semanage login -m” commands away from the default configuration (see the example after this list).
  3. All users restricted. The system administrator has the domain sysadm_t and users have domains such as user_t. This requires a few more semanage commands. It is equivalent to the old strict policy.
  4. MLS. This is anything that is based around the MLS policy.
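For example, mapping an existing Unix account to the restricted user_u identity is one command (the account name here is hypothetical, and the matching setsebool command depends on the boolean names in the policy version in use):

semanage login -m -s user_u fred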

Currently I have two desktop machines running Lenny (a test machine and my EeePC) and one server. I have only just switched my test machine to enforcing mode, so I have no good data on it (apart from the fact that I can boot it up and login – which is always a good start). The server is running in permissive mode because I have not yet written the policy to allow the POP server to read from unconfined_home_dir_t. I could get it working by switching from level 1 to level 2 or 3, but I want to get the level 1 server policy working for the benefit of others first.

My EeePC however is fully functional; I have been doing some work on it – that mostly means running an ssh client under GNOME, but that’s OK (desktop environments such as GNOME and KDE are quite complex and demanding, and getting a machine to boot and run such a desktop environment tests out many parts of the system). It’s only at level 1 for the moment because I want to get level 1 working everywhere before moving to the higher levels. I want to get things ready for real users ASAP. With the way the policy is managed now it will be possible to move from level 1 to 2 or 3 without rebooting or interrupting running services. So once users have systems running well at level 1 they can easily increase the security at a later date.

The problems that I have had are due to text relocations in libraries (see my previous post about execmod permission [1]). I’ve filed bug report #493678 against libtheora0 [2] in regard to this issue and included a patch from Fedora (which disables the non-relocatable assembly code in question). It seems that upstream have some new assembler code to try and fix this issue, so hopefully we’ll have something that can make it into Lenny!

I’ve filed bug report #493705 against libswscale0 for the same issue [3]. I included a patch to turn off the assembler code in question but that was not well received. If anyone has some i386 assembler skill and some spare time I would appreciate it if you could try and find a way to make the code position independent while losing little or no performance.

One thing to note is that I am now using an Opteron 1212 (2.0GHz dual-core) system for compiling, I run the i386 DomU with a 64bit kernel (I expect that 32bit user-space runs faster with a 64bit kernel than a 32bit kernel), and the disks are reasonably fast. Even so it takes about 15 minutes to build libswscale0 and the other packages from the same source tree. Previously I was using a 1.0GHz Pentium-3 for my Lenny i386 development until I had the libswscale0 build process go for more than 90 minutes before running out of disk space! If your build machine is old enough to only be 32bit then you should probably plan on watching a movie or going to bed while the build is in progress.

I have built packages that work around the above bugs and included them in my Lenny repository [4]. If you take the packages from that repository plus the Lenny packages then you should have a functional desktop system at level 1. I would appreciate it if people would start testing that and providing feedback. One important issue is the discovery of libraries that want executable stacks, text relocations, and executable memory. The deadline for fixing them properly is even more of a problem due to the number of people who have to be involved in a solution (as compared to the policy, where I can do it on my own).

One final problem is a bug in xdm which causes it to give the wrong context for login sessions due to having an old version of the SE Linux related code [5]. Due to a combination of this and some policy bugs you can not login with xdm. This is not a hugely important issue as most people will use gdm (which has the newer patch) or kdm (which has no SE Linux patch but can use pam_selinux.so). Another option is wdm, which works with pam_selinux.so. I’ve had a response to my bug report suggesting that there’s a bug in the patch (which was taken from gdm, so maybe there’s a bug in the gdm code too). I haven’t responded to that yet as I’ve been concentrating on the things that will make the most impact for Lenny.

At this stage I’m still unsure of when the release team will cut me off and prevent further SE Linux related fixes from going in Lenny. I need at least one more update to the policy packages before Lenny is released. I could release one right now with some useful improvements over what is currently in unstable, but am waiting until I get some other things fixed.

If I get everything fully working at level 1 (both client and server) before Lenny then I will provide a similar status report for users and testers of levels 2 and 3. I don’t expect that I will even get a chance to test level 4 (MLS) properly before Lenny releases.