4

Biba and BLP for Network Services

Michael Janke has written an interesting article about data flows in networks [1], he describes how data from the Internet should be considered to have low integrity (he refers to it as “untrusted”) and that as you get closer to the more important parts of the system it needs to be of higher integrity.

It seems to me that his ideas are very similar in concept to the Biba Integrity Model [2]. The Biba model is based around the idea that a process can only write data to a resource that is of equal or lower integrity and only read data from a resource that is of equal or higher integrity, this is often summarised as “no read-down and no write-up“. In a full implementation of Biba the OS would label all data (including network data) as to it’s integrity level and prevent any communication that violates the model (except of course for certain privileged programs – for example the file or database that stores user passwords must have high integrity but any user can run the program to change their password). A full Biba implementation would not work for a typical Internet service, but considering some of the concepts of Biba while designing an Internet service should lead to a much better design (as demonstrated in Michael’s post).

While considering the application of Biba to network design it makes sense to also consider consider the Bell LaPadula model (BLP) [3]. In computer systems designed for military use a combination of Biba and BLP is not uncommon, while a strict combination of those technologies would be an almost insurmountable obstacle to development of Internet services I think it’s worth considering the concepts.

BLP is a system that is primarily designed around the goal of protecting data confidentiality. Every process (subject) has a sensitivity label (often called a “clearance”) which is comprised of a sensitivity level and a set of categories and every resource that a process might access (object) also has a sensitivity label (often called a “classification”). If the clearance of the subject dominates the classification of the object (IE the level is equal or greater and the set of categories is a super-set) then read access is permitted, if the clearance of the subject is dominated by the classification of the object then write access is permitted, and the clearance and classification have to be equal for read/write access to be permitted. This is often summarised as “no write-down and no read-up“.

SGI has published a lot of documentation for their Trusted Irix (TRIX) product on the net, the section about mandatory access control covers Biba and BBLP [4]. I recommend that most people who read my blog not read the description of how Biba and BLP works, it will just give you nightmares.

The complexity of either Biba or BLP (including categories) is probably too great for consideration when designing network services which have much lower confidentiality requirements (even the loss of a few million credit card numbers is trivial compared to some of the potential results of leaks of confidential military data). But a simpler case of BLP with only levels is worth considering. You might have credit card numbers stored in a database classified as “Top Secret” and not allow less privileged processes to read from it. The data about customers addresses and phone numbers might be classified as “Secret” and all the other data might merely be “Classified”.

One way of using the concepts of Biba and BLP in the design of a complex system would be to label every process and data store in the system according to it’s integrity and classification/clearance. Then for the situations where data flows to processes with lower clearance the code could be well designed and audited to ensure that it does not leak data. For situations where data of low integrity (EG data from a web browser) is received by a process of high integrity (EG the login screen) the code would have to be designed and audited to ensure that it correctly parsed the data and didn’t allow SQL injection or other potential attacks.

I expect that many people who have experience with Biba and BLP will be rolling their eyes while reading this. The situation that we are dealing with in regard to PHP and SQL attacks over the Internet is quite different to the environments where proper implementations of Biba and BLP are deployed. We need to do what we can to try and improve things, and I think that the best way of improving things in terms of web application security would involve thinking about clearance and integrity as separate issues in the design phase.

SE Linux Policy Loading

One of the most significant tasks performed by a SE Linux system is loading the “policy“. The policy is the set of rules which determine what actions are permitted by each domain.

When I first started using SE Linux (in 2001) the kernel knew where to find the policy file and would just read the data from disk as soon as it had mounted the root filesystem. Doing such things is generally considered to be a bad idea, but it was an acceptable mechanism for an early release.

One issue is that the policy needs to be loaded very early in the system boot process, before anything significant happens. In the early days the design of SE Linux had no support for a process to change it’s security context other than by executing another process (similar to the way a non-root process in the Unix access control system can not change it’s UID, GID, or groups). Although later on support for this was added, it was only available as the request of the application (an external process could not change the context of an application without using ptrace – a concept that is too horrible to contemplate) and I am not aware of anyone actually using it. So it’s almost a requirement that there be no more than one active process in the system at the time that policy is loaded, therefore it must be init or something before init that loads the policy.

When it was decided that a user-space program had to instruct the kernel to load the policy we had to determine which program should do it and when it should be done, with the constraint that it had to be done early. The most obvious solution to this problem was to load the policy in the initramfs (or initrd as it was known at the time). One problem with this is that the initramfs is limited in size by kernel compilation options and may need to be recompiled to fit a bigger policy. As an experiment to work around this limitation I had a small policy (which covered the domains for init and the scripts needed for the early stages for system boot) loaded in the initramfs and then later in the boot process (some time after the root filesystem was mounted read/write) the full policy was loaded.

A more serious problem with including policy in the initramfs was that it required rebuilding the initramfs every time the policy changed in a significant way, of course scripts could not determine when a change was significant (neither could most users) so that required needless rebuilds (which wastes time). Even with a small policy for early booting loaded it was still sometimes necessary to change it and update the initramfs. I believe that as a general rule an initramfs should only be rebuilt when a new kernel is installed or when a radical change is made to the boot process (EG moving from single disk to software RAID, changing between AMD and Intel CPU architecture, changing SCSI controller, or anything else that would make the old initramfs not boot the machine). The initramfs that was used to boot my machine is known to actually work, the same can not be said for any new initramfs that I might generate.

But the deciding factor for me was support of machines that did not use an initramfs or initrd (such as the Cobalt machines [1] I own).

To solve these problems I first experimented with a wrapper for init. The idea was to divert the real init to another file name (or use the init= kernel command-line option) and then have the wrapper load the policy before running the real init. I never intended that to be a real solution, just to demonstrate a point. Once I had proven that it was possible to load the policy from user-space before running the real init program it was a small step to patch init to do this.

One slightly tricky aspect of this was in getting the correct security context for init. The policy has always been written to allow a domain transition from kernel_t to init_t when a file of type init_exec_t is executed. The domain kernel_t is applied to all active processes (including kernel threads) at the time the policy is loaded. So init only has to re-exec itself to get the correct context. Fortunately init is designed to do this in the case of an upgrade so this was easy to manage.

Since that time every implementation of SE Linux apart from some embedded systems has used init to load the policy.

The latest trend in Linux distributions seems to be using upstart [2] as a replacement for the old SysV Init. The Fedora developers decided to make nash (a program that comes from the mkinitrd source tree in Fedora and is a utility program for a Red Hat based initramfs) load the SE Linux policy as it would apparently be painful to patch every init to load the policy.

As far as I am aware there are only three different init programs in common use in Linux, the old SysV Init (which used to be used by everyone), Busybox (for rescuing broken systems and for embedded systems), and now Upstart (used by Ubuntu and Red Hat since Fedora 9). Embedded systems need to work differently to other systems in many ways (having the one Busybox program supply the equivalent to most coreutils in one binary is actually a small difference compared to the other things), and modifying the policy load process for embedded systems is trivial compared to all the other SE Linux work needed to get an embedded system working well. There are at least two commonly used initramfs systems (the Debian and Red Hat ones) and probably others. As one init system (SysV Init) already has SE Linux support it seems that only one needs to be patched to have complete coverage. I’ve just written a patch for Upstart (based on the version in Debian/Experimental) and sent it to an Ubuntu developer who’s interested in such things. I also volunteer to patch any other init system that is included in Debian (I am aware of minit and will patch it as soon as it’s description does not include “this package is experimental and not easy to install and use“).

It seems to me that repeating the work which was done for SysV Init and upstart for any other init system will be little effort, at worst no greater than patching an initramfs systems (and I’ll do it). As the number of initramfs systems that would need to be patched would exceed the number of init systems it seems that less work is involved in patching the init systems.

The amount of RAM required by the initramfs is in some situations a limitation on the use of a system, when I recently did some tests on swap performance by reducing the amount of RAM available to a Xen DomU [3] it was the initramfs that limited how small I could go. So adding extra code to the initramfs is not desired. While this will be a small amount of code in some situations (when I patched /sbin/init from Upstart it took an extra 64 bytes of disk space on AMD64), dragging in the libraries can take a moderate amount of space (the fact that an LVM or encrypted root filesystem causes SE Linux libraries to be included in the initramfs is something that I consider to be a bug and plan to fix).

Finally not all boot loaders support an initrd or initramfs. I believe that any decision which prevents using such sweet hardware as Cobalt Qube and Raq machines from being used with SE Linux is a mistake. I have both Qube and Raq machines running fine with Debian SE Linux and plan to continue making sure that Debian will support SE Linux on such hardware (and anything with similar features and limitations).

6

Review of the EeePC 701

I have just bought a EeePC 701 [1], I chose the old model because it’s significantly smaller than the 90x series and a bit lighter too and it had Linux pre-loaded. Also it was going cheap, while I am not paying for it I give the same attention to saving my clients’ money as to saving my own. I ruled out everything that was heavier or larger than an EeePC 901 and everything that cost more than $700. That left only the Linux version of the EeePC 901 (which I couldn’t find on sale) and the EeePC 701 as my options. I also excluded the EeePC 900 because it is bigger and heavier than the 701 but has the same CPU (and therefore can’t run Xen).

In terms of it’s prime purpose (a SSH client) 96*22 characters is the size of a konsole screen with the default (medium) font size when the window is maximised, that is 5% more characters than the standard terminal size of 80*25 but the smaller number of lines is a problem. When using the small font I can get 129*29 which I find quite comfortable to read (it would be impossible for me to read withut glasses – which means having almost average vision for someone in their 30’s). Then I can get 129*31 if I dismiss the tool bar at the bottom of the screen and could probably get another couple of rows if I removed the tabs that Konsole uses to switch between sessions. That would only be viable if using screen extensively as a single session without the ability to switch between programs is not particularly useful (I don’t think that ALT-TAB is adequate for switching between terminal sessions). When running Debian I can get 130*32 with the same font due to smaller window controls, but I’ll write more about converting to Debian in another post. Note that while the OS that ships with the Linux based EeePC machines is based on Debian, it is heavily customised and has some notable differences from a typical Debian install. It has some proprietary software, and uses unionfs for the root and /home filesystems.

The first issue is that Console (the KDE terminal program) can only be accessed from the file manager (via the tools menu or ^t), the machine clearly doesn’t have defaults for someone like me. In principle it’s a multi-user system that can be fully customised, but in practice it’s configured as a single-user machine. Once you have a Console window open you can run “su -” and the root password is the password for the “user” account.

I wonder whether I could get more than 42 rows or more than 140 columns of text that is readable. If so then I could have two console windows fully displayed on screen.

The screen is bright and clear, this is essential as the number of pixels per character is going to be low for any reasonable amount of text to be displayed on screen.

The password that you set when you first use the machine also works for “su -” (in fact that is the only real use as I expect that almost everyone will choose the automatic login option).

The display comprises a significant portion of the weight, if the screen is fully open (about 150 degrees) then it will tip over. Even when the screen is not as far open it will tip due to bumps if resting on your lap on a tram. It’s a pity that the screen is connected at the very back of the base, if the attachment point was a bit closer then it would balance better and also be easy to hold with one hand. The depth of the machine combined with the angle at the back makes it impossible for me to get a good one-handed grip from the base, so typing while standing on a moving tram or bus will be extremely difficult (unlike my iPaQ which I can use at full speed on any form of public transport). Inidentally it would be good if there was an attachment point for a wrist-strap, every camera and most mobile phones have them so it would be good to have the same safety feature in a laptop to facilitate use on public transport. Another reason for not using it as a PDA is the fact that it takes about 7 seconds to resume after hibernating (when the lid is closed).

The PSU is almost as small as that of a mobile phone! This is a major benefit as in the past I have often stored a Thinkpad PSU at a client site for 9-5 jobs as it is heavy enough that I didn’t want to carry it on public transport. The EeePC PSU is light enough that it won’t be unpleasant to carry, and small enough to fit easily into a jacket pocket.

The OS installation is very well done for the basics. It’s easy to launch applications and there is a good selection of educational programs (including the periodic table, planetarium, typing, letters, and hangman, drawing, and a link to www.skoool.com). It’s a pity that they organised the folders according to the area rather than the age, but generally the OS is very well done. Most of the reviews focus on the speed of the CPU, the RAM expansion options, etc, but miss the fact that it is a really nice machine for using as-is.

It is a much better machine for teaching children than any of the machines I’ve seen which are sold specifically for children (see my previous post about an awful computer for kids [2]). I believe that you could give an EeePC to any 3yo and have them doing something useful in a matter of minutes! The ability to freely install new software should not be overlooked when considering a computer for children to use. Someone who buys one now could use it for a few years as an ssh client and then reinstall the original OS and give it to a child as an educational toy.

It has a program to create OGG video files from it’s built-in camera and mic, at this moment it’s the only device I own that can create OGG files (this is a good thing). OGG compression takes a really long time, the Atom CPU in the 901 and 1000x series would be good for this. It’s a pity that the microphone is directly below the mouse buttons, it clearly records the mouse click used to finish recording. The version of mplayer which is installed to play OGG files can also play FLV files downloaded from youtube with youtube-dl (althrough the file association is not set). When I tried to play a MP4 file from ted.com it only gave audio (the video works on a full Debian/Etch installation).

There is a full set of office software, OpenOffice.org, Thunderbird email client, and all the other stuff you might expect.

I find that the biggest problem for using it is the size of the keyboard, I don’t think I’ll ever be able to touch-type properly on it. Not only are the keys small but the positions of non-alphabetical keys are slightly different from most keyboards. Another problem is that the space-bar needs to be pressed near the centre, I usually like to press near the end but that doesn’t work. Those issues are all trade-offs of the small size. My T series Thinkpad is reasonably portable and has a great keyboard so I have the serious typing while travelling angle covered.

One real mistake in the keyboard design is the lack of a LED to indicate whether caps-lock is enabled (there are four status LEDs, adding number five couldn’t be that expensive), this is a real problem when using the vi editor (which uses letters as editor commands and is case-sensitive). It will also occasionally cause problems when entering passwords. There is a caps-lock indicator on screen, but that is in the toolbar at the bottom of the scren (which I like to dismiss to gain an extra two lines of text). It would be good if I could display the status of the caps and num lock keys in the right side of the title-bar of the active window (the only bit of unused space on the screen).

The cooling fan makes an annoying buzz. It’s significantly louder than Thinkpads which dissipate a lot more heat.

The other problems are all software, which is OK as I plan to reinstall it. Firstly it is using Debian and shipped with the broken openssl library. There is a GUI for installing upgrades, but it recommends rebooting after installing each one! Naturally I didn’t choose to reboot, I installed the security update #1.

Then I clicked on the button to install a BIOS update. It told me that it had to reboot to apply the update and gave only one button (OK), I tried closing the window but it rebooted anyway (fortunately the vi swap file allowed me to recover this post – which I am entirely writing on the EeePC).

Aftere booting up again I discovered that the libssl bug still wasn’t fixed and that there was a second udate to apply! Why can’t they have a “apply all updates” button and also have it not automatically reboot? This must be the only Debian-based distribution that forces Windows-style reboots.

But that said, while they made some mistakes in their software it generally provides a good user experience

7

Pollution and Servers

There is a lot of interest in making organisations “green” nowadays. One issue is how to make the IT industry green. People are talking about buying “offsets” for CO2 production, but the concern is that some of the offset schemes are fraudulent. Of course the best thing to do is to minimise the use of dirty power as much as possible.

Of course the first thing to do is to pay for “green power” (if available) and if possible install solar PV systems on building roofs. While the roof space of a modern server room would only supply a small amount of the electricity needed (maybe less than needed to power the cooling) every little bit helps. The roof space of an office building can supply a significant portion of the electricity needs, two years ago Google started work on instralling Solar PV panels on the roof of the “Googleplex” [1] with the aim of supplying 30% of the building’s power needs.

For desktop machines a significant amount of power can be saved if they are turned off overnight. For typical office work the desktop machines should be idle most of the time, so if the machine is turned off outside business hours then it will use something close to 45/168 of the power that it might otherwise use. Of course this requires that the OS support hibernation (which isn’t supported well enough in Linux for me to want to use it) or that applications can be easily stopped and restarted so that the system can be booted every morning. One particular corner case is that instant-messaging systems need to be server based with an architecture that supports storing messages on the server (as Jabber does [2]) rather than requiring that users stay connected (as IRC does). Of course there are a variety of programs to proxy the IRC protocol and using screen on a server to maintain a persistent IRC presence is popular among technical users (for a while I used that at a client site so that I could hibernate the PowerMac I had on my desktop when I left the office).

It seems that most recent machines have BIOS support for booting at a pre-set time. This would allow the sys-admin to configure the desktop machines to boot at 8:00AM on every day that the office is open. That way most employees will arrive at work to find that their computer is already booted up and waiting for them. We have to keep in mind the fact that when comparing the minimum pay (about $13 per hour in Australia) with the typical electricity costs ($0.14 per KWh – which means that a desktop computer might use $0.14 of electricity per day) there is no chance of saving money if employee time is wasted. While companies are prepared to lose some money in the process of going green, they want to minimise that loss as much as possible.

The LessWatts.org project dedicated to saving energy on Linux systems reports that Gigabit Ethernet uses about 2W more power than 100baseT on the same adapter [3]. It seems most likely that similar savings can be achieved from other operating systems and also from other network hardware. So I expect that using 100baseT speed would not only save about 2W at the desktop end, but it would also save about 2W at the switch in the server-room and maybe 1W in cooling as well. If you have a 1RU switch with 24 Gig-E ports then that could save 48W if the entire switch ran at 100baseT speed, compared to a modern 1RU server which might take a minimum of 200W that isn’t very significant.

The choice of server is going to be quite critical to power use, it seems that all vendors are producing machines that consume less power (if only so that they can get more servers installed without adding more air-conditioning), so some effort in assessing power use before purchase could produce some good savings. When it comes time to decommission old servers it is a good idea to measure the power use and decommission the most power hungry ones first whenever convenient. I am not running any P4 systems 24*7 but have a bunch of P3 systems running as servers, this saves me about 40W per machine.

It’s usually the case that the idle power is a significant portion of the maximum power use. In the small amount of testing I’ve done I’ve never been able to find a case where idle power was less than 50% of the maximum power – of course if I spun-down a large number of disks when idling this might not be the case. So if you can use one virtual server that’s mostly busy instead of a number of mostly idle servers then you can save significant amounts of power. Before I started using Xen I had quite a number of test and development machines and often left some running idle for weeks (if I was interrupted in the middle of a debugging session it might take some time to get back to it). Now if one of my Xen DomU’s doesn’t get used for a few weeks it uses little electricity that wouldn’t otherwise be used. It is also possible to suspend Xen DomU’s to disk when they are not being used, but I haven’t tried going that far.

Xen has a reputation for preventing the use of power saving features in hardware. For a workstation this may be a problem, but for a server that is actually getting used most of the time it should not be an issue. KVM development is apparently making good progress, and KVM does not suffer from any such problems. Of course the down-side to KVM is that it requires an AMD64 (or Intel clone) system with hardware virtualisation, and such systems often aren’t the most energy efficient. A P3 system running Xen will use significantly less power than a Pentium-D running KVM – server consolidation on a P3 server really saves power!

I am unsure of the energy benefits of thin-client computing. I suspect that thin clients can save some energy as the clients take ~30W instead of ~100W so even if a server for a dozen users takes 400W there will still be a net benefit. One of my clients does a lot of thin-client work so I’ll have to measure the electricity use of their systems.

Disks take a significant amount of power. For a desktop system they can be hibernated at times (an office machine can be configured such that the disks can spin-down during a lunch break). This can save 7W per disk (the exact amount depends on the type of disk and the efficiency of the PSU – (see the Compaq SFF P3 results and the HP/Compaq Celeron 2.4GHz on my computer power use page [4]). Network booting of diskless workstations could save 7W for the disk (and also reduce the noise which makes the users happy) but would drive the need for Gigabit Ethernet which then wastes 4W per machine (2W at each end of the Ethernet cable).

Recently I’ve been reading about the NetApp devices [5]. By all accounts the advanced features of the NetApp devices (which includes their algorithms for the use of NVRAM as write-back cache and the filesystem journaling which allows most writes to be full stripes of the RAID) allow them to deliver performance that is significantly greater than a basic RAID array with a typical filesystem. It seems to me that there is the possibility of using a small number of disks in a NetApp device to replace a larger number of disks that are directly connected to hosts. Therefore use of NetApp devices could save electricity.

Tele-commuting has the potential to save significant amounts of energy in employee travel. A good instant-messaging system such as Jabber could assist tele-commuters (it seems that a Jabber server is required for saving energy in a modern corporate environment).

Have I missed any ways that sys-admins can be involved in saving energy use in a corporation?

Update: Albert pointed out that SSD (Solid State Disks) can save some power. They also reduce the noise of the machine both by removing one moving part and by reducing heat (and therefore operation of the cooling fan). They are smaller than hard disks, but are large enough for an OS to boot from (some companies deliberately only use a small portion of the hard drives in desktop machines to save space on backup tapes). It’s strange that I forgot to mention this as I’m about to buy a laptop with SSD.

9

Is a GPG pass-phrase Useful?

Does a GPG pass-phrase provide a real benefit to the majority of users?

It seems that there will be the following categories of attack which result in stealing the secret-key data:

  1. User-space compromise of account (EG exploiting a bug in a web browser or IRC client).
  2. System compromise (EG compromising a local account and exploiting a kernel vulnerability to get root access).
  3. Theft of the computer system while powered down when the system was configured to not use swap or to encrypt the swap space with a random key at boot time.
  4. Theft of a computer system while running or that did not have encrypted swap.
  5. Theft of unencrypted backup media.

Category 1 will permit an attacker to monitor user processes and intercept one that asks for a GPG pass-phrase as well as to copy the secret key. Category 2 will do the same but for all users on the system.

Category 3 will give the potential for stealing the private key (if it’s not encrypted) but no direct potential for getting the pass-phrase.

Category 4 has the potential for copying a pass-phrase from memory or swap. I am inclined to trust Werner Koch (and anyone else who submitted code to the GPG project) to have written code to correctly lock memory and scrub pass-phrase data and decrypted private key data from memory after use. But I really doubt the ability of most people who write code to interface with GPG to do the same. So every time that a GUI program prompts for a GPG pass-phrase I think that there is the potential for it to be stored in swap or to remain indefinitely in RAM. Therefore stealing a machine that does not have it’s swap-space encrypted with a random key (which is the most practical way of encrypting swap) or stealing a running machine (as mentioned in a previous post [1]) can potentially grant a hostile party access to the pass-phrase.

So it seems to me that out of all the possible ways of getting access to a GPG private key, the only ones where a pass-phrase ones is going to really do some good are categories 3 and 5. While it’s good to protect against those situations, it seems to me that the greatest risk to a GPG key is from category 1, with category 2 following close behind.

I previously wrote about the slow progress towards using SE Linux and GPG code changes to make it more difficult to steal the secret key [2] – something that I’ve been occasionally working on over the last 6 years.

Now it seems to me that the same benefits can and should be made available to people who don’t use SE Linux. If a system directory such as /var/spool/gpg was mode 1770 then gpg could be setgid to group “gpg” so that it could create and access secret keys for users under /var/spool/gpg while the users in question could not directly access them. Then the sys-admin would be responsible for backing up GPG keys. Of course it would probably be ideal to have an option as to whether a new secret key would be created in the system spool or in the user home directory, and migrating the key from the user home directory to the system spool would be supported (but not migrating it back).

This would mean that an attacker who compromised a local user account (maybe through a vulnerability in a web browser or MUA) would not be able to get the GPG secret key. They could probably get the pass-phrase by ptracing the MUA (or some other GUI process that calls GPG) but without the secret key itself that would not do as much good – of course once they had the pass-phrase and local access they could use the machine so sign and decrypt data which would still be a bad thing. But it would not have the same scope as stealing the secret key and the pass-phrase.

I look forward to reading comments on this post.

6

Label vs UUID vs Device

Someone asked on a mailing list about the issues related to whether to use a label, UUID, or device name for /etc/fstab.

The first thing to consider is where the names come from. The UUID is assigned automatically by mkfs or mkswap, so you have to discover it after the filesystem or swap space has been made (or note it during the mkfs/mkswap process). For the ext2/3 filesystems the command “tune2fs -l DEVICE” will display the UUID and label (strangely mke2fs uses the term “label” while the output of tune2fs uses the term “volume name“). For a swap space I don’t know of any tool that can extract the UUID and name. On Debian (Etch and Unstable) the file command does not display the UUID for swap spaces or ext2/3 filesystems and does not display the label for ext2/3 filesystems. After I complete this blog post I will file a bug report.

If you are using a version of Debian earlier than Lenny (or a version of Unstable with this bug fixed) then you will be able to easily determine the label and UUID of a filesystem or swap space. Other than that the inconvenience of determining the UUID and label will be a reason for not using them in /etc/fstab (keep in mind that sys-admin work sometimes needs to be done at 3AM).

One problem with mounting by UUID or label is that it doesn’t work well with snapshots and block device backups. If you have a live filesystem on /dev/sdc and an image from a backup on /dev/sdd then there is a lot of potential for excitement when mounting by UUID or label. Snapshots can be made by a volume manager (such as LVM), a SAN, or an iSCSI server.

Another problem is that if a file-based backup is made (IE tar or cpio) then you lose the UUID and label. tune2fs allows setting the UUID, but that seems like a potential recipe for disaster. So this means that if mounting by UUID then you would potentially need to change /etc/fstab after doing a full filesystem restore from a file-based backup, this is not impossible but might not be what you desire. Setting the label is not difficult, but it may be inconvenient.

When using old-style IDE disks the device names were of the form /dev/hda for the first disk on the first controller (cable) and /dev/hdd for the second disk on the second controller. This was quite unambiguous, adding an extra disk was never going to change the naming.

With SCSI disks the naming issue has always been more complex, and which device gets the name /dev/sda was determined by the order in which the SCSI HAs were discovered. So if a SCSI HA which had no disks attached suddenly had a disk installed then the naming of all the other disks would change on the next boot! To make things more exciting Fedora 9 is using the same naming scheme for IDE devices as for SCSI devices, I expect that other distributions will follow soon and then even with IDE disks permanent names will not be available.

In this situation the use of UUIDs or LABELS is required for the use of partitions. However a common trend is towards using LVM for all storage, in this case LVM manages labels and UUIDs internally (with some excitement if you do a block device backup of an LVM PV). So LV names such as /dev/vg0/root then become persistent and there is no need for mounting via UUID or label.

The most difficult problem then becomes the situation where a FC SAN has the ability to create snapshots and make them visible to the same machine. UUID or label based mounting won’t work unless you can change them when creating the snapshot (which is not impossible but is rather difficult when you use a Windows GUI to create snapshots on a FC SAN for use by Linux systems). I have had some interesting challenges with this in the past when using a FC based SAN with Linux blade servers, and I never devised a good solution.

When using iSCSI I expect that it would be possible to force an association between SCSI disk naming and names on the server, but I’ve never had time to test it out.

Update: I have submitted Debian bug #489865 with a suggested change to the magic database.

Below are /etc/magic entries for displaying the UUID and label on swap spaces and ext2/3 filesystems:

Continue reading

Shared Context and Blogging

One interesting aspect of the TED conference [1] is the fact that they only run one stream. There is one lecture hall with one presentation and everyone sees the same thing. This is considerably different to what seems to be the standard practice for Linux conferences (as implemented by LCA, OLS, and Linux Kongress) where there are three or more lecture halls with talks in progress at any time. At a Linux conference you might meet someone for lunch and start a conversation by asking “did you attend the lecture on X“, as there are more than two lecture halls the answer is most likely to be “no“, which then means that you have to describe the talk in question before talking about what you might really want to discuss (such as how a point made in the lecture in question might impact the work of the people you are talking two). In the not uncommon situation where there is an interesting implication of combining the work described in two lectures it might be necessary to summarise both lectures before describing the implication of combining the work.

Now there are very good reasons for running multiple lecture rooms at Linux conferences. The range of topics is quite large and probably very few delegates will be interested in the majority of the talks. Usually the conference organisers attempt to schedule things to minimise the incidence of people missing talks that interest them, one common way of doing so is to have conference “streams”. Of course when you have for example a “networking” stream, a “security” stream, and a “virtualisation” stream then you will have problems when people are interested in the intersection of some of those areas (virtual servers do change things when you are working on network security).

There seem some obvious comparisons between Planet installations (as aggregates of RSS feeds) and conferences (as aggregates of lectures). On Planet Debian [2] there has traditionally been a strong shared context with many blog posts referring to the same topics – where one person’s post has inspired others to write about similar topics. After some discussion (on blogs and by email) it was determined that there would be no policy for Planet Debian and that anyone who doesn’t want to read some of the content should filter the feed. Of course this means that the number of people who read (or at least skim) the entire feed will drop and therefore we lose the shared context.

Planet Linux Australia [3] currently has a discussion about the issue of what types of content to aggregate. Michael Davies has just blogged a survey about what types of content to include [4]. I think it’s unfortunate that he decided to name the post after one blogger who’s feed is aggregated on that Planet as that will encourage votes on the specific posts written by that person rather than the general issue. But I think it’s much better to tailor a Planet to the interests of the people who read it than to include everything and encourage readers to read a sub-set.

When similar issues were in discussion about Planet Debian I wrote about my ideas on the topic [5]. In summary I think that the Gentoo idea of having two Planet installations (one for the content which is most relevant and one for everything that is written by members) is a really good one. It’s also a good thing to have a semi-formal document about the type of content that is expected – this would be useful both for using a limited feed for people who go significantly off-topic and as a guideline for people who want to write posts that will be appreciated by the majority of the readers. Planet Ubuntu has a guideline, but it was not very formal last time I checked.

Finally in regard to short posts, they generally don’t interest me much. If I want to get a list of hot URLs then I could go to any social media site to find some. I write a list post at most once a month, and I generally don’t include a URL in the list unless I have a comment to make about it. I always try to describe each page that I link to in enough detail that if the reader can’t view it then they at least have some idea of what it is about (no “this is cool” or “this sucks” links).

12

Kernel Security vs Uptime

For best system security you want to apply kernel security patches ASAP. For an attacker gaining root access to a machine is often a two step process, the first step is to exploit a weakness in a non-root daemon or take over a user account, the second step is to compromise the kernel to gain root access. So even if a machine is not used for providing public shell access or any other task which involves giving user access to potential hostile people, having the kernel be secure is an important part of system security.

One thing that gets little consideration is the overall effect of applying security updates on overall uptime. Over the last year there have been 14 security related updates (I count a silent data loss along with security issues) to the main Debian Etch kernel package. Of those 14, it seems that if you don’t use DCCP, NAT for CIFS or SNMP, IA64, the dialout group, then you will only need to patch for issues 2, 3 (for SMP machines), 4, 5, 7 (sound drivers get loaded on all machines by default), 9, 10, 11, 12, 13, and 14.

This means 11 reboots a year for SMP machines and 10 a year for uni-processor machines. If a reboot takes three minutes (which is an optimistic assumption) then that would be 30 or 33 minutes of downtime a year due to kernel upgrades. In terms of uptime we talk about the number of “nines”, where the ideal is generally regarded as “five nines” or 99.999% uptime. 33 minutes of downtime a year for kernel upgrades means that you get 99.993% uptime (which is “four nines”). If a reboot takes six minutes (which is not uncommon for servers) then it’s 99.987% uptime (“thee nines”).

While it doesn’t seem likely to affect the number of “nines” you get, not using SMP has the potential to avoid future security issues. So it seems that when using a Xen (or other virtualisation technology) assigning only one CPU to the DomUs that don’t need any more could improve uptime for them.

For Xen Dom0’s which don’t have local users or daemons, don’t use DCCP, NAT for CIFS or SNMP, wireless, CIFS, JFFS2, PPPoE, bluetooth, H.323 or SCTP connection tracking, then only issue 11 applies. However for “five nines” you need to have 5 minutes of downtime a year or less. It seems unlikely that a busy Xen server can be rebooted in 5 minutes as all the DomUs need to have their memory saved to disk (writing out the data to disk and reading it back in after a reboot will probably take at least a couple of minutes) or they need to be shutdown and booted again after the Dom0 is rebooted (which is a good procedure if the security fix affects both Dom0 and DomU use), and such shutdowns and reboots of DomU’s will take a lot of time.

Based on the past year, it seems that a system running as a basic server might get “four nines” if configured for a fast boot (it’s surprising that no-one seems to be talking about recent improvements to the speed of booting as high-availability features) and if the boot is slower then you are looking at “three nines”. For a Xen server unless you have some sort of cluster it seems that “five nines” is unattainable due to reboot times if there is one issue a year, but “four nines” should be easy to get.

Now while the 14 issues over the last year for the kernel seems likely to be a pattern that will continue, the one issue which affects Xen may not be representative (small numbers are not statistically significant). I feel confident in predicting a need for between 5 and 20 kernel updates next year due to kernel security issues, but I would not be prepared to bet on whether the number of issues affecting Xen will be 0, 1, or 4 (it seems unlikely that there would be 5 or more).

I will write a future post about some strategies for mitigating these issues.

Here is my summary of the Debian kernel linux-image-2.6.18-6-686 (Etch kernel) security updates according to it’s changelog, they are not in chronological order, it’s the order of the changelog file:
Continue reading

7

Car vs Public Transport to Save Money

I’ve just been considering when it’s best to drive and when it’s best to take public transport to save money. My old car (1999 VW Passat) uses 12.8L/100km which at $1.65 per liter means 21.1 cents per km on fuel. A new set of tires costs $900 and assuming that they last 20,000km will cost 4.5 cents per km. A routine service every 10,000Km will cost about $300 so that’s another 3 cents per km. While it’s difficult to estimate the cost per kilometer of replacing parts that wear out, it seems reasonable to assume that over 100,000Km of driving at least $20,000 will be spent on parts and the labor required to install them, this adds another 20 cents per km.

The total then would be 48.6 cents per km. The tax deduction for my car is 70 cents per km of business use, so if my estimates are correct then the tax deductions exceed the marginal costs of running a vehicle (the costs of registration, insurance, and asset depreciation however make the car significantly more expensive than that – see my previous post about the costs of owning a small car for more details [1]). So for business use the marginal cost after tax deductions are counted is probably about 14 cents per km.

Now a 2 hour ride on Melbourne’s public transport costs $2.76 (if you buy a 10 trip ticket). For business use that’s probably the equivalent cost to 20Km of driving. The route I take when driving to the city center is about 8Km, that gets me to the nearest edge of the CBD (Central Business District) and doesn’t count the amount of driving needed to find a place to park. This means the absolute minimum distance I would drive when going to the CBD would be 16Km. The distance I would drive on a return trip to the furthest part of the CBD would be almost exactly 20km. So on a short visit to the central city area I might save money by using my car if it’s a business trip and I tax-deduct the distance driven. A daily ticket for the public transport is equivalent to two 2 hour tickets (if you have a 10 trip ticket then if you use it outside the two hour period it becomes a daily ticket and uses a second credit). If I could park my car for an out of pocket expense of less than $2.76 (while I can tax-deduct private parking it’s so horribly expensive that it would cost at least $5 after deductions are counted) then I could possibly save money by driving. There were some 4 hour public parking spots that cost $2.

So it seems that for a basic trip to the CBD it’s more expensive to use a car than to take a tram when car expenses are tax deductible. For personal use a 5.7km journey would cost as much as a 2 hour ticket for public transport and a 11.4km journey would cost as much as a daily ticket. The fact that public transport is the economical way to travel for such short distances is quite surprising. In the past I had thought of using a tram ticket as an immediate cost while considering a short car drive as costing almost nothing (probably because the expense comes days later for petrol and years later for servicing the car).

Also while there is a lot of media attention recently about petrol prices, it seems that for me at least petrol is still less than half the marginal cost of running a car. Cars are being advertised on the basis of how little fuel they use to save money, but cars that require less service might actually save more money. There are many cars that use less fuel than a VW Passat, and also many cars that are less expensive to repair. It seems that perhaps the imported turbo-Diesel cars which are becoming popular due to their fuel use may actually be more expensive than locally manufactured small cars which have cheap parts.

Update: Changed “Km” to “km” as suggested by Lars Wirzenius.

3

Letter Frequency in Account Names

It’s a common practice when hosting email or web space for large numbers of users to group the accounts by the first letter. This is due to performance problems on some filesystems with large directories and due to the fact that often a 16bit signed integer is used for the hard link count so that it is impossible to have more than 32767 subdirectories.

I’ve just looked at a system I run (Bluebottle anti-spam email service [1]) which has about half a million accounts and counted the incidence of each first letter. It seems that S is the most common at almost 10% and M and A aren’t far behind. Most of the clients have English as their first language, naturally the distribution of letters would be different for other languages.

Now if you were to have a server with less than 300,000 accounts then you could probably split them based on the first letter. If there were more than 300,000 accounts then you would face the risk of having there be too many account names starting with S. See the table below for the incidences of all the first letters.

The two letter prefix MA comprised 3.01% of the accounts. So if faced with a limit of 32767 sub-directories then if you split by two letters then you might expect to have no problems until you approached 1,000,000 accounts. There were a number of other common two-letter prefixes which also had more than 1.5% of the total number of accounts.

Next I looked at the three character prefixes and found that MAR comprised 1.06% of all accounts. This indicates that splitting on the first three characters will only save you from the 32767 limit if you have 3,000,000 users or less.

Finally I observed that the four character prefix JOHN (which incidentally is my middle name) comprised 0.44% of the user base. That indicates that if you have more than 6,400,000 users then splitting them up among four character prefixes is not necessarily going to avoid the 32767 limit.

It seems to me that the benefits of splitting accounts by the first characters is not nearly as great as you might expect. Having directories for each combination of the first two letters is practical I’ve seen directory names such as J/O/JOHN or JO/JOHN (or use J/O/HN or JO/HN if you want to save directory space). But it becomes inconvenient to have J/O/H/N and the form JOH/N will have as many as 17,576 subdirectories for the first three letters which may be bad for performance.

This issue is only academic as far as most sys-admins won’t ever touch a system with more than a million users. But in terms of how you would provision so many users, in the past the limits of server hardware were approached long before these issues. For example in 2003 I was running some mail servers on 2RU rack mounted systems with four disks in a RAID-5 array (plus one hot-spare) – each server had approximately 200,000 mailboxes. The accounts were split based on the first two letters, but even if it had been split on only one letter it would probably have worked. Since then performance has improved in all aspects of hardware. Instead of a 2RU server having five 3.5″ disks it will have eight 2.5″ disks – and as a rule of thumb increasing the number of disks tends to increase performance. Also the CPU performance of servers has dramatically increased, instead of having two single-core 32bit CPUs in a 2RU server you will often have two quad-core 64bit CPUs – more than four times the CPU performance. 4RU machines can have 16 internal disks as well as four CPUs and therefore could probably serve mail for close to 1,000,000 users.

While for reliability it’s not the best idea to have all the data for 1,000,000 users on internal disks on a single server (which could be the topic of an entire series of blog posts), I am noting that it’s conceivable to do so and provide adequate performance. Also of course if you use one of the storage devices that supports redundant operation (exporting data over NFS, iSCSI, or Fiber Channel) then if things are configured correctly then you can achieve considerably more performance and therefore have a greater incentive to have the data for a larger number of users in one filesystem.

Hashing directory names is one possible way of alleviating these problems. But this would be a little inconvenient for sys-admin tasks as you would have to hash the account name to discover where it was stored. But I guess you could have a shell script or alias to do this.

Here is the list of frequency of first letters in account names:

First Letter Percentage
a 7.65
b 5.86
c 5.97
d 5.93
e 2.97
f 2.85
g 3.57
h 3.19
i 2.21
j 6.09
k 3.92
l 3.91
m 8.27
n 3.15
o 1.44
p 4.82
q 0.44
r 5.04
s 9.85
t 5.2
u 0.85
v 1.9
w 2.4
x 0.63
y 0.97
z 0.95