8

SE Linux in Debian

I have now got a Debian Xen domU running the strict SE Linux policy that can boot in enforcing mode. I expect that tomorrow I will have it working with full functionality and that I will be able to run another SE Linux Play Machine in the near future.

After getting the strict policy working I want to build a Debian kernel with CONFIG_AUDITSYSCALL and an audit package so that I can audit system calls that an application makes and also so that the auditd can collect the SE Linux log messages. Other people have talked about packaging audit for Debian, hopefully one of them will do it first and save me the effort, but it shouldn’t be too difficult to do if they don’t.

Then I need to investigate some options for training people about SE Linux. As I don’t currently have the bandwidth for serving large files I’m thinking of basing some SE Linux training on Xen images from the jailtime.org repository. My rough plan at the moment is to have people download Xen images, run through them while consulting a web page, and ask questions on an IRC channel. I’m not sure what the demand will be for this but some web pages teaching people about SE Linux will be a useful resource even if the IRC based training doesn’t work out.

Another thing I want to do is to get PolyInstantiated Directories working in Debian. The pam_namespace.so module needed for this is written for a more recent version of PAM, so I might just work on merging the Debian patches with the latest upstream PAM instead of back-porting the module to the ancient Debian PAM.

4

Lord of the Flies

Apparently they are planning a reality TV show that might be named “Kids Nation” where 40 children are allowed to develop their own society for 40 days. The claims are that there is “no parental supervision”, but of course with many TV cameras watching I’m sure that even the most stupid children can work out that there are reasons to behave relatively well. Also I’m sure that they will remove children from the show if it becomes necessary to avoid injury.

It will be interesting to see how well this develops. Maybe it could evolve into a new form of schooling, just leave children alone in a school with lots of security cameras watching them and let them study or not as they wish. It would be cheaper than running a regular high-school, and as there is little scope for providing a worse education than many (most?) high-schools currently provide it should be worth a try.

For a long time I have thought that the premise behind The Lord of the Flies was bogus. I believe that a major contributing factor towards much of the violence that is present in schools is the pointlessness of the entire system. There are very few students who are so stupid that they can’t realise that the education system is failing them. Whenever you put a large number of people in a confined space with nothing useful to do the results will be bad. If a group of students from a violent school were placed on an island without any supplies then I’m sure that they would soon realise that they need to cooperate to stay alive.

Update:
It seems that the Sudbury Valley school implements some ideas similar to mine. They also have a page of links to some other schools that do similar things.

priorities for heartbeat services

Currently I am considering the priority scheme to use for some highly available services running on Linux with Heartbeat.

The Heartbeat system has a number of factors that can be used to determine the weight for running a particular service on a given node. One is the connectivity to other systems determined by ping (every system that is pingable can add a value to the score), one is the number of failures (every failure deducts a value from the total score), one is the weight for staying on the same node (IE if the situation changes and the current node is not the ideal node you might not want to immediately move the service to a different node as that gives some seconds of no service), and one is the preference for each node that may run the service.

For a given node to run a particular service then the score has to be greater than all other nodes and also greater than zero. If all nodes have a score that is zero or less then the service will not run.

Now in the case of a service that repeatedly fails (EG a filesystem mount that relies on a hardware RAID which is not connected) then what should we do? One option is to have the score for running on a particular node be for example 100 times the value that is subtracted on failure. In this case after 100 failures on that node (and an appropriate number of failures on other nodes which are permitted to run the service) it will be disabled. Then the service has to be explicitly re-enabled (or a node rebooted) before it will run again.

The other option would be to have the value that is subtracted on failure be less than a billionth of the score for running on a particular node, so that the service will keep trying to start for the next few hundred years. The up-side of this is that there is less fiddling required, the down-side is that some CPU and disk resources will be kept active in repeatedly starting the service.

Now I have to decide which option to take in this regard, any comments would be appreciated.

1

career risks

Paul Graham makes some interesting observations about taking risks to achieve career benefits.

One thing he doesn’t mention is that the risks have to match your life situation. If you are 21, living with your parents, and single (typical for a CS graduate) then you should take the riskiest options in terms of your career (apart from working in Iraq of course). If you don’t have much money then you don’t have much to lose. If you live with your parents then you still have accommodation and food even if you have no money. If you have no dependents (SO or children) then there’s nothing compelling you to earn a certain income.

When you get older you may get a mortgage, a SO, and/or children. Also you won’t live with your parents forever. Most career risks that you might want to take aren’t possible if you leave them too late.

Finally if you do something risky such as starting your own company and it doesn’t work out then it’s still going to look good on your CV. If you already have a lot of experience in the industry then the CV improvement may not be worth the time and effort invested in an unsuccessful company.

When I was 22 I (along with two business partners) started an Internet cafe. It went reasonably well (by the standards of small businesses), it lasted for a few years before cheap net access at home killed most of the business. At the time the cafe had to close the ISP side of the business was doing reasonably well and one of my partners bought the operating ISP business. This buy-out caused me to approximately break even out of the entire business which is a lot better than most small businesses do. When I was 26 I moved to London (I have dual nationality, UK and Australian). The experience I had gained from running my own business allowed me to immediately get contract work for large ISPs in Europe.

Most of the risks in my career were ones that I took while living with my parents. At the time I didn’t think through the issues of mortgages etc, my thinking was mostly along the lines of “it could work, I’m bored, so why not?”. ;)

Update: While in the process of writing this blog post I forwarded the URL of a dating service for scientists (sciconnect.com) to some friends. The main page has pictures of single people wearing lab coats and using laptops which I found amusing. I have no idea whether it’s a good service or not, but the pictures on the main page made it worth a look. It seems that I accidentally pasted the wrong URL into my blog post so people who were looking for the Paul Graham article ended up at the dating service instead. But I guess if you are the type of person who reads my blog and who is interested in a link to Paul Graham’s blog and you happen to be single then a dating service for scientists might be of some interest.

Thanks to MJ Ray for pointing out my error.

3

permalinks in wordpress, Apache redirection, and other blog stuff

When I first put my new blog online I didn’t think to set the custom permalinks option to avoid having /index.php in all URLs (which wastes a few bytes and looks nasty).

So I decided to change to better URLs but unfortunately many people have already bookmarked the bad URLs. I wanted to give a HTTP 301 redirection when someone uses the old index.php version (so that bookmarks get updated) and then redirect to the PHP file. Unfortunately having a redirection from ^/index.php to a version without it and then a local rewrite to include index.php again doesn’t seem to work (any advice would be appreciated). So I put the following in my /etc/wordpress/htaccess file (the location for such things in Debian) so that foo.php is used instead where foo.php is a sym-link to index.php. I’m wondering whether I should file a bug report against the Debian package requesting that a sym-link be in the package to facilitate such things – if it’s not possible to do what I desire without the symlink.

RewriteEngine On
RewriteBase /
#RewriteCond %{REQUEST_URI} ^/index.php/?(.*$) [NC]
#RewriteRule . /%1 [R=301,L]
RewriteCond ^/robots.txt [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
#RewriteRule . /foo.php%1 [L]
RewriteRule . /index.php%1 [L]

Update: I am now using the permalink-redirect plugin (thanks for the tip Method) which solves the problem of the obsolete URLs as well as solving the problem of having two representations of the URL (with and without a trailing slash). I have updated the above htaccess file sample to reflect my new configuration (with the old settings commented out for the benefit of people who don’t want the permalink-redirect plugin).

The way WordPress allows the table prefix to be stored in the MySQL configuration section is very handy. Some time ago I asked for advice on a blog server for multiple users and WordPress-MU was recommended, but it seems that for most situations where you want multiple blogs the non-MU version of WordPress will do the job. It seems that the main benefit of WordPress-MU is that setting up multiple blogs doesn’t require running shell scripts, which for the cases I’m most interested in doesn’t compete with the benefit that the non-MU version has of being packaged in Debian.

On the topic of WordPress in Debian, it’s a pity that none of the plugins are packaged in Debian. I plan to create a repository for plugins and themes that I use if no-one else has started such a repository. I believe that a repository of Debian packages for such things will provide significant benefits to users, including updates for security reasons and having plugins that are known to work (some of the plugins appear to only work on Windows).

Also there are a few issues that I would like to improve in WordPress. One is that the Uncategorised category is selected by default so if I select another category and forget to de-select Uncategorised then it’s a little confusing. Another is that the categories are displayed in the side-bar without mentioning the number of matching posts. The way blogger lists the number of posts per category (and sorts the categories in order) is much more convenient. Also another advantage of blogger is the handling of archives where you can click on a month to see a list of the names of all posts in that month. I’m not about to go back, but it would be nice to have those features. Does anyone have any ideas how to solve these problems?

Update2:
I have added a rule to make robots.txt not redirect. Before adding this rule /robots.txt was redirected to /index.php/robots.txt which caused a WordPress page to load, this wasted a lot of bandwidth (robots.txt is hit often) and probably caused some spiders to ignore my site.

7

school rating

The web site http://au.ratemyteachers.com/ allows Australian students to rate their teachers. Ratings are anonymous and give teachers a score out of 5 as well as allowing students to comment on teachers.

The Sydney Morning Herald has an article about the site that describes the actions that the NSW Department of Education and the NSW Teachers Federation are taking to block the site.

The solution to this however is really quite simple. There needs to be a formal method for students to rate their teachers which will be used when it comes time to give pay rises to good teachers and dismiss or transfer to non-teaching duties the teachers who can’t do their job.

I encourage students to submit essays and debate topics about the anonymous news-papers published in the Soviet Union and other repressive states, why they were necessary (because criticism of the government was prohibited) and why they were morally right (a system with no method of correction will inevitably do bad things). Then teachers will have a choice of supporting the actions of the Soviet Union or the use of ratemyteacher.com, it will be interesting to see which option they choose. I think that it’s most likely that they will take the hypocritical path and support anonymous newspapers in the Soviet Union while attacking such free speech in supposedly free countries.

It’s interesting that an article on the failures of Mentone Grammar has just been published. Maybe if Mentone had been listed on the ratemyteachers.com site the Taylor’s would not have made the mistake of sending their son there. Or maybe if the Mentone senior staff had been reading that site they would have been able to correct the problems before they became cause for a legal dispute.

more about Heartbeat

In a comment on my blog post “a Heartbeat developer comments on my blog post” Alan Robertson writes:
I got in a hurry on my math because of the emergency. So, there are even more assumptions (errors?) than I documented.
In particular, the probability model I gave was for a particular node to fail. So the probability of either of two failing would be double that, and either of three failing would be triple that.
Note that the probability of multiple simultaneous failures goes up as a power, but the probability of either of only goes up linearly.
I really need to sit down and do the math carefully – but the idea of the simultaneous failures going up as a power is true. And the “any of” probability goes up linearly. That’s also true. This is why people can actually use larger HA clusters ;-).
The 5 years figure is the industry standard quoted figure for an average Intel-based server to fail.
The four hours to repair is a common high-quality of service response time from a hardware vendor. I admit that’s not the same as actual repair time, but if some “repairs” are just reboots, then it’s not a horrible number to start with – if your vendor has cached some spares nearby. I suppose I should sit down and do the math right, and make a spreadsheet of it. (I wonder if I remember that much math?)
I assume disk failures are taken care of by hot swap disks, RAID, etc. and so in effect they “never fail” (at least not totally) so that these failures don’t have to be accounted for by the overall availability model.
Here’s an intuitive way of thinking about it “from your gut”…
If I took your whole data center and made a cluster out of it, what’s the chance that at least half of your servers would fail at once?
Pretty darn small, is the short answer ;-). If it’s not pretty darn small, you need to buy better servers, and IBM has just the servers for you ;-). Or maybe they need to hire a better SysAdmin ;-)
If you ask yourself “when is the last time at least half my machines in my data center couldn’t communicate with the other half”, then hopefully that’s also a “pretty darn small” chance too. If not, there are well-known methods for making networks highly reliable too.
[I’m still ignoring “catastrophes” that you haven’t accounted for in your HA architecture].
I’m not saying this is free, and it can be pricey. One of my other favorite sayings is “Paranoia is an expensive hobby”. How much do you want to spend?
You tell me how much you want to spend, and you can figure out how to spend it.
I’ll make a separate comment on quorum models later. It’s getting late here.

My only comment in response to that is to say that I still believe the calculations of probability to be correct in my original post and I am interested to see someone prove otherwise.

Another comment by Alan:
Data corruption, no doubt, is almost always much worse than loss of availability. And some kinds of data corruption are worse than others. For example, mounting a non-clustered shared disk filesystem twice simultaneously is usually much worse, than updating two replicas of the data simultaneously. In the first case, you have to restore to your previous backups and lose all data since then. In the second case, you only lose updates that were made to one of the sides, and you instantly have a working copy of the data which is nearly always much newer than your last backup (with the possibility of recovering them by significant effort). Typically you would only lose a few minutes of updates at worst – and depending on the kind of networking failure, you might not lose anything.
Heartbeats certainly aren’t enough. You need to monitor the health of your servers and the health of your applications. Heartbeat monitors applications and can easily be informed of and act on the health of your servers (with release 2 style Linux-HA Heartbeat configurations).

Followed by:
Since data corruption is so serious, this is why cluster designers worry so much about split-brain, which is managed using the ideas of quorum and it’s sibling fencing.
This is all about keeping bad things from happening.
This post is really about quorum, since Russ had expressed interest in it.
Quorum is the idea that you can uniquely choose a subcluster to represent the whole cluster in those cases where communication failure has caused the cluster to split into separate sub-clusters which cannot properly communicate with each other. In this way, only one of the subclusters continues on, and the others will sit on their hands and do nothing waiting for a person to fix things.
Some of the kinds of quorum mentioned below are better than others. But, most importantly, they can be used in combination as described later.
The most common kind of quorum is that Russ mentioned in his earlier post – the majority quorum. In this method, for a cluster of n nodes, you grant quorum to a sub-cluster which has more than INT(n/2) members. This means that if you have a 3-node cluster, you have to have two nodes to continue. If you have 4 nodes, you have to have 3 nodes to continue. For 5 nodes, you have to have 3 nodes, and so on.
Other basic methods include disk reserve, so that you have reserve a disk to have quorum. In this case, if only one node survives and it can reserve the disk, it continues to run. However, the disk becomes a single point of failure. This may not be a problem if this single disk is required to run any of the cluster services, since they would fail without it anyway. [Heartbeat does not support this method].
An analagous method is to implement a software resource which grants quorum to one subcluster in a fashion analagous to the disk reserve method. This has the advantage of not requiring disk reserves, or a shared disk, but it has the same SPOF disadvantage as the disk reserve method. Heartbeat does support this method using the quorum daemon. It’s incredibly useful for those cases (like split-site clusters) where you cannot use fencing.
Another method is to grant quorum to any subcluster which can ping a certain set of nodes, and not grant it to any which can’t access those nodes. This isn’t a wonderful method, and has obvious disadvantages with respect to uniqueness, and single points of failure. (Heartbeat doesn’t yet implement this one).
Another method is to grant quorum to any node which is a member of a 2-node cluster. This is better than losing quorum and stopping when one node stops, but obviously completely ignores the uniqueness requirement of quorum.
Another method is to ask a human being if you have quorum. This is hardly an ideal circumstance, but useful in some contexts as described below. (Heartbeat doesn’t yet implement this one).
Perhaps you say, really the only one of these that’s really good is the first one – the majority vote method.
And, I would generally agree with you. But, Heartbeat has the ability to use these in combination which makes some of those methods that seem flaky to be much more reasonable.
Heartbeat has the ability to have multiple quorum modules declared, and they’re used in this way: Any module can return HAVEQUORUM, NOQUORUM, or TIE. If they return HAVEQUORUM or NOQUORUM, then no further quorum modules are consulted. However, if they return TIE, then the next quorum module is consulted for its opinion. If the last quorum module returns TIE, it is treated the same as NOQUORUM.
This enables you to use one quorum module to break the tie declared by a previous quorum module.
You could then use the quorumd to break the tie created by a voting module. Or you could use the quorumd instead of the “two-node” module. Or you could use the “pingable” module instead of the “two-node” module. Or you could at the end always tack on a “human” module, in case all else returns TIE.
This is kind of cool, actually. My favorites for next implementation are the pingable and consult human modules.
And, of course, if your cluster loses quorum due to real server failures failures, there are always ways to work around it, with a little human intervention. One method is to tell Heartbeat to ignore quorum. Another is to tell Heartbeat to remove certain nodes from the cluster, after you verify that they’re really dead. And, I’m sure that in a pinch, some new methods will be invented. And some of them might actually work ;-).

Regarding the quorumd, it seems that this is an extra server that will generally run on another machine separate from the rest of the cluster. So if we had a two-node cluster with a quorumd then it would effectively be a three-node cluster where one node is not configured to run any resources. It seems that the simpler approach in many cases would be to merely have a three-node cluster with resources not configured to run on one of the nodes.

For example if I was running a mail server cluster for an ISP I might configure a three node cluster of the two mail server back-end machines and one other machine that is lightly loaded (EG a DNS server) and have it configured not to run MTA resoures on the DNS machine.

first look at CentOS 5 Xen

I have just installed a machine running CentOS 5 as a Xen server. I installed a full GUI environment on the dom0 so that GUI tools can be used for managing the virtual servers.

The first problem I had was selecting the “Installation source”, it’s described in the error message as an “Invalid PV media address” when you get it wrong which caused me a little confusion when installing it at 10PM. Then I had a few problems getting the syntax of a nfs://1.2.3.4:/directory URL correct. But these were trivial annoyances. It was a little annoying that my attempts to use a “file://” URL were rejected, I had hoped that it would just run exportfs to make the NFS export from the local machine (much faster than using an NFS server over the network which is what the current setup will lead people to do).

The first true deficiency I found with the tools is that it provides no way of creating filesystems on block devices. The process of allocating a block device or file from the Xen configuration tool is merely assigning a virtual block device to the Xen image – and only one such virtual block device is permitted. Then the CentOS 5 installation instance that runs under Xen will have to partition the disk (it doesn’t support installing directly to an unpartitioned disk) which will make things painful when it comes time to resize the filesystems.

When running Debian Xen servers I do everything manually. A typical Debian Xen instance that I run will have a virtual block device /dev/hda for the root FS, /dev/hdb for swap, and /dev/hdc for /home. Then if I want to resize them I merely stop the Xen instance, run “e2fsck -f” on the filesystem followed by “resize2fs” and the LVM command “lvresize” (in the appropriate order depending on whether I am extending or reducing the filesystem).

Xen also supports creating a virtual partitioned disk. This means I could have /dev/lvm/xenroot, and /dev/lvm/xenswap, and /dev/lvm/xenhome appear in the domU as /dev/hda1, /dev/hda2, and /dev/hda3. This means that I could have a single virtual disk that allows the partitions to be independently resized when the domU in question is not running. I have not tried using this feature as it doesn’t suit my usage patterns. But it’s interesting and unfortunate that the GUI tools which are part of CentOS don’t support it.

When I finally got to run the install process it had a virtual graphics environment (which is good) but unfortunately it suffered badly from the two-mouse-cursor problem with different accellerations used for both cursors so the difference in position of the two cursors varied in different parts of the screen. This was rather surprising as the dom0 had a default GNOME install.

the Right to Fork

Leon Brooks blogged about the Right to Fork (an essential right for free software development) but notes that governments of countries don’t permit such a right.

One of the criteria for the existence of a state is the ability to control it’s own territory. Lose control of the territory and you lose the state, lose some of the territory and the state is diminished. Therefore preventing a division of the territory (a split after a civil war) is the primary purpose of a state. The other criteria of a state are the ability to tax the population, impose civil order, and to administer all other aspects of government. All of these operations are essential to the government and lead to the destruction of the state if they are lost.

It’s not that governments want to prevent forking, it’s the fact that the existence of the state (on which the existence of the government depends) demands that it be prevented in all but the most extreme situations.

With free software forking is not a problem as multiple groups can work on similar software without interference. If someone else works on a slightly different version of your program then the worst that they can do is to get the interest of more developers than you get. This competition for developers leads to better code!

With proprietary software the desire to prevent forking is due to the tiny marginal cost of software. Most of the costs of running a software company are in the development. The amount of work involved in development does not vary much as the user-base is increased. So doubling the number of sales can always be expected to significantly more than double the company’s profit.

One thing that would benefit the computer industry would be to have all the source to proprietary programs put in escrow and then released freely after some amount of time or some number of versions have been released. If Windows NT 4.0 was released freely today it would not take many sales from the more recent versions of Windows. But it would provide significant benefits for people who want to emulate older systems and preserve data. I expect that current versions of MS-Office wouldn’t properly read files created on NT 4.0, I’m sure that this is a problem for some people and will become more of a problem as new machines that are currently being designed are not capable of booting such old versions of Windows.

2 node vs 3+ node clusters

A comment on my post about the failure probability of clusters suggested that a six node cluster that has one node fail should become a five node cluster.

The problem with this is what to do when nodes recover from a failure. For example if a six node cluster had a node fail and became a five node cluster, then became a three node cluster after another two nodes had failed, then you would have half the cluster that was disconnected. If the three nodes that appeared to have failed became active again but unable to see the other three nodes then you would have a split-brain situation.

As noted in the comment the special case of a two node cluster does have different failure situations. If the connection between nodes goes down and the router can still be pinged then you can have a split brain situation. To avoid this you will generally have a direct connection between the two nodes (either a null-modem cable or a crossover Ethernet cable), such cables are more reliable than networking which involves a switch or hub. Also the network interface which involves the router in question will ideally also be used as a method of maintaining cluster status – it seems unlikely that two nodes will both be able to ping the router but be unable to send data to each other.

For best reliability you need to use multiple network interfaces between cluster nodes. One way of doing this is to have a pair of Ethernet ports bonded for providing the service (connected to two switches and pinging a router to determine which switch is best to use). The Heartbeat software supports encrypted data so it should be safe to run it on the same interface as used for providing the service (of course if you provide a service to the public Internet then you want a firewall to prevent machines on the net from trying to attack it).

Heartbeat also supports using multiple interfaces for maintaining the cluster data, so you can have one network dedicated to cluster operations and the network that is used for providing the service can be a backup network for cluster data. The pingd service allows Heartbeat to place services on nodes that have good connectivity to the net. So you could have multiple nodes that each have one Ethernet port for providing the service and one port as a backup for Heartbeat operations, if pingd indicates that the service port was not functioning correctly then the services would be moved to other nodes.

If you want to avoid having private Heartbeat data going over the service interface then in the two-node case you need a minimum of two Ethernet ports for Heartbeat and one port for providing the service if you use pingd. If you don’t use pingd then you need two bonded ports for providing the service and two ports (either bonded or independently configured in Hertbeat) for Heartbeat giving a total of four ports.

When there are more than two nodes in the cluster the criteria for cluster membership is that a majority of nodes are connected. This makes split-brain impossible and reduces the need to have reliable Ethernet interfaces. A cluster with three or more nodes could have a single service port and a single private port for Heartbeat, or if you trust the service interface you could do it all on one Ethernet port.

In summary, three nodes is better than two, but requires more hardware. Five nodes is better than three, but as I wrote in my previous post four nodes is not much good. I recommend against any even number of nodes other than two for the same reason.