first look at CentOS 5 Xen

I have just installed a machine running CentOS 5 as a Xen server. I installed a full GUI environment on the dom0 so that GUI tools can be used for managing the virtual servers.

The first problem I had was selecting the “Installation source”. When you get it wrong the error message describes it as an “Invalid PV media address”, which caused me a little confusion when installing at 10PM. Then I had a few problems getting the syntax of a nfs://1.2.3.4:/directory URL correct. But these were trivial annoyances. It was a little annoying that my attempts to use a “file://” URL were rejected; I had hoped that it would just run exportfs to make the NFS export from the local machine (much faster than using an NFS server over the network, which is what the current setup will lead people to do).
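
For reference, a rough sketch of doing that export manually (the directory, ISO name, and addresses are assumptions, and an NFS server must already be running on the dom0):

mkdir -p /srv/centos5
mount -o loop CentOS-5-i386-bin-DVD.iso /srv/centos5    # or copy the install tree there
exportfs -o ro 192.168.1.0/24:/srv/centos5              # export it to the local network
# then enter nfs://192.168.1.1:/srv/centos5 as the installation source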

The first real deficiency I found is that the tools provide no way of creating filesystems on block devices. Allocating a block device or file from the Xen configuration tool merely assigns a virtual block device to the Xen image – and only one such virtual block device is permitted. The CentOS 5 installation instance that runs under Xen then has to partition the disk (it doesn’t support installing directly to an unpartitioned disk), which will make things painful when it comes time to resize the filesystems.

When running Debian Xen servers I do everything manually. A typical Debian Xen instance that I run will have a virtual block device /dev/hda for the root FS, /dev/hdb for swap, and /dev/hdc for /home. Then if I want to resize them I merely stop the Xen instance, run “e2fsck -f” on the filesystem followed by “resize2fs” and the LVM command “lvresize” (in the appropriate order depending on whether I am extending or reducing the filesystem).
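
For example, to resize the /home filesystem of such a domU (a sketch assuming the domU is stopped and that its /home is on an LV named /dev/lvm/xenhome; the sizes are arbitrary):

# grow /home by 2G: extend the LV first, then the filesystem
lvresize -L +2G /dev/lvm/xenhome
e2fsck -f /dev/lvm/xenhome
resize2fs /dev/lvm/xenhome

# shrink /home to 4G: shrink the filesystem first, then the LV
e2fsck -f /dev/lvm/xenhome
resize2fs /dev/lvm/xenhome 4G
lvresize -L 4G /dev/lvm/xenhome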

Xen also supports creating a virtual partitioned disk. This means that I could have /dev/lvm/xenroot, /dev/lvm/xenswap, and /dev/lvm/xenhome appear in the domU as /dev/hda1, /dev/hda2, and /dev/hda3 – a single virtual disk whose partitions can be independently resized when the domU in question is not running. I have not tried this feature as it doesn’t suit my usage patterns, but it’s interesting and unfortunate that the GUI tools which are part of CentOS don’t support it.
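
A sketch of the disk line of a domU configuration file for such a layout (device names as above, write access assumed):

disk = [ 'phy:/dev/lvm/xenroot,hda1,w',
         'phy:/dev/lvm/xenswap,hda2,w',
         'phy:/dev/lvm/xenhome,hda3,w' ]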

When I finally got to run the install process it had a virtual graphics environment (which is good), but unfortunately it suffered badly from the two-mouse-cursor problem: different accelerations were used for the two cursors, so the difference in position between them varied in different parts of the screen. This was rather surprising as the dom0 had a default GNOME install.

lemonup and blog license

I have just updated my previous post about licenses and also explicitly licensed my blog. Previously I had used a Creative-Commons share-alike license for lecture notes to allow commercial use, and had not specified a license for my blog apart from it being free for feeds (you may add it to a planet without seeking permission first).

Unfortunately the operators of a site named lemonup.com decided to mirror many of my blog posts with Google AdWords. The site provides no benefit to users that I can discover and merely takes away AdWords revenue from my site. It has no listed method of contacting the site owner so it seems that blogging about this and letting them read it on their own site is the only way of doing so. :-#

I’m happy for Technorati to mirror my site as they provide significant benefits to users and to me personally. I am also happy for planet installations that include my blog among others to have a Google advert on the page (in which case it’s a Google advert for the entire planet not for my blog post).

Also at this time I permit sites to mirror extracts of my articles. So for example the porn blogs that post paragraphs of my posts about topics such as “meeting people” with links to my posts don’t bother me. I’m sure that someone who is searching for porn will not be happy to get links to posts about Debian release parties etc – but that’s their QA issue not a license issue. I am aware that in some jurisdictions I can not prevent people from using extracts of my posts – but I permit this even in jurisdictions where such use is not mandated by law.

Lemonup: you may post short extracts (10% or one paragraph) of my posts with links to the original posts, or you may mirror my posts with no advertising at all. If those options are not of interest to you then please remove all content I wrote from your site.

the Right to Fork

Leon Brooks blogged about the Right to Fork (an essential right for free software development) and noted that governments don’t permit such a right for countries.

One of the criteria for the existence of a state is the ability to control its own territory. Lose control of the territory and you lose the state; lose some of the territory and the state is diminished. Therefore preventing a division of the territory (a split after a civil war) is the primary purpose of a state. The other criteria of a state are the ability to tax the population, impose civil order, and administer all other aspects of government. All of these operations are essential to the government and losing them leads to the destruction of the state.

It’s not that governments want to prevent forking, it’s the fact that the existence of the state (on which the existence of the government depends) demands that it be prevented in all but the most extreme situations.

With free software forking is not a problem as multiple groups can work on similar software without interference. If someone else works on a slightly different version of your program then the worst that they can do is to get the interest of more developers than you get. This competition for developers leads to better code!

With proprietary software the desire to prevent forking is due to the tiny marginal cost of software. Most of the costs of running a software company are in development, and the amount of development work does not vary much as the user base increases. So doubling the number of sales can be expected to significantly more than double the company’s profit.

One thing that would benefit the computer industry would be to have all the source to proprietary programs put in escrow and then released freely after some amount of time or some number of versions have been released. If Windows NT 4.0 were released freely today it would not take many sales away from the more recent versions of Windows, but it would provide significant benefits for people who want to emulate older systems and preserve data. I expect that current versions of MS-Office wouldn’t properly read files created on NT 4.0; I’m sure that this is a problem for some people and it will become more of a problem as new machines that are currently being designed are not capable of booting such old versions of Windows.

praying for rain

Paul Dwerryhouse posted a comment about the Prime Minister asking people to pray for rain. I don’t think that Johnny is suggesting this because he’s overly religious (compare his actions with the New Testament of the Bible). The fact is that the Australian government has no plans to deal with global warming, the inefficient distribution of water, or the large commercial farms that produce water-inefficient crops such as rice and cotton in areas that have limited amounts of water. This means that small farmers should pray – no-one else will help them!

I wonder if the farmers will ever work out that the National party is doing absolutely nothing for them through its alliance with the Liberal party. Maybe if farmers could actually get a political party that represents their interests then things would change.

a Heartbeat developer comments on my blog post

Alan Robertson (a major contributor to the Heartbeat project) commented on my post failure probability and clusters. His comment deserves wider readership than a comment generally gets so I’m making a post out of it. Here it is:

One of my favorite phrases is “complexity is the enemy of reliability”. This is absolutely true, but not a complete picture, because you don’t actually care much about reliability – you care about availability.
Complexity (which reduces MTBF) is only worth it if you can use it to drastically cut MTTR – which in turn raises availability significantly. If your MTTR was 0, then you wouldn’t care if you ever had a failure. Of course, it’s never zero.
But, with normal clustering software, you can significantly improve your availability, AND your maintainability.
Your post makes some assumptions which are more than a little simplistic. To be fair, the real mathematics of this are pretty darn complicated.
First I agree that there are FAR more 2-node clusters than larger clusters. But, I think for a different reason. People understand 2-node clusters. I’m not saying this isn’t important, it is important. But, it’s not related to reliability.
Second, you assume a particular model of quorum, and there are many. It is true that your model is the most common, but it’s hardly the only one – not even for heartbeat (and there are others we want to implement).
Third, if you have redundant networking and multiple power sources, as you should, then system failures become much less correlated. The normal model which is used is completely uncorrelated failures.
This is obviously an oversimplification as well, but if you have redundant power supplies supplied from redundant power feeds, and redundant networking etc. it’s not a bad approximation.
So, if you have an MTTR of 4 hours to repair broken hardware, what you care about is the probability of having additional failures during those four hours.
If your HA software can recover from an error in 60 seconds, then that’s your effective MTTR as seen by (a subset) of users. Some won’t see it at all. And, of course, that should also go into your computation. This depends on knowing a lot about what kind of protocol is involved, and what the probability of various lengths of failures is to be visible to various kinds of users. And, of course, no one really knows that either in practice.
If you have a hardware failure approximately every 5 years, and a hardware repair MTTR of 4 hours, then the probability of a second failure during that time is about .009%. The probability of two failures occurring during that time is about 8*10^-7% – which is a pretty small number.
Probabilities for higher order failures are proportionately smaller.
But, of course, like any calculation, the probabilities of this are calculated using a number of simplifying assumptions.
It assumes, for example, that the probabilities of correlated failures are small. For example, the probability of a flood taking out all the servers, or some other disaster is ignored.
You can add complexity to solve those problems too ;-), but at some point the managerial difficulties (complexity) overwhelms you and you say (regardless of the numbers) that you don’t want to go there.
Managerial complexity is minimized by uniformity in the configuration. So, if all your nodes can run any service, that’s good. If they’re asymmetric, and very wildly so, that’s bad.
I have to go now, I had a family emergency come up while I was writing this. Later…

End quote.

It’s interesting to note that there are other models of quorum; I’ll have to investigate that. Most places I have worked have had an MTTR that is significantly greater than four hours. But if you have hot-swap hard drives (so drive failure isn’t a serious problem) then an average of one failure per five years per machine should be possible.
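
As a rough check of the numbers in the comment, here is a sketch using the same assumptions (one failure per five years per machine and a four hour repair window):

MTBF=43800   # hours in roughly five years
MTTR=4       # hours to repair the failed node
echo "p=$MTTR/$MTBF; p*100; p*p*100" | bc -l   # ~.009% for one additional failure, ~8*10^-7% for two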

2 node vs 3+ node clusters

A comment on my post about the failure probability of clusters suggested that a six node cluster that has one node fail should become a five node cluster.

The problem with this is what to do when nodes recover from a failure. For example, if a six node cluster had a node fail and became a five node cluster, and then became a three node cluster after another two nodes failed, half of the original cluster would be disconnected. If the three nodes that appeared to have failed became active again but were unable to see the other three nodes then you would have a split-brain situation.

As noted in the comment, the special case of a two node cluster does have different failure situations. If the connection between the nodes goes down while the router can still be pinged then you can have a split-brain situation. To avoid this you will generally have a direct connection between the two nodes (either a null-modem cable or a crossover Ethernet cable); such cables are more reliable than networking that involves a switch or hub. Also the network interface that involves the router in question will ideally also be used as a method of maintaining cluster status – it seems unlikely that two nodes will both be able to ping the router but be unable to send data to each other.

For best reliability you need to use multiple network interfaces between cluster nodes. One way of doing this is to have a pair of Ethernet ports bonded for providing the service (connected to two switches and pinging a router to determine which switch is best to use). The Heartbeat software supports encrypted data so it should be safe to run it on the same interface as used for providing the service (of course if you provide a service to the public Internet then you want a firewall to prevent machines on the net from trying to attack it).

Heartbeat also supports using multiple interfaces for maintaining the cluster data, so you can have one network dedicated to cluster operations and the network that is used for providing the service can be a backup network for cluster data. The pingd service allows Heartbeat to place services on nodes that have good connectivity to the net. So you could have multiple nodes that each have one Ethernet port for providing the service and one port as a backup for Heartbeat operations, if pingd indicates that the service port was not functioning correctly then the services would be moved to other nodes.
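
A rough sketch of the relevant ha.cf entries for such a setup (the interface names, addresses, and pingd invocation are assumptions that will vary between installations):

bcast eth1                       # dedicated network for cluster communication
ucast eth0 10.0.0.2              # service network as a backup communication path
ping 10.0.0.254                  # router used to judge external connectivity
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s

The pingd attribute can then be referenced by location constraints in the CRM configuration to move services away from nodes with poor connectivity.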

If you want to avoid having private Heartbeat data going over the service interface then in the two-node case you need a minimum of two Ethernet ports for Heartbeat and one port for providing the service if you use pingd. If you don’t use pingd then you need two bonded ports for providing the service and two ports (either bonded or independently configured in Heartbeat) for Heartbeat, giving a total of four ports.

When there are more than two nodes in the cluster the criterion for cluster membership is that a majority of nodes are connected. This makes split-brain impossible and reduces the need for highly reliable Ethernet interfaces. A cluster with three or more nodes could have a single service port and a single private port for Heartbeat, or if you trust the service interface you could do it all on one Ethernet port.

In summary, three nodes is better than two, but requires more hardware. Five nodes is better than three, but as I wrote in my previous post four nodes is not much good. I recommend against any even number of nodes other than two for the same reason.

failure probability and clusters

When running a high-availability cluster of two nodes it will generally be configured such that if one node fails then the other runs. Some common operation (such as accessing a shared storage device or pinging a router) will be used by the surviving node to determine that the other node is dead and that it’s not merely a networking problem. Therefore if you lose one node then the system keeps operating until you lose another.

When you run a three-node cluster the general configuration is that a majority of nodes is required. So if the cluster is partitioned then one node on its own will shut down all services while two nodes that can talk to each other will continue operating as normal. This means that to lose the cluster you need to lose all inter-node communication or have two nodes fail.

If the probability of a node surviving for the time interval required to repair a node that’s already died is N (where N is a number between 0 and 1 – 1 means 100% chance of success and 0 means it is certain to fail) then for a two node cluster the probability of the second node surviving long enough for a dead node to be fixed is N. For a three node cluster the probability that both the surviving two nodes will survive is N^2. This is significantly less, therefore a three node cluster is more likely to experience a critical second failure than a two node cluster.

For a four node cluster you need three active nodes to have quorum. Therefore, after one failure, the probability that none of the three remaining nodes fails is N^3 – even worse again!
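
As a quick comparison, assuming N=0.9 and calculating with bc as in the script below:

N=0.9; echo "$N; $N^2; $N^3" | bc -l    # 2, 3, and 4 node clusters after one failure: .9, .81, .729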

For a five node cluster you can lose two nodes without losing the cluster. If you have already lost a node, the probability that you won’t lose another two is N^4+(1-N)*N^3*4. As long as N is greater than 0.8 the probability of keeping at least three nodes out of four is greater than the probability of a single node not failing.

To see the probabilities of four and five node clusters experiencing a catastrophic failure after one node has died, run the following shell script for different values of N (0.9 and 0.99 are reasonable values to try). You might hope that the probability of a second node remaining online while the first node is being repaired is significantly higher than 0.9; however, when you consider that the first node’s failure might have been partially caused by the ambient temperature, power supply problems, vibration, or other factors that affect multiple nodes, I don’t think it’s impossible for the probability to be as low as 0.9.

N=0.9   # substitute other values of N such as 0.99
echo "$N^4+(1-$N)*$N^3*4" | bc -l ; echo "$N^3" | bc -l

So it seems that if reliability is your aim in having a cluster then your options are two nodes (if you can be certain of avoiding split-brain) or five nodes. Six nodes is not a good option as the probability of losing three nodes out of six is greater than the probability of losing three nodes out of five. Seven and nine node clusters would also be reasonable options.

But it’s not surprising that a Google search for “five node” cluster high-availability gives about 1/10 as many results as a search for “four node” cluster high-availability. Most people in the computer industry like powers of two more than they like maths.

Debian/Etch release party in Melbourne – Australia

We are having a release party on Saturday the 14th of April. We meet at mid-day under the clocks at Flinders Street Station and then go somewhere convenient and not too expensive for lunch.

All welcome.

Update:

The event was moderately successful. There were only six people including me – that was quite a bit smaller than the Debian 10th birthday party we had in Melbourne, but it was still enough to have fun.

Everyone there had a good knowledge of Linux and Debian and many interesting things were discussed. We had lunch at a Japanese stone-grill restaurant – their specialty is serving raw ingredients along with a stone that’s at 400C (or so they claim – I would expect a 400C stone to radiate more heat than I experienced on my previous visit). As it was a warm day we skipped the stone grill and ordered from the lunch menu (which was also a lot cheaper). Some of the guys had never tried Sake or Plum Wine before, they seemed to like it. Strangely the waitress always wanted to deliver alcohol to a 15yo in preference to almost anyone else.

One of the topics of discussion was Linux meetings and the ability to attend them. A point was made that if you are <18yo and rely on your parents’ permission to do things then a meeting that finishes at 9PM isn’t a viable option. It has previously been noted that for people from regional areas an evening meeting is also inconvenient.

Maybe we should have occasional LUG meetings on a Saturday afternoon to cater for the needs of such people?

Spooks and GConf

Jeff Waugh wrote an amusing post about SE Linux and GConf support. It’s good to see SE Linux being promoted to the GNOME community.

presentations about SE Linux

I have just read the Presentation Zen blog post about PowerPoint.

One of the interesting suggestions was that it’s not effective to present the same information twice, so your slides shouldn’t have notes covering what you say. Having a diagram that gives the same information is effective though, because it gives a different way of analyzing the data. I looked at a couple of sets of slides that I have written and noticed that the ratios of text slides to diagram slides were 6:1 and 3:1 in favor of text, and that wasn’t counting the first and last slides which have the title of the talk and a set of URLs respectively.

So it seems that I need more and better diagrams. I’ll include most of the diagrams I use in my current SE Linux talks in this post with some ideas on how to improve them. I would appreciate any suggestions that may be offered (either through blog comments or email).

The above diagram shows how the SE Linux identity limits the roles that may be selected, and how the role limits the domains that may be entered. Therefore the identity controls what the user may do and in this example the identity “root” means that the user has little access to the machine (a Play Machine configuration). I think that the above is reasonably effective and have been using it for a few years. I have considered a more complex diagram with the “staff_r” role included as well and possibly including the way that “newrole” can be used to change between roles. So I could have the above as slide #1 about identities and roles with a more detailed diagram following to replace a page of text about role transition.
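
A hedged command-line illustration of the identity/role/domain relationship (the context and role names below are typical examples, not taken from the diagram):

id -Z                  # shows identity:role:domain, e.g. root:staff_r:staff_t
newrole -r sysadm_r    # change role – only permitted if the identity is allowed the sysadm_r role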

The above diagram shows the domain transitions used in a typical system boot and login process. It includes the names of the types and a summary of the relevant policy rules used to implement the transitions. I also have another diagram that I have used which is the same but without the file types and policy. In the past I have never used both in the one talk – just used one of the two and had text to describe the information content of the other. To make greater use of diagrams I could start with the simple diagram and then have the following slide have all the detail.

The above diagram simply displays the MCS security model with ellipses representing processes and rectangles representing files.

The above diagram shows a simplified version of the MMCS policy. With MMCS each process has a range, with the low level representing the minimum category set of files to which it is permitted to write and the high level representing the files that it may read and write. So to write to a file with the “HR” category the process must have a low level that’s no higher than “HR” and a high level that is equal to or greater than “HR”. The full set of combinations of two categories with low and high levels means 10 different levels of access for processes, which makes for a complex diagram. I need something other than plain text for this but the above diagram is overly complex and a full set is even more so. Maybe a table with process contexts on one axis, file contexts on another, and access granted being one of “R”, “RW” or nothing?

I also have an MLS diagram in the same manner, but I now think it’s too awful to put on my blog. Any suggestions on how to effectively design a diagram for MLS? For those of you who don’t know how MLS works, the basic concept is that every process has an “Effective Clearance” (AKA low level) which determines what it can write; it can’t write to anything below that level because it might have read data from a file at its own level, and it can’t read from a level higher than its own level. MLS also uses a high level for ranged processes and filesystem objects (but that’s when it gets really complex).

This last one is what I consider my most effective diagram. It shows the benefits of SE Linux in confining daemons in a clear and effective manner. Any suggestions for improvement (apart from fixing the varying text size which is due to a bug in Dia) would be appreciated.

The above diagrams are all on my SE Linux talks page, along with the Dia files that were used to create them. They may be used freely for non-commercial purposes.

If anyone has some SE Linux diagrams that they would like to share then please let me know, either through a blog comment, email, or a blog post syndicated on Planet SE Linux.