Linux, politics, and other interesting things


  • Category Archives Ha
  • Some Notes on DRBD

    DRBD is a system for replicating a block device across multiple systems. It’s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it’s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this.

    I’m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I’m not using DRBD because the failure conditions that I’m planning for don’t include two disks entirely failing. I’m planning for a system having an outage for a while so it’s OK to have some inbound and outbound mail delayed but it’s not OK for the mail store to be unavailable.

    Global changes I’ve made in /etc/drbd.d/global_common.conf

    In the common section I changed the protocol from “C” to “B“, this means that a write() system call returns after data is committed locally and sent to the other node. This means that if the primary node goes permanently offline AND if the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost then writes can be lost. I don’t think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete.

    In the net section I added the following:

    sndbuf-size 512k;
    data-integrity-alg sha1;

    This uses a larger network sending buffer (apparently good for fast local networks – although I’d have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and to use sha1 hashes on all packets (why does it default to no data integrity).

    Reserved Numbers

    The default port number that is used is 7789. I think it’s best to use ports below 1024 for system services so I’ve setup some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I’ll use different ports even if they aren’t configured to ever run on the same system. You never know when the cluster assignment will change and DRBD port numbers seems like something that could potentially cause real problems if there was a port conflict.

    Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU, there is some degree of protection against a DomU using a block device that is already being used in the Dom0 but no protection against the Dom0 messing with the DomU’s resources.

    It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device.

    As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn’t be mounted locally and which mount options are permitted – it MIGHT be OK to do a read-only mount of a DomU’s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.

    Initial Synchronisation

    After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don’t then it will probably refuse to accept one copy of the data as primary as it can’t seem to realise that both are inconsistent. I can’t understand why it does this in the case where there are two nodes with inconsistent data, you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead there

    The solution sometimes seems to be to run “drbdsetup /dev/drbd0 primary -” (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running “drbdadm -- --overwrite-data-of-peer primary db0-mysql” (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res).

    Also it seems that some commands can only be run from one node. So if you have a primary node that’s in service and another node in Secondary/Unknown state (IE disconnected) with data state Inconsistent/DUnknown then while you would expect to be able to connect from the secondary node is appears that nothing other than a “drbdadm connect” command run from the primary node will get things going.


  • Hetzner Failover Konfiguration

    The Wiki documenting how to configure IP failover for Hetzner servers [1] is closely tied to the Linux HA project [2]. This is OK if you want a Heartbeat cluster, but if you want manual failover or an automatic failover from some other form of script then it’s not useful. So I’ll provide the simplest possible documentation.

    Below is a sample of shell code to get the current failover settings and change them to point the IP address to a different server. In my tests this takes between 19 and 20 seconds to complete, when the command completes the new server will be active and no IP packets will be lost – but TCP connections will be broken if the servers don’t support shared TCP state.

    # username and password for the Hetzner robot
    USERPASS=USER:PASS
    # public IP
    IP=10.1.2.3
    # new active server
    ACTIVE=10.2.3.4
    # get current values
    curl -s -u $USERPASS https://robot-ws.your-server.de/failover.yaml/$IP
    # change active server
    curl -s -u $USERPASS https://robot-ws.your-server.de/failover.yaml/$IP -d active_server_ip=$ACTIVE

    Below is the output of the above commands showing the old state and the new state.

    failover:
    ip: 10.1.2.3
    netmask: 255.255.255.255
    server_ip: 10.2.3.3
    active_server_ip: 10.2.3.4
    failover:
    ip: 10.1.2.3
    netmask: 255.255.255.255
    server_ip: 10.2.3.4
    active_server_ip: 10.2.3.4


  • Why Clusters Usually Don’t Work

    Posted on by etbe

    It’s widely regarded that to solve reliability problems you can just install a cluster. It’s quite obvious that if instead of having one system of a particular type you have multiple systems of that type and a cluster configured such that broken systems aren’t used then reliability will increase. Also in the case of routine maintenance a cluster configuration can allow one system to be maintained in a serious way (EG being rebooted for a kernel or BIOS upgrade) without interrupting service (apart from a very brief interruption that may be needed for resource failover). But there are some significant obstacles in the path of getting a good cluster going.

    Buying Suitable Hardware

    If you only have a single server that is doing something important and you have some budget for doing things properly then you really must do everything possible to keep it going. You need RAID storage with hot-swap disks, hot-swap redundant PSUs, and redundant ethernet cables bonded together. But if you have redundant servers then the requirement for making one server reliable is slightly reduced.

    Hardware is getting cheaper all the time, a Dell R300 1RU server configured with redundant hot-plug PSUs, two 250G hot-plug SATA disks in a RAID-1 array, 2G of RAM, and a dual-core Xeon Pro E3113 3.0GHz CPU apparently costs just under $2,800AU (when using Google Chrome I couldn’t add some necessary jumper cables to the list so I couldn’t determine the exact price). So a cluster of two of them would cost about $5,600 just for the servers. But a Dell R200 1RU server with no redundant PSUs, a single 250G SATA disk, 2G of RAM, and a Core 2 Duo E7400 2.8GHz CPU costs only $1,048.99AU. So if a low end server is required then you could buy two R200 servers that have no redundancy built in which cost less than a single server that has hardware RAID and redundant PSUs. Those two servers have different sets of CPU options and probably other differences in the technical specs, but for many applications they will probably both provide more than adequate performance.

    Using a server that doesn’t even have RAID is a bad idea, a minimal RAID configuration is a software RAID-1 array which only requires an extra disk per server. That takes the price of a Dell R200 to $1,203. So it seems that two low-end 1RU servers from Dell that have minimal redundancy features will be cheaper than a single 1RU server that has the full set of features. If you want to serve static content then that’s all you need, and a cluster can save you money on hardware! Of course we can debate whether any cluster node should be missing redundant hot-plug PSUs and disks. But that’s not an issue I want to address in this post.

    Also serving static content is the simplest form of cluster, if you have a cluster for running a database server then you will need a dual-attached RAID array which will make things start to get expensive (or software for replicating the data over the network which is difficult to configure and may be expensive), so while a trivial cluster may not cost any extra money a real-world cluster deployment is likely to add significant expense.

    My observation is that most people who implement clusters tend to have problems getting budget for decent hardware. When you have redundancy via the cluster you can tolerate slightly less expected uptime from the individual servers. While we can debate about whether a cluster member should have redundant PSUs and other expensive features it does seem that using a cheap desktop system as a cluster node is a bad idea. Unfortunately some managers think that a cluster solves the reliability problem and therefore you can just use recycled desktop systems as cluster nodes, this doesn’t give a good result.

    Even if it is agreed that server class hardware is used for all servers so features such as ECC RAM are used you will still have problems if someone decides to use different hardware specs for each of the cluster nodes.

    Testing a Cluster

    Testing a non-clustered server or some servers that use a load-balancing device at the front-end isn’t that difficult in concept. Sure you have lots of use cases and exception conditions to test, but they are all mostly straight-through tests. With a cluster you need to test node failover at unexpected times. When a node is regarded as having an inconsistent state (which can mean that one service it runs could not be cleanly shutdown when it was due to be migrated) it will need to be rebooted which is sometimes known as a STONITH. A STONITH event usually involves something like IPMI to cut the power or a command such as “reboot -nf“, this loses cached data and can cause serious problems for any application which doesn’t call fsync() as often as it should. It seems likely that the vast majority of sysadmins run programs which don’t call fsync() often enough, but the probability of losing data is low and the probability of losing data in a way that you will notice (IE it doesn’t get automatically regenerated) is even lower. The low probability of data loss due to race conditions combined with the fact that a server with a UPS and redundant PSUs doesn’t unexpectedly halt that often means that problems don’t get found easily. But when clusters have problems and start calling STONITH the probability starts increasing.

    Getting cluster software to work in a correct manner isn’t easy. I filed Debian bug #430958 about dpkg (the Debian package manager) not calling fsync() and thus having the potential to leave systems in an inconsistent or unusable state if a STONITH happened at the wrong time. I was inspired to find this problem after finding the same problem with RPM on a SUSE system. The result of applying a patch to call fsync() on every file was bug report #578635 about the performance of doing so, the eventual solution was to call sync() after each package is installed. Next time I do any cluster work on Debian I will have to test whether the sync() code seems to work as desired.

    Getting software to work in a cluster requires that not only bugs in system software such as dpkg be fixed, but also bugs in 3rd party applications and in-house code. Please someone write a comment claiming that their favorite OS has no such bugs and the commercial and in-house software they use is also bug-free – I could do with a cheap laugh.

    For the most expensive cluster I have ever installed (worth about 4,000,000 UK pounds – back when the pound was worth something) I was not allowed to power-cycle the servers. Apparently the servers were too valuable to be rebooted in that way, so if they did happen to have any defective hardware or buggy software that would do something undesirable after a power problem it would become apparent in production rather than being a basic warranty or patching issue before the system went live.

    I have heard many people argue that if you install a reasonably common OS on a server from a reputable company and run reasonably common server software then the combination would have been tested before and therefore almost no testing is required. I think that some testing is always required (and I always seem to find some bugs when I do such tests), but I seem to be in a minority on this issue as less testing saves money – unless of course something breaks. It seems that the need for testing systems before going live is much greater for clusters, but most managers don’t allocate budget and other resources for this.

    Finally there is the issue of testing issues related to custom code and the user experience. What is the correct thing to do with an interactive application when one of the cluster nodes goes down and how would you implement it at the back-end?

    Running a Cluster

    Systems don’t just sit there without changing, you have new versions of the OS and applications and requirements for configuration changes. This means that the people who run the cluster will ideally have some specialised cluster skills. If you hire sysadmins without regard to cluster skills then you will probably end up not hiring anyone who has any prior experience with the cluster configuration that you use. Learning to run a cluster is not like learning to run yet another typical Unix daemon, it requires some differences in the way things are done. All changes have to be strictly made to all nodes in the cluster, having a cluster fail-over to a node that wasn’t upgraded and can’t understand the new data is not fun at all!

    My observation is that the typical experience of having a team of sysadmins who have no prior cluster experience being hired to run a cluster usually involves “learning experiences” for everyone. It’s probably best to assume that every member of the team will break the cluster and cause down-time on at least one occasion! This can be alleviated by only having one or two people ever work on the cluster and having everyone else delegate cluster work to them. Of course if something goes wrong when the cluster experts aren’t available then the result is even more downtime than might otherwise be expected.

    Hiring sysadmins who have prior experience running a cluster with the software that you use is going to be very difficult. It seems that any organisation that is planning a cluster deployment should plan a training program for sysadmins. Have a set of test machines suitable for running a cluster and have every new hire install the cluster software and get it all working correctly. It’s expensive to buy extra systems for such testing, but it’s much more expensive to have people who lack necessary skills try and run your most important servers!

    The trend in recent years has been towards sysadmins not being system programmers. This may be a good thing in other areas but it seems that in the case of clustering it is very useful to have a degree of low level knowledge of the system that you can only gain by having some experience doing system coding in C.

    It’s also a good idea to have a test network which has machines in an almost identical configuration to the production servers. Being able to deploy patches to test machines before applying them in production is a really good thing.

    Conclusion

    Running a cluster is something that you should either do properly or not at all. If you do it badly then the result can easily be less uptime than a single well-run system.

    I am not suggesting that people avoid running clusters. You can take this post as a list of suggestions for what to avoid doing if you want a successful cluster deployment.


  • A Basic IPVS Configuration

    Posted on by etbe

    I have just configured IPVS on a Xen server for load balancing between multiple virtual hosts. The benefit is not load balancing but management. With two virtual machines providing a service I can gracefully shut one down for maintenance and have the other take the load. When there are two machines providing a service a load balancing configuration is much better than a hot-spare, one reason is the fact that there may be application scaling issues that prevent one machine with twice the resources from giving as much performance as two smaller machines. Another is the fact that if you have a machine configured but never used there will always be some doubt as to whether it would work…

    The first thing to do is to assign the IP address of the service to the front-end machine so that other machines on the segment (IE routers) will be able to send data to it. If the address for the service is 10.0.0.5 then the command “ip addr add dev eth0 10.0.0.5/24 broadcast +” will make it a secondary address on the eth0 interface. On a Debian system you would add the line “up ip addr add dev eth0 10.0.0.5/24 broadcast + || true” to the appropriate section of /etc/network/interfaces, for a Red Hat system it seems that /etc/rc.local is the best place for it. I expect that it would be possible to merely advertise the IP address via ARP without adding it to the interface, but the ability to ping the IPVS server on the service address seems useful and there seems no benefit in not assigning the address.

    There are three methods used by IPVS for forwarding packets, gatewaying/routing (the default), IPIP encapsulation (tunneling), and masquerading. The gatewaying/routing method requires the back-end server to respond to requests on the service address. That would mean assigning the address to the back-end server without advertising it via ARP (which seems likely to have some issues for managing the system). The IPIP encapsulation method requires setting up IPIP which seemed like it would be excessively difficult (although maybe not more than required to set up masquerading). The masquerading option (which I initially chose) rewrites the packets to have the IP address of the real server. So for example if the service address is 10.0.0.5 and the back-end server has the address 10.0.1.5 then it will see packets addresses to 10.0.1.5. A benefit of masquerading is that it allows you to use different ports, so for example you could have a non-virtualised mail server listening on port 25 and a back-end server for a virtual service listening on port 26. While there is no practical limit to the number of private IP addresses that you might use it seems easier to manage servers listening on different ports with the same IP address – and there is the issue of server programs that are not written to support binding to an IP address.

    ipvsadm -A -t 10.0.0.5:25 -s lblc -p
    ipvsadm -a -t 10.0.0.5:25 -r 10.0.1.5 -m

    The above two commands create an IPVS configuration that listens on port 25 of IP address 10.0.0.5 and then masquerades connections to 10.0.1.5 on port 25 (the default is to use the same port).

    Now the problem is in getting the packets to return via the IPVS server. If the IPVS server happens to be your default gateway then it’s not a problem and it will already be working after the above two commands (if a service is listening on 10.0.1.5 port 25).

    If the IPVS server is not the default gateway and you have only one IP address on the back-end server then this will require using netfilter to mark the packets and then route based on the packet matching. Marking via netfilter also seems to be the only well documented way of doing similar things. I spent some time working on this and didn’t get it working. However having multiple IP addresses per server is a recommended practice anyway (a back-end interface for communication between servers as well as a front-end interface for public data).

    ip rule add from 10.0.1.5 table 1
    ip route add default via 10.0.0.1 table 1

    I use the above two commands to set up a new routing table for the data for the virtual service. The first line causes any packets from 10.0.1.5 to be sent to routing table 1 (I currently have a rough plan to have table numbers match ethernet device numbers, the data in question is going out device eth1). The second line adds a default router to table 1 which sends all packets to 10.0.0.1 (the private IP address of the IPVS server).

    Then it SHOULD all be working, but in the network that I’m using (RHEL4 DomU and RHEL5 Dom0 and IPVS) it doesn’t. For some reason the data packets from the DomU are not seen as part of the same TCP stream (both in Net Filter connection tracking and by the TCP code in the kernel). So I get an established connection (3 way handshake completed) but no data transfer. The server sends the SMTP greeting repeatedly but nothing is received. At this stage I’m not sure whether there is something missing in my configuration or whether there’s a bug in IPVS. I would be happy to send tcpdump output to anyone who wants to try and figure it out.

    My next attempt at this was via routing. I removed the “-m” option from the ipvsadm command and added the service IP address to the back-end with the command “ifconfig lo:0 10.0.0.5 netmask 255.255.255.255” and configured the mail server to bind to port 25 on address 10.0.0.5. Success at last!

    Now I just have to get Piranha working to remove back-end servers from the list when they fail.

    Update: It’s quite important that when adding a single IP address to device lo:0 you use a netmask of 255.255.255.255. If you use the same netmask as the front-end device (which would seem like a reasonable thing to do) then (with RHEL4 kernels at least) you get proxy ARPs by default. For example you used netmask 255.255.255.0 to add address 10.0.0.5 to device lo:0 then on device eth0 the machine will start answering ARP requests for 10.0.0.6 etc. Havoc then ensues.


  • Kernel Security vs Uptime

    Posted on by etbe

    For best system security you want to apply kernel security patches ASAP. For an attacker gaining root access to a machine is often a two step process, the first step is to exploit a weakness in a non-root daemon or take over a user account, the second step is to compromise the kernel to gain root access. So even if a machine is not used for providing public shell access or any other task which involves giving user access to potential hostile people, having the kernel be secure is an important part of system security.

    One thing that gets little consideration is the overall effect of applying security updates on overall uptime. Over the last year there have been 14 security related updates (I count a silent data loss along with security issues) to the main Debian Etch kernel package. Of those 14, it seems that if you don’t use DCCP, NAT for CIFS or SNMP, IA64, the dialout group, then you will only need to patch for issues 2, 3 (for SMP machines), 4, 5, 7 (sound drivers get loaded on all machines by default), 9, 10, 11, 12, 13, and 14.

    This means 11 reboots a year for SMP machines and 10 a year for uni-processor machines. If a reboot takes three minutes (which is an optimistic assumption) then that would be 30 or 33 minutes of downtime a year due to kernel upgrades. In terms of uptime we talk about the number of “nines”, where the ideal is generally regarded as “five nines” or 99.999% uptime. 33 minutes of downtime a year for kernel upgrades means that you get 99.993% uptime (which is “four nines”). If a reboot takes six minutes (which is not uncommon for servers) then it’s 99.987% uptime (“thee nines”).

    While it doesn’t seem likely to affect the number of “nines” you get, not using SMP has the potential to avoid future security issues. So it seems that when using a Xen (or other virtualisation technology) assigning only one CPU to the DomUs that don’t need any more could improve uptime for them.

    For Xen Dom0′s which don’t have local users or daemons, don’t use DCCP, NAT for CIFS or SNMP, wireless, CIFS, JFFS2, PPPoE, bluetooth, H.323 or SCTP connection tracking, then only issue 11 applies. However for “five nines” you need to have 5 minutes of downtime a year or less. It seems unlikely that a busy Xen server can be rebooted in 5 minutes as all the DomUs need to have their memory saved to disk (writing out the data to disk and reading it back in after a reboot will probably take at least a couple of minutes) or they need to be shutdown and booted again after the Dom0 is rebooted (which is a good procedure if the security fix affects both Dom0 and DomU use), and such shutdowns and reboots of DomU’s will take a lot of time.

    Based on the past year, it seems that a system running as a basic server might get “four nines” if configured for a fast boot (it’s surprising that no-one seems to be talking about recent improvements to the speed of booting as high-availability features) and if the boot is slower then you are looking at “three nines”. For a Xen server unless you have some sort of cluster it seems that “five nines” is unattainable due to reboot times if there is one issue a year, but “four nines” should be easy to get.

    Now while the 14 issues over the last year for the kernel seems likely to be a pattern that will continue, the one issue which affects Xen may not be representative (small numbers are not statistically significant). I feel confident in predicting a need for between 5 and 20 kernel updates next year due to kernel security issues, but I would not be prepared to bet on whether the number of issues affecting Xen will be 0, 1, or 4 (it seems unlikely that there would be 5 or more).

    I will write a future post about some strategies for mitigating these issues.

    Here is my summary of the Debian kernel linux-image-2.6.18-6-686 (Etch kernel) security updates according to it’s changelog, they are not in chronological order, it’s the order of the changelog file:
    (more…)


  • ISP Redundancy and Virtualisation

    Posted on by etbe

    If you want a reliable network then you need to determine an appropriate level of redundancy. When servers were small and there was no well accepted virtual machine technology there were always many points at which redundancy could be employed.

    A common example is a large mail server. You might have MX servers to receive mail from the Internet, front-end servers to send mail to the Internet, database or LDAP servers (of which there is one server for accepting writes and redundant slave servers for allowing clients to read data), and some back-end storage. The back-end storage is generally going to lack redundancy to some degree (all the common options involve mail being stored in one location). So the redundancy would start with the routers which direct traffic to redundant servers (typically a pair of routers in a failover configuration – I would use OpenBSD boxes running CARP if I was given a choice in how to implement this [1], in the past I’ve used Cisco devices).

    The next obvious place for redundancy is for the MX servers (it seems that most ISPs have machines with names such as mx01.example.net to receive mail from the Internet). The way that MX records are used in the DNS means that there is no need for a router to direct traffic to a pair of servers, and even a pair of redundant routers is another point of failure so it’s best to avoid them where possible. A smaller ISP might have two MX machines that are used for both sending outbound mail from their users (which needs to go through a load-balancing router) as well as inbound mail. A larger ISP will have two or more machines dedicated to receiving mail and two or more machines dedicated to sending mail (when you scan for viruses on both sent and received mail it can take a lot of compute power).

    Now the database or LDAP servers used for storing user account data is another possible place for redundancy. While some database and LDAP servers support multi-master operation a more common configuration is to have a single master and multiple slaves which are read-only. This means that you want to have more slaves than are really required so that you can lose one without impacting the service.

    There are several ways of losing a server. The most obvious is a hardware failure. While server class machines will have redundant PSUs, RAID, ECC RAM, and a general high quality of hardware design and manufacture, they still have hardware problems from time to time. Then there are a variety of software related ways of losing a server, most of which stem from operator error and bugs in software. Of course the problem with the operator errors and software bugs is that they can easily take out all redundant machines. If an operator mistakenly decides that a certain command needs to be run on all machines they will often run it on all machines before realising that it causes things to go horribly wrong. A software bug will usually be triggered by the same thing on all machines (EG I’ve had bad data written to a master LDAP server cause all slaves to crash and had a mail loop between two big ISPs take out all front-end mail servers).

    Now if you have a mail server running on a virtual platform such that the MX servers, the mail store, and the database servers all run on the same hardware then redundancy is very unlikely to alleviate hardware problems. It’s difficult to imagine a situation where a hardware failure takes out one DomU while leaving others running.

    It seems to me that if you are running on a single virtual server there is no benefit in having redundancy. However there is benefit in having an infrastructure which supports redundancy. For example if you are going to install new software on one of the servers there is a possibility that the software will fail. Doing upgrades and then having to roll them back is one of the least pleasant parts of sys-admin work, not only is it difficult but it’s also unreliable (new software writes different data to shared files and you have to hope that the old version can cope with them).

    To implement this you need to have a Dom0 that can direct traffic to multiple redundant servers for services which only have a single server. Then when you need to upgrade (be it the application or the OS) you can configure a server on the designated secondary address, get it running, and then disable traffic to the primary server. If there are any problems you can direct traffic back to the primary server (which can be done much more quickly than downgrading software). Also if configured correctly you could have the secondary server be accessible from certain IP addresses only. So you could test the new version of the software using employees as test users while customers use the old version.

    One advantage a virtual machine environment for load balancing is that you can have as many virtual Ethernet devices as you desire and you can configure them using software (without changing cables in the server room). A limitation on the use of load-balancing routers is that traffic needs to go through the router in both directions. This is easy for the path from the Internet to the server room and the path from the server room to the customer network. But when going between servers in the server room it’s a problem (which is not insurmountable, merely painful and expensive). Of course there will be a cost in CPU time for all the extra routing. If instead of having a single virtual ethernet device for all redundant nodes you have a virtual ethernet device for every type of server and use the Dom0 as a router you will end up doubling the CPU requirements for networking without even considering the potential overhead of the load balancing router functionality.

    Finally there is a significant benefit in virtual machines for reliability of services. That is the ability to perform snapshot backups. If you have sufficient disk space and IO capacity you could have a snapshot of your server taken every day and store several old snapshots. Of course doing this effectively would require some minor changes to the configuration of machines to avoid unnecessary writes, this would include not compressing old log files and using a ram disk for /tmp and any other filesystem with transient data. When you have snapshots you can then run filesystem analysis tools on the snapshots to detect any silent corruption that may be occurring and give the potential benefit of discovering corruption before it gets severe (but I have yet to see a confirmed report of this saving anyone). Of course similar snapshot facilities are available on almost every SAN and on many NAS devices, but there are many sites that don’t have the budget to use such equipment.


  • ECC RAM is more useful than RAID

    Posted on by etbe

    A common myth in the computer industry seems to be that ECC (Error Correcting Code – a Hamming Code [0]) RAM is only a server feature.

    The difference between a server and a desktop machine (in terms of utility) is that a server performs tasks for many people while a desktop machine only performs tasks for one person. Therefore when purchasing a desktop machine you can decide how much you are willing to spend for the safety and continuity of your work. For a server it’s more difficult as everyone has a different idea of how reliable a server should be in terms of uptime and in terms of data security. When running a server for a business there is the additional issue of customer confidence. If a server goes down occasionally customers start wondering what else might be wrong and considering whether they should trust their credit card details to the online ordering system.

    So it is obviously apparent that servers need a different degree of reliability – and it’s easy to justify spending the money.

    Desktop machines also need reliability, more so than most people expect. In a business when a desktop machine crashes it wastes employee time. If a crash wastes an hour (which is not unlikely given that previously saved work may need to be re-checked) then it can easily cost the business $100 (the value of the other work that the employee might have done). Two such crashes per week could cost the business as much as $8000 per year. The price difference between a typical desktop machine and a low-end workstation (or deskside server) is considerably less than that (when I investigated the prices almost a year ago desktop machines with server features ranged in price from $800 to $2400 [1]).

    Some machines in a home environment need significant reliability. For example when students are completing high-school their assignments have a lot of time invested in them. Losing an assignment due to a computer problem shortly before it’s due in could impact their ability to get a place in the university course that they most desire! Then there is also data which is irreplaceable, one example I heard of was of a woman who’s computer had a factory pre-load of Windows, during a storm the machine rebooted and reinstalled itself to the factory defaults – wiping several years of baby photos… In both cases better backups would mostly solve the problem.

    For business use the common scenario is to have file servers storing all data and have very little data stored on the PC (ideally have no data on the PC). In this case a disk error would not lose any data (unless the swap space was corrupted and something important was paged out when the disk failed). For home use the backup requirements are quite small. If a student is working on an important assignment then they can back it up to removable media whenever they reach a milestone. Probably the best protection against disk errors destroying assignments would be a bulk purchase of USB flash storage sticks.

    Disk errors are usually easy to detect. Most errors are in the form of data which can not be read back, when that happens the OS will give an error message to the user explaining what happened. Then if you have good backups you revert to them and hope that you didn’t lose too much work in the mean-time (you also hope that your backups are actually readable – but that’s another issue). The less common errors are lost-writes – where the OS writes data to disk but the disk doesn’t store it. This is a little more difficult to discover as the drive will return bad data (maybe an old version of the file data or maybe data from a different file) and claim it to be good.

    The general idea nowadays is that a filesystem should check the consistency of the data it returns. Two new filesystems, ZFS from Sun [2] and BTRFS from Oracle [3] implement checksums of data stored on disk. ZFS is apparently production ready while BTRFS is apparently not nearly ready. I expect that from now on whenever anyone designs a filesystem for anything but the smallest machines (EG PDAs and phones) they will include data integrity mechanisms in the design.

    I believe that once such features become commonly used the need for RAID on low-end systems will dramatically decrease. A combination of good backups and knowing when your live data is corrupted will often be a good substitute for preserving the integrity of the live data. Not that RAID will necessarily protect your data – with most RAID configurations if a hard disk returns bad data and claims it to be good (the case of lost writes) then the system will not read data from other disks for checksum validation and the bad data will be accepted.

    It’s easy to compute checksums of important files and verify them later. One simple way of doing so is to compress the files, every file compression program that I’ve seen has some degree of error detection.

    Now the real problem with RAM which lacks ECC is that it can lose data without the user knowing. There is no possibility of software checks because any software which checks for data integrity could itself be mislead by memory errors. I once had a machine which experienced filesystem corruption on occasion, eventually I discovered that it had a memory error (memtest86+ reported a problem). I will never know whether some data was corrupted on disk because of this. Sifting through a large amount of stored data for some files which may have been corrupted due to memory errors is almost impossible. Especially when there was a period of weeks of unreliable operation of the machine in question.

    Checking the integrity of file data by using the verify option of a file compression utility, fsck on a filesystem that stores checksums on data, or any of the other methods is not difficult.

    I have a lot of important data on machines that don’t have ECC. One reason is that machines which have ECC cost more and have other trade-offs (more expensive parts, more noise, more electricity use, and the small supply makes it difficult to get good deals). Another is that there appear to be no laptops which support ECC (I use a laptop for most of my work). On the other hand RAID is very cheap and simple to implement, just buy a second hard disk and install software RAID – I think that all modern OSs support RAID as a standard installation option. So in spite of the fact that RAID does less good than a combination of ECC RAM and good backups (which are necessary even if you have RAID), it’s going to remain more popular in high-end desktop systems for a long time.

    The next development that seems interesting is the large portion of the PC market which is designed not to have the space for more than one hard disk. Such compact machines (known as Small Form Factor or SFF) could easily be designed to support ECC RAM. Hopefully the PC companies will add reliability features in one area while removing them in another.


  • Controlling a STONITH and Upgrading a Cluster

    Posted on by etbe

    One situation that you will occasionally encounter when running a Heartbeat cluster is a need to prevent a STONITH of a node. As documented in my previous post about testing STONITH the ability to STONITH nodes is very important in an operating cluster. However when the sys-admin is performing maintenance on the system or programmers are working on a development or test system it can be rather annoying.

    One example of where STONITH is undesired is when upgrading packages of software related to the cluster services. If during a package upgrade the data files and programs related to the OCF script are not synchronised (EG you have two programs that interact and upgrading one requires upgrading the other) at the moment that the status operation is run then an error may occur which may trigger a STONITH. Another possibility is that if using small systems for testing or development (EG running a cluster under Xen with minimal RAM assigned to each node) then a package upgrade may cause the system to thrash which might then cause a timeout of the status scripts (a problem I encounter when upgrading my Xen test instances that have 64M of RAM).

    If a STONITH occurs during the process of a package upgrade then you are likely to have consistency problems with the OS due to RPM and DPKG not correctly calling fsync(), this can cause the OCF scripts to always fail to run the status command which can cause an infinite loop of the cluster nodes in question being STONITHed. Incidentally the best way to test for this (given the problems of a STONITH sometimes losing log data) is to boot the node in question without Heartbeat running and then run the OCF status commands manually (I previously documented three ways of doing this).

    Of course the ideal (and recommended) way of solving this problem is to migrate all services from a node using the crm_resource program. But in a test or development situation you may forget to migrate all services or simply forget to run the migration before the package upgrade starts. In that case the best thing to do is to be able to remove the ability to call STONITH . For my testing I use Xen and have the nodes ssh to the Dom0 to call STONITH, so all I have to do to remove the STONITH ability is to stop the ssh daemon on the Dom0. For a more serious test network (EG using IPMI or an equivalent technology to perform a hardware STONITH as well as ssh for OS level STONITH on a private network) a viable option might be to shut down the switch port used for such operations – shutting down switch ports is not a nice thing to do, but to allow you to continue work on a development environment without hassle it’s a reasonable hack.

    When choosing your method of STONITH it’s probably worth considering what the possibilities are for temporarily disabling it – preferably without having to walk to the server room.


  • Starting a Heartbeat Resource Without Heartbeat

    Posted on by etbe

    The command crm_resource allows you to do basic editing of resources in the Heartbeat configuration database. But sometimes you need to do different things and the tool xmlstarlet is a good option.

    The below script can be used for testing Heartbeat OCF resource scripts. It uses the Heartbeat management program cibadmin to get the XML configuration data and then uses xmlstarlet to process it. The sel option for xmlstarlet selects some data from an XML file, the -t -m options instruct it to match data from a template. The template is the /resources/primitive part. The --value-of expression will print the values of some labels from the XML. The script will concatenate the name and value tags and export them as environment variables (see my post about Configuring a Heartbeat Service for an explanation of the use of the variables). The TYPE variable is the name of the script under the /usr/lib/ocf/resource.d/heartbeat directory.

    In recent versions of Heartbeat (2.1.x) the OCF_ROOT environment variable must be set before an OCF script is called. Setting it on older versions of Heartbeat doesn’t do any harm so I unconditionally set it in this script (which should work for all 2.x.x versions of Heartbeat).

    The first parameter for the script is the id of the service to be operated on and the second parameter is the operation to perform (start, stop, and status are the only interesting values). The script will echo the exit code to the screen (0 means success, 7 means that the service is not running or the operation failed, and any other number means a serious error that will trigger a STONITH if Heartbeat gets it).

    #!/bin/sh
    $(cibadmin -Q -o resources| xmlstarlet sel -t -m \
    "/resources/primitive [@id='$1']/instance_attributes/attributes/nvpair" \
    --value-of "concat('export OCF_RESKEY_',@name,'=',@value,'
','TYPE=',../../../@type,'
')")
    OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/$TYPE $2
    echo $?

    Below is another version of the same script that instead uses crm_resource to get the XML data. The output of crm_resource has a couple of lines of non-XML data at the start (removed by the grep) and also only gives the XML tree related to the primitive in question (so the /resources part is removed from the xmlstarlet command-line).

    #!/bin/sh
    $(crm_resource -r $1 -x | grep -v ^[a-z] | xmlstarlet sel -t -m \
      "/primitive/instance_attributes/attributes/nvpair" --value-of \
      "concat('export OCF_RESKEY_',@name,'=',@value,'
','TYPE=',../../../@type,'
')")
    OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/$TYPE $2
    echo $?

    The problem with both of those scripts is that they rely on Heartbeat being operational. Performing any operations other than a status check while Heartbeat is running is a risky thing to do. If Heartbeat starts a service at the same time as you start it via such a script then the results will probably be undesired. One situation where it is safe to run this is when a service fails to start. After it has failed repeatedly Heartbeat may stop trying to restart it (depending on the configuration) in which case it will be safe to try and start it. Also you can put in temporary constraints to stop the resource from running by repeatedly running crm_resource -M -r ID until all nodes have been prohibited from running it (make sure you run crm_resource -U -r ID afterwards to remove the temporary constraints).

    The following script does the same thing but directly reads the XML file for the Heartbeat configuration. This is designed to be used when Heartbeat is not running. For example you could copy the XML file from a running cluster to a test machine and then test your OCF resource scripts.

    #!/bin/sh
    $(cat /var/lib/heartbeat/crm/cib.xml| xmlstarlet sel -t -m \
      "/cib/configuration/resources/primitive [@id='$1']/instance_attributes/attributes/nvpair" \
      --value-of \
      "concat('export OCF_RESKEY_',@name,'=',@value,'
','TYPE=',../../../@type,'
')")
    OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/heartbeat/$TYPE $2
    echo $?


  • Testing STONITH

    Posted on by etbe

    One problem that I have had in configuring Heartbeat clusters is in performing a STONITH that originates outside the Heartbeat system.

    STONITH was designed for the Heartbeat system to know when a node is not operating correctly (this can either be determined by the node itself or by other nodes in the network) and then force a hardware reset so that the non-functional node will not interfere with another node that is designated to take over the service.

    However sometimes code that is called by Heartbeat will have more information about the state of the system than Heartbeat can access. For example if I have a service that accesses a filesystem on an external RAID then it’s common for the RAID to track who is accessing it. In some situations the RAID hardware has the ability to “fence” the access (so that when machine B mounts the filesystem machine A can no longer access it). In other situations the RAID may only be capable of informing the system that another machine is registered as the owner of the device. To solve this problem a machine that is to mount such a device must either prohibit the previous owner from accessing the device (which may be impossible or unreasonably difficult) or reset the previous owner.

    Until recently I had been doing this by writing some code to extract the STONITH configuration from the CIB and call the stonith utility. The problem with this is that there is no requirement that every node be capable of performing a STONITH on every other node, and that even if every node is are designed to be capable of rebooting every other node a partial failure condition may restrict the set of nodes that are capable of performing a STONITH on the target.

    Currently the recommended way of doing this is via the test program. Below is an example of the command used to reset the node node-1 with a timeout of 20000ms and the result of it being successfully completed. I have suggested that the Heartbeat developers make an official interface for doing this (rather than a test of the API) and I believe that this is being considered. In the mean time the following is the only way of doing it:

    # /usr/lib/heartbeat/stonithdtest/apitest 1 node-1 20000 0
    optype=1, node_name=node-1, result=0, node_list=node-0



  • dinamic_sidebar 4 none

©2012 etbe - Russell Coker Entries (RSS) and Comments (RSS)  Raindrops Theme