Xen and Heartbeat

Xen (a system for running multiple virtual Linux machines) has some obvious benefits for testing Heartbeat (the clustering system): the cheapest new machine on sale in Australia can be used to simulate a four node cluster. I’m not sure whether there is any production use for a cluster running under Xen […]

configuring a Heartbeat service

In my last post about Heartbeat I gave an example of a script to start and stop a cluster service. In that post I omitted to mention that the script goes in the directory /usr/lib/ocf/resource.d/heartbeat.

To actually use the script you need to write some XML configuration to tell Heartbeat which parameters should be passed […]

Heartbeat service scripts

A service script for Heartbeat needs to support at least three operations: start, stop, and status. Each operation returns 0 on success, 7 on failure (which in the case of the monitor script means that the service is not running), and any other value to indicate that something has gone wrong.

In the second […]
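The return-code convention described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the script from the post: the "service" is a hypothetical marker file, and the names `STATE_FILE`, `start`, `stop`, `status` and `main` are invented for the example. The value 1 is an arbitrary choice for "any other value".

```python
import os
import tempfile

# Exit codes taken from the text: 0 for success, 7 for "not running".
OK = 0
NOT_RUNNING = 7
ERROR = 1  # arbitrary "other value" meaning something has gone wrong

# Hypothetical marker file standing in for a real service (an assumption).
STATE_FILE = os.path.join(tempfile.gettempdir(), "demo-heartbeat-service.state")


def status(state_file=STATE_FILE):
    """Report 0 if the pretend service is running, 7 if it is not."""
    return OK if os.path.exists(state_file) else NOT_RUNNING


def start(state_file=STATE_FILE):
    """Mark the pretend service as running."""
    try:
        with open(state_file, "w") as f:
            f.write("running\n")
    except OSError:
        return ERROR
    return OK


def stop(state_file=STATE_FILE):
    """Mark the pretend service as stopped."""
    try:
        os.remove(state_file)
    except FileNotFoundError:
        pass  # already stopped, which this sketch treats as success
    except OSError:
        return ERROR
    return OK


OPERATIONS = {"start": start, "stop": stop, "status": status}


def main(argv):
    """Dispatch on the operation name given as the first argument."""
    if len(argv) != 2 or argv[1] not in OPERATIONS:
        return ERROR
    return OPERATIONS[argv[1]]()
```

A real script would finish with something like `sys.exit(main(sys.argv))` so that the cluster software sees the return value as the process exit status.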

Another Heartbeat 2.0 STONITH example configuration

In a Heartbeat cluster installation it may not be possible to use a single STONITH device to reboot all nodes. To support this, multiple STONITH devices can be configured, each of which is used to reboot different nodes in the cluster. In the following code section there is an example of […]

Heartbeat version 2.0 CIB STONITH example configuration

Below is a sample script to configure the ssh STONITH agent for the Heartbeat system. STONITH will reboot nodes when things go wrong to restore the integrity of the cluster.

The STONITH test program supports the -n option to list parameters and the -l option to list nodes. The following is an example of using […]

priorities for heartbeat services

Currently I am considering the priority scheme to use for some highly available services running on Linux with Heartbeat.

The Heartbeat system has a number of factors that can be used to determine the weight for running a particular service on a given node. One is the connectivity to other systems determined by ping (every […]

more about Heartbeat

In a comment on my blog post “a Heartbeat developer comments on my blog post” Alan Robertson writes: I got in a hurry on my math because of the emergency. So, there are even more assumptions (errors?) than I documented. In particular, the probability model I gave was for a particular node to fail. So […]
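The difference between one particular node failing and any node failing can be made concrete with a little arithmetic. The following is a minimal sketch and not the model from the comment: it assumes every node fails independently with the same probability `p`, and the function names and the sample value 0.1 are made up for illustration.

```python
def p_particular_node_fails(p):
    """Probability that one specific node fails (assumed to be p)."""
    return p


def p_any_node_fails(p, n):
    """Probability that at least one of n independent nodes fails."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.1  # hypothetical per-node failure probability
    for n in (2, 3, 6):
        print(n, p_particular_node_fails(p), round(p_any_node_fails(p, n), 4))
```

Under these assumptions the chance that some node fails grows with the size of the cluster, while the chance that a given node fails stays at `p`.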

a Heartbeat developer comments on my blog post

Alan Robertson (a major contributor to the Heartbeat project) commented on my post “failure probability and clusters”. His comment deserves wider readership than a comment generally gets, so I’m making a post out of it. Here it is:

One of my favorite phrases is “complexity is the enemy of reliability”. This is absolutely true, […]

2 node vs 3+ node clusters

A comment on my post about the failure probability of clusters suggested that a six node cluster that has one node fail should become a five node cluster.

The problem with this is what to do when nodes recover from a failure. For example if a six node cluster had a node fail and became […]

failure probability and clusters

When running a high-availability cluster of two nodes it will generally be configured such that if one node fails then the other runs. Some common operation (such as accessing a shared storage device or pinging a router) will be used by the surviving node to determine that the other node is dead and that it’s […]