One situation that you will occasionally encounter when running a Heartbeat cluster is a need to prevent a STONITH of a node. As documented in my previous post about testing STONITH, the ability to STONITH nodes is very important in an operating cluster. However, when a sysadmin is performing maintenance on the system, or when programmers are working on a development or test system, it can be rather annoying.
One example of where STONITH is undesired is when upgrading packages of software related to the cluster services. If, during a package upgrade, the data files and programs related to the OCF script are not synchronised (e.g. you have two interacting programs and upgrading one requires upgrading the other) at the moment the status operation runs, then an error may occur which may trigger a STONITH. Another possibility is that if you use small systems for testing or development (e.g. running a cluster under Xen with minimal RAM assigned to each node), a package upgrade may cause the system to thrash, which might then cause a timeout of the status scripts (a problem I encounter when upgrading my Xen test instances that have 64M of RAM).
If a STONITH occurs during a package upgrade then you are likely to have consistency problems with the OS due to RPM and dpkg not correctly calling fsync(). This can cause the OCF scripts to always fail the status command, which can result in an infinite loop of the cluster nodes in question being STONITHed. Incidentally, the best way to test for this (given that a STONITH sometimes loses log data) is to boot the node in question without Heartbeat running and then run the OCF status commands manually (I previously documented three ways of doing this).
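As a rough sketch of one such manual test (the IPaddr2 agent and its ip parameter are just examples, substitute the agent and OCF_RESKEY_* parameters of the resource you are debugging):

    # run the resource agent's monitor (status) action by hand
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_ip=10.0.0.1          # example parameter for the example agent
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor
    echo $?                                # 0 = running, 7 = not running, other = error

An exit code that is always non-zero (or a command that hangs) when run this way indicates the sort of damaged state that leads to a STONITH loop.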
Of course the ideal (and recommended) way of solving this problem is to migrate all services off the node with the crm_resource program. But in a test or development situation you may forget to migrate all services, or simply forget to run the migration before the package upgrade starts. In that case the best thing to do is to remove the ability to call STONITH. For my testing I use Xen and have the nodes ssh to the Dom0 to call STONITH, so all I have to do to remove the STONITH ability is to stop the ssh daemon on the Dom0. For a more serious test network (e.g. one using IPMI or an equivalent technology to perform a hardware STONITH as well as ssh for an OS-level STONITH on a private network) a viable option might be to shut down the switch port used for such operations. Shutting down switch ports is not a nice thing to do, but to allow you to continue work on a development environment without hassle it's a reasonable hack.
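For completeness, here is a sketch of what the recommended migration looks like (my_resource and node2 are invented names, and the exact options vary between versions, so check crm_resource(8) on your system):

    # push a resource away from the node you are about to work on
    crm_resource -M -r my_resource -H node2   # -M migrate, -r resource, -H destination node
    # after the maintenance is done, remove the constraint so it can move back
    crm_resource -U -r my_resource

Note that -M works by adding a location constraint, so forgetting the -U step will leave the resource pinned away from its original node.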
When choosing your method of STONITH it’s probably worth considering the possibilities for temporarily disabling it, preferably without having to walk to the server room.