Linux, politics, and other interesting things
DRBD is a system for replicating a block device across multiple systems. It’s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it’s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this.
I’m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I’m not using DRBD because the failure conditions that I’m planning for don’t include two disks entirely failing. I’m planning for a system having an outage for a while so it’s OK to have some inbound and outbound mail delayed but it’s not OK for the mail store to be unavailable.
In the common section I changed the protocol from “C” to “B“, this means that a write() system call returns after data is committed locally and sent to the other node. This means that if the primary node goes permanently offline AND if the secondary node has a transient power failure or kernel crash causing the buffer contents to be lost then writes can be lost. I don’t think that this scenario is likely enough to make it worth choosing protocol C and requiring that all writes go to disk on both nodes before they are considered to be complete.
In the net section I added the following:
This uses a larger network sending buffer (apparently good for fast local networks – although I’d have expected that the low delay on a local Gig-E would give a low bandwidth delay product) and to use sha1 hashes on all packets (why does it default to no data integrity).
The default port number that is used is 7789. I think it’s best to use ports below 1024 for system services so I’ve setup some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I’ll use different ports even if they aren’t configured to ever run on the same system. You never know when the cluster assignment will change and DRBD port numbers seems like something that could potentially cause real problems if there was a port conflict.
Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU, there is some degree of protection against a DomU using a block device that is already being used in the Dom0 but no protection against the Dom0 messing with the DomU’s resources.
It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device.
As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn’t be mounted locally and which mount options are permitted – it MIGHT be OK to do a read-only mount of a DomU’s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.
After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don’t then it will probably refuse to accept one copy of the data as primary as it can’t seem to realise that both are inconsistent. I can’t understand why it does this in the case where there are two nodes with inconsistent data, you know for sure that there is no good data so there should be an operation to zero both devices and make them equal. Instead there
The solution sometimes seems to be to run “drbdsetup /dev/drbd0 primary –” (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running “drbdadm -- --overwrite-data-of-peer primary db0-mysql” (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res).
Also it seems that some commands can only be run from one node. So if you have a primary node that’s in service and another node in Secondary/Unknown state (IE disconnected) with data state Inconsistent/DUnknown then while you would expect to be able to connect from the secondary node is appears that nothing other than a “drbdadm connect” command run from the primary node will get things going.