Some Notes on DRBD

DRBD is a system for replicating a block device across multiple systems. It’s most commonly used for having one system write to the DRBD block device such that all writes are written to a local disk and a remote disk. In the default configuration a write is not complete until it’s committed to disk locally and remotely. There is support for having multiple systems write to disk at the same time, but naturally that only works if the filesystem drivers are aware of this.

I’m installing DRBD on some Debian/Squeeze servers for the purpose of mirroring a mail store across multiple systems. For the virtual machines which run mail queues I’m not using DRBD because the failure conditions that I’m planning for don’t include two disks entirely failing. I’m planning for a system having an outage for a while so it’s OK to have some inbound and outbound mail delayed but it’s not OK for the mail store to be unavailable.

Global changes I’ve made in /etc/drbd.d/global_common.conf

In the common section I changed the protocol from “C” to “B”. This means that a write() system call returns after the data is committed to disk locally and has been received by the other node, but not necessarily written to its disk. So if the primary node goes permanently offline AND the secondary node has a transient power failure or kernel crash that loses its buffer contents, then writes can be lost. I don’t think that this scenario is likely enough to make it worth choosing protocol C, which requires that all writes go to disk on both nodes before they are considered complete.

In the net section I added the following:

sndbuf-size 512k;
data-integrity-alg sha1;

This uses a larger network send buffer (apparently good for fast local networks, although I’d have expected that the low delay on a local Gig-E would give a low bandwidth-delay product) and uses SHA-1 hashes to check the integrity of all packets (why does it default to no data integrity checking?).
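Put together, the changes described above might look roughly like the following fragment of /etc/drbd.d/global_common.conf. This is only a sketch that assumes the DRBD 8.3 file layout shipped with Squeeze, where the protocol is set directly in the common section; the other sections of the stock file (handlers, startup, disk, syncer) are omitted here.

```
common {
    # Protocol B: a write completes once it is on the local disk
    # and has reached the peer's memory.
    protocol B;

    net {
        sndbuf-size 512k;          # larger TCP send buffer
        data-integrity-alg sha1;   # checksum every replicated packet
    }
}
```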

Reserved Numbers

The default port number that is used is 7789. I think it’s best to use ports below 1024 for system services so I’ve set up some systems starting with port 100 and going up from there. I use a different port for every DRBD instance, so if I have two clustered resources on a LAN then I’ll use different ports even if they aren’t configured to ever run on the same system. You never know when the cluster assignment will change, and DRBD port numbers seem like something that could cause real problems if there was a port conflict.
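One way to catch an accidental port clash is to scan the resource files from all clusters on the LAN for ports that appear in more than one file. The function below is a minimal sketch of that idea; the name duplicate_drbd_ports is made up for illustration, and it assumes each address line sits on its own line in the usual "address host:port;" form.

```shell
# Print every port that is used by more than one of the given
# DRBD resource files. Each port is counted once per file, so the
# two "on" sections of one resource sharing a port is not a clash.
duplicate_drbd_ports() {
    for f in "$@"; do
        # Pull the port out of lines such as "address 192.168.0.1:100;"
        sed -n 's/^[[:space:]]*address[[:space:]].*:\([0-9][0-9]*\);.*/\1/p' "$f" | sort -u
    done | sort -n | uniq -d
}
```

It could be run as `duplicate_drbd_ports /etc/drbd.d/*.res` against a directory holding copies of the resource files from every cluster; no output means no port is shared between files.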

Most of the documentation assumes that the DRBD device nodes on a system will start at /dev/drbd0 and increment, but this is not a requirement. I am configuring things such that there will only ever be one /dev/drbd0 on a network. This means that there is no possibility of a cut/paste error in a /etc/fstab file or a Xen configuration file causing data loss. As an aside I recently discovered that a Xen Dom0 can do a read-write mount of a block device that is being used read-write by a Xen DomU. There is some degree of protection against a DomU using a block device that is already being used in the Dom0, but no protection against the Dom0 messing with the DomU’s resources.
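As an illustration of network-wide unique ports and device numbers, a resource file such as /etc/drbd.d/db0-mysql.res might look something like this. It is a sketch in DRBD 8.3 syntax: the host names, IP addresses, backing device and the choice of minor number 10 are all made up for the example, and the host names would have to match the output of "uname -n" on each node.

```
resource db0-mysql {
    device    /dev/drbd10;         # minor number not used by any other resource on the LAN
    disk      /dev/vg0/db0-mysql;  # hypothetical backing device
    meta-disk internal;

    on node-a {
        address 192.168.0.1:100;   # port not used by any other resource on the LAN
    }
    on node-b {
        address 192.168.0.2:100;
    }
}
```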

It would be nice if there was an option of using some device name other than /dev/drbdX where X is a number. Using meaningful names would reduce the incidence of doing things to the wrong device.

As an aside it would be nice if there was some sort of mount helper for determining which devices shouldn’t be mounted locally and which mount options are permitted – it MIGHT be OK to do a read-only mount of a DomU’s filesystem in the Dom0 but probably all mounting should be prevented. Also a mount helper for such things would ideally be able to change the default mount options, for example it could make the defaults be nosuid,nodev (or even noexec,nodev) when mounting filesystems from removable devices.

Initial Synchronisation

After a few trials it seems to me that things generally work if you create DRBD on two nodes at the same time and then immediately make one of them primary. If you don’t then it will probably refuse to accept one copy of the data as primary, as it can’t seem to realise that both are inconsistent. I can’t understand why it does this in the case where there are two nodes with inconsistent data. You know for sure that there is no good data, so there should be an operation to zero both devices and make them equal. Instead there

The solution sometimes seems to be to run “drbdsetup /dev/drbd0 primary –” (where drbd0 is replaced with the appropriate device). This seems to work well and allowed me to create a DRBD installation before I had installed the second server. If the servers have been connected in Inconsistent/Inconsistent state then the solution seems to involve running “drbdadm -- --overwrite-data-of-peer primary db0-mysql” (for the case of a resource named db0-mysql defined in /etc/drbd.d/db0-mysql.res).
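For reference, the bring-up sequence on the node that is to become primary can be sketched as a small shell function. The create-md and up steps come from the DRBD 8.3 User’s Guide rather than from the notes above, and the run helper, the DRY_RUN variable and the function name are hypothetical additions that allow the commands to be printed instead of executed.

```shell
# Run a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# First-time bring-up on the node that should become primary.
# $1 is the resource name, for example db0-mysql.
drbd_first_primary() {
    res=$1
    run drbdadm create-md "$res"   # write the DRBD metadata (also needed on the peer)
    run drbdadm up "$res"          # attach the backing disk and try to connect
    # Declare this node's data authoritative, which starts the full sync
    # once the peer is connected.
    run drbdadm -- --overwrite-data-of-peer primary "$res"
}
```

Calling `DRY_RUN=1 drbd_first_primary db0-mysql` prints the three commands without touching any device, which is a cheap way to check the resource name before running it for real.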

Also it seems that some commands can only be run from one node. So if you have a primary node that’s in service and another node in Secondary/Unknown state (i.e. disconnected) with data state Inconsistent/DUnknown, then while you would expect to be able to connect from the secondary node, it appears that nothing other than a “drbdadm connect” command run from the primary node will get things going.

6 comments to Some Notes on DRBD

  • Hi,

    To mount a resource by name have a look at /dev/drbd/by-name

    HTH
    Saz

  • etbe

    Thanks for the suggestion, the /dev/drbd/by-res directory on my systems seems to have what you describe.

  • Sorry, that’s what I meant.

    As long as both volumes are connected and in inconsistent state, you have to choose one of them and sync them initially. No matter how long those volumes have stayed inconsistent, it has worked for me every time. If you meant that you’re installing one server and setting up DRBD to a usable state (except replication), afair you mark one system as primary and the (later installed) server should sync itself.

    To avoid problems with two DRBD devices in the same network, you can set a shared-secret in the configuration. If you’re using per-resource secrets, a connection between mismatched resources will never be established; instead you’ll get an error in your log telling you that something seems to be wrong.

    Also think about enabling on-line device verification from time to time. It makes you aware of any problems.

    Running commands from one node (where possible, not all commands are) is only possible if nodes are connected. If one node is in state ‘unknown’, there might be no connection.
    drbd-overview is a nice front-end to the actual state of your DRBD devices. Much nicer than ‘cat /proc/drbd’ as it displays resource names and mount information (if it’s mounted).

    Also, sndbuf-size defaults to 0, which enables the kernel to autotune this value.

  • sndbuf-size (and rcvbuf-size) should really be set to 0 on all recent kernels to take advantage of TCP buffer autotuning. Protocol C actually being faster than B is a long-standing (and rather well-known) anomaly; when in doubt, people should essentially always go with protocol C.

    As for the initial sync, please see http://www.drbd.org/users-guide-8.3/s-first-time-up.html and http://www.drbd.org/users-guide-8.3/s-initial-full-sync.html. Steffen has already summed it up to some extent, but the User’s Guide has additional information.

    In case you’re coming to LCA, you can find me either at the storage & HA miniconf or my High Availability Sprint tutorial to discuss DRBD related matters if interested.

  • etbe

    I’ll be at LCA.

  • As for the initial sync, you can also completely bypass it if both disks are blank. This is done by using the following command:


    drbdadm -- --clear-bitmap new-current-uuid r0

    Where r0 is your resource. You’ll want to run this from the node you want to be primary.