DRBD Benchmarking


I’ve got some performance problems with a mail server that’s using DRBD, so I’ve done some benchmark tests to try to improve things. I used Postal to test delivery to an LMTP server [1]. The version of Postal I released a few days ago had a bug that made LMTP not work; I’ll release a new version to fix that the next time I work on Postal, or when someone sends me a request for LMTP support (so far no-one has asked, so I presume that most users don’t mind that it’s not yet working).

The local spool on my test server is managed by Dovecot: the Dovecot delivery agent stores the mail, and the Dovecot POP and IMAP servers provide user access. For delivery I’m using the LMTP server I wrote, which has been almost ready for GPL release for a couple of years. All I need to write is a command-line parser to support delivery options for different local delivery agents. Currently my LMTP server is hard-coded to run /usr/lib/dovecot/deliver and has its parameters hard-coded too. As an aside, if someone would like to contribute some GPL C/C++ code to convert a string like “/usr/lib/dovecot/deliver -e -f %from% -d %to% -n” into something that will populate an argv array for execvp() then that would be really appreciated.

Authentication is to a MySQL server running on a fast P4 system. The MySQL server never came close to its CPU or disk IO capacity, so using a different authentication system probably wouldn’t have changed the results. I used MySQL because it’s what I’m using in production. Apart from my LMTP server and the new version of Postal, all software involved in the testing is from Debian/Squeeze.

The Tests

All tests were done on a 20G IDE disk. I started testing with a Pentium-4 1.5GHz system with 768M of RAM but moved to a Pentium-4 2.8GHz system with 1G of RAM when I found CPU time to be a bottleneck with barrier=0. All test results are the average number of messages delivered per minute over a 19 minute test run, with the first minute’s results discarded. The delivery process used 12 threads to deliver mail.

                            P4-1.5  P4-2.8
Default Ext4                  1468    1663
Ext4 max_batch_time=30000     1385    1656
Ext4 barrier=0                1997    2875
Ext4 on DRBD no secondary     1810    2409

When doing the above tests the 1.5GHz system was using 100% CPU time when the filesystem was mounted with barrier=0, about half of it system time (although I didn’t take notes at the time). So the testing on the 1.5GHz system showed that increasing the Ext4 max_batch_time value doesn’t give a benefit for a single disk, that mounting with barrier=0 gives a significant performance benefit, and that using DRBD in disconnected mode gives a good performance benefit by forcing barrier=0 behaviour. As an aside I wonder why they didn’t support barriers on DRBD given all the other features they have for preserving data integrity.
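For reference, this is how barriers are disabled on an Ext4 filesystem (the mount point here is hypothetical); note that barrier=0 means a power failure can lose recently written data:

```shell
# one-off remount without write barriers
mount -o remount,barrier=0 /dev/sda4 /mail

# or persistently via /etc/fstab:
# /dev/sda4  /mail  ext4  defaults,barrier=0  0  2
```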

The tests with the 2.8GHz system demonstrate the performance benefit of having adequate CPU power. As an aside, I hope that Ext4 is optimised for multi-core CPUs, because if a 20G IDE disk needs a 2.8GHz P4 then modern RAID arrays probably require more CPU power than a single core can provide.

It’s also interesting to note that a degraded DRBD device (where the secondary has never been enabled) only gives 84% of the performance of /dev/sda4 when mounted with barrier=0.

                                                P4-2.8
Default Ext4                                      1663
Ext4 max_batch_time=30000                         1656
Ext4 min_batch_time=15000,max_batch_time=30000    1626
Ext4 max_batch_time=0                             1625
Ext4 barrier=0                                    2875
Ext4 on DRBD no secondary                         2409
Ext4 on DRBD connected C                          1575
Ext4 on DRBD connected B                          1428
Ext4 on DRBD connected A                          1284

Of all the batch time options I tried, every change slightly decreased performance, but as the greatest decrease was only a little over 2% it doesn’t matter much.

One thing that really surprised me was the test results from the different replication protocols. The DRBD replication protocols are documented here [2]. Protocol C is fully synchronous: a write request doesn’t complete until the remote node has it on disk. Protocol B is memory synchronous: the write is complete when it’s on the local disk and in RAM on the other node. Protocol A is fully asynchronous: a write is complete when it’s on the local disk. I had expected protocol A to give the best performance as it has lower latency for critical write operations, and protocol C to be the worst. My theory is that DRBD has a performance bug in the protocols that the developers don’t recommend.
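For reference, the replication protocol is set per resource in /etc/drbd.conf; this is a sketch in DRBD 8.3 syntax (the version in Squeeze), with made-up host names and addresses:

```
resource r0 {
  protocol C;                  # fully synchronous replication
  on alpha {
    device    /dev/drbd0;
    disk      /dev/sda4;
    address   192.168.0.1:7788;
    meta-disk internal;
  }
  on beta {
    device    /dev/drbd0;
    disk      /dev/sda4;
    address   192.168.0.2:7788;
    meta-disk internal;
  }
}
```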

One other thing I can’t explain is that according to iostat the data partition on the secondary DRBD node had almost 1% more sectors written than the primary and the number of writes was more than 1% greater on the secondary. I had hoped that with protocol A the writes would be combined on the secondary node to give a lower disk IO load.

I filed Debian bug report #654206 about the kernel not exposing the correct value for max_batch_time. The fact that no-one else has reported that bug (which is in kernels from at least 2.6.32 to 3.1.0) is an indication that not many people have found it useful.


Conclusions

When using DRBD, use protocol C as it gives both better integrity and better performance.

Significant CPU power is apparently required for modern filesystems. The fact that a Maxtor 20G 7200rpm IDE disk [3] can’t be driven at full speed by a 1.5GHz P4 was a surprise to me.

DRBD significantly reduces performance when compared to a plain disk mounted with barrier=0 (the fair comparison, as DRBD doesn’t support barriers). The best that DRBD could do in my tests was 55% of native performance when connected and 84% when disconnected.

When comparing a cluster of cheap machines running DRBD on RAID-1 arrays to a single system running RAID-6 with redundant PSUs etc., the performance loss from DRBD is a serious problem that can push the economic benefit back towards the single system.

Next I will benchmark DRBD on RAID-1 and test the performance hit of using bitmaps with Linux software RAID-1.

If anyone knows how to make a HTML table look good then please let me know. It seems that the new blog theme that I’m using prevents borders.



7 thoughts on “DRBD Benchmarking”

  1. Glenn says:

    May I assume that the DRBD replication was between localhost partitions, or was it across a network of some form?

    Looking forward to the RAID results. Thanks for sharing.

  2. etbe says:

    As far as I am aware DRBD doesn’t work on localhost.

    It was over a 100baseT network and there was a single Ethernet cable used for SMTP and DRBD. I doubt that the speed of the network made a great impact on performance given that the average throughput was about 4MB/s, but I will do further tests in this regard.

  3. Mario says:

    Since there is no info on how the DRBD device is configured, did you take any of the DRBD recommendations into account:


  4. Florian Haas says:

    DRBD configuration? Network configuration details? Anything?

    Did you bother to do the benchmarks the DRBD User’s Guide suggests for checking your DRBD device’s throughput and latency?

    Properly tuned DRBD introduces a throughput penalty of <10% versus standalone disk, with 5% being typical. If you're experiencing a 45% performance drop, you're most likely operating on a severely flawed configuration. Actually, running on 100Mbit Ethernet constitutes such a flaw in itself. Max theoretical throughput over 100Mbps is approx 12 MB/s, and even an age-old disk typically is capable of pulling about 30-40 MB/s. So by running DRBD over a 100Mbps link you'd effectively be limiting your replication bandwidth (and hence, with protocol C, your effective device bandwidth) to about a quarter of what your standalone disk can do.

  5. Glenn says:

    Ah, fair enough, so many people use virtualisation to test (ie two VMs on localhost) I forget to be clearer. So in this case, physical network, cool :). And I do see it’s the relative results that are interesting.

  6. etbe says:

    Florian: It was a fairly default Debian configuration.

    In my tests a degraded DRBD device (IE the secondary didn’t exist) gave more than a 10% penalty when compared to a standalone disk that was mounted with barrier=0. How can I tune the performance of a degraded DRBD array?

    An old disk can do 20+ MB/s for contiguous IO, for random seeks (such as delivering maildir mail with an average message size of 70K) the throughput is going to be a lot less.

  7. Florian Haas says:

    How can I tune the performance of a degraded DRBD array?

    al-extents, primarily. And also the no-disk-* and no-md-* options, but considering you’re doing this on an old IDE disk, those would be unsafe to use. Finally, I dare say that that old IDE disk is probably operating with its (volatile) disk cache enabled, so your disk would be lying to the rest of your system about its write bandwidth. DRBD isn’t as easily fooled; it will typically use blkdev_issue_flush after writing metadata updates, which will further push down the perceived performance.
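    These tunables live in the syncer and disk sections of drbd.conf; a sketch in 8.3 syntax with illustrative values only (as noted above, the flush-disabling options are unsafe with volatile write caches):

```
resource r0 {
  syncer {
    al-extents 1021;    # larger activity log, fewer metadata updates
  }
  disk {
    no-disk-flushes;    # unsafe unless the disk cache is non-volatile
    no-md-flushes;
  }
}
```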

Comments are closed.