I’ve got some performance problems with a mail server that uses DRBD, so I’ve done some benchmark tests to try to improve things. I used Postal to test delivery to an LMTP server. The version of Postal I released a few days ago had a bug that made LMTP not work; I’ll release a new version to fix that next time I work on Postal – or when someone sends me a request for LMTP support (so far no-one has asked, so I presume that most users don’t mind that it’s not yet working).
The local spool on my test server is managed by Dovecot: the Dovecot delivery agent stores the mail and the Dovecot POP and IMAP servers provide user access. For delivery I’m using the LMTP server I wrote, which has been almost ready for GPL release for a couple of years. All I need to write is a command-line parser to support delivery options for different local delivery agents; currently my LMTP server is hard-coded to run /usr/lib/dovecot/deliver and has its parameters hard-coded too. As an aside, if someone would like to contribute some GPL C/C++ code to convert a string like “/usr/lib/dovecot/deliver -e -f %from% -d %to% -n” into something that will populate an argv array for execvp() then that would be really appreciated.
Authentication is done against a MySQL server running on a fast P4 system. The MySQL server never came close to its CPU or disk IO capacity, so using a different authentication system probably wouldn’t have changed the results. I used MySQL because it’s what I’m using in production. Apart from my LMTP server and the new version of Postal, all software involved in the testing is from Debian/Squeeze.
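For context, Dovecot authenticates against MySQL via a query in dovecot-sql.conf; a typical sketch looks like the following (the table and column names, database name, and credentials here are placeholder assumptions, not my production config):

```
driver = mysql
connect = host=localhost dbname=mail user=dovecot password=secret
default_pass_scheme = PLAIN
password_query = SELECT username, password FROM users WHERE username = '%u'
```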
All tests were done on a 20G IDE disk. I started testing with a Pentium-4 1.5GHz system with 768M of RAM, but then moved to a Pentium-4 2.8GHz system with 1G of RAM when I found CPU time to be a bottleneck with barrier=0. All test results are the average number of messages delivered per minute over a 19 minute test run, with the first minute’s results discarded. The delivery process used 12 threads to deliver mail.
|Test|1.5GHz P4 (msgs/min)|2.8GHz P4 (msgs/min)|
|---|---|---|
|Ext4 on DRBD no secondary|1810|2409|
When doing the above tests the 1.5GHz system was using 100% CPU time when the filesystem was mounted with barrier=0; about half of that was system time (although I didn’t take detailed notes at the time). So the testing on the 1.5GHz system showed that increasing the Ext4 max_batch_time number doesn’t give a benefit for a single disk, that mounting with barrier=0 gives a significant performance benefit, and that using DRBD in disconnected mode gives a good performance benefit through forcing barrier=0. As an aside, I wonder why they didn’t support barriers on DRBD given all the other features they have for preserving data integrity.
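For reference, the mount options in question go in /etc/fstab; a sketch of an entry like the ones I tested (the device, mount point, and max_batch_time value here are examples, not my exact test configuration) would be:

```
# ext4 with write barriers disabled and a larger batch time (microseconds)
/dev/sda4  /var/spool/mail  ext4  barrier=0,max_batch_time=30000  0  2
```

Note that barrier=0 is only safe when something else (a UPS, or a battery-backed write cache) protects against losing the drive’s write cache on power failure.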
The tests with the 2.8GHz system demonstrate the performance benefit of having adequate CPU power. As an aside, I hope that Ext4 is optimised for multi-core CPUs, because if a 20G IDE disk needs a 2.8GHz P4 then modern RAID arrays probably require more CPU power than a single core can provide.
It’s also interesting to note that a degraded DRBD device (where the secondary has never been enabled) only gives 84% of the performance of /dev/sda4 when mounted with barrier=0.
|Test|Messages per minute|
|---|---|
|Ext4 on DRBD no secondary|2409|
|Ext4 on DRBD connected C|1575|
|Ext4 on DRBD connected B|1428|
|Ext4 on DRBD connected A|1284|
Of all the batch time options I tried, every change seemed to decrease performance slightly, but as the greatest decrease was only slightly more than 2% it doesn’t matter much.
One thing that really surprised me was the test results from the different replication protocols. The DRBD replication protocols are documented here. Protocol C is fully synchronous: a write request doesn’t complete until the remote node has the data on disk. Protocol B is memory synchronous: the write is complete when the data is on a local disk and in RAM on the other node. Protocol A is fully asynchronous: a write is complete when the data is on a local disk. I had expected protocol A to give the best performance, as it has lower latency for critical write operations, and protocol C to be the worst. The results were the opposite: protocol C was the fastest and protocol A the slowest. My theory is that DRBD has a performance bug in the protocols that the developers don’t recommend.
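For anyone who wants to repeat these tests, the protocol is selected in the DRBD resource configuration; a minimal sketch in the DRBD 8.3 syntax shipped with Squeeze (the resource name, host names, addresses, and devices here are placeholder assumptions) looks like:

```
resource r0 {
    protocol C;                  # A, B, or C - C is fully synchronous
    on alpha {
        device    /dev/drbd0;
        disk      /dev/sda4;
        address   10.0.0.1:7788;
        meta-disk internal;
    }
    on bravo {
        device    /dev/drbd0;
        disk      /dev/sda4;
        address   10.0.0.2:7788;
        meta-disk internal;
    }
}
```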
One other thing I can’t explain is that, according to iostat, the data partition on the secondary DRBD node had almost 1% more sectors written than the primary, and the number of writes was more than 1% greater on the secondary. I had hoped that with protocol A the writes would be combined on the secondary node to give a lower disk IO load.
I filed Debian bug report #654206 about the kernel not exposing the correct value for max_batch_time. The fact that no-one else has reported that bug (which is in kernels from at least 2.6.32 to 3.1.0) suggests that not many people have found the option useful.
When using DRBD, use protocol C, as it gives both better integrity and better performance.
Significant CPU power is apparently required to drive modern filesystems at full speed. The fact that a Maxtor 20G 7200rpm IDE disk can’t be driven at full speed by a 1.5GHz P4 was a surprise to me.
DRBD significantly reduces performance when compared to a plain disk mounted with barrier=0 (the fair comparison, as DRBD forces barrier=0 anyway). The best that DRBD could do in my tests was 55% of native performance when connected and 84% of native performance when disconnected.
When comparing a cluster of cheap machines running DRBD on RAID-1 arrays to a single system running RAID-6 with redundant PSUs etc., the performance loss from DRBD is a serious problem that can push the economic benefit back towards the single system.
Next I will benchmark DRBD on RAID-1 and test the performance hit of using bitmaps with Linux software RAID-1.
If anyone knows how to make a HTML table look good then please let me know. It seems that the new blog theme that I’m using prevents borders.
I have mentioned my Debian bug report about the mount option above, and all of this testing was done on Debian/Squeeze.