Moving a Mail Server

Nowadays it seems that most serious mail servers (IE mail servers suitable for running an ISP) use one file per message. In the old days (before about 1996) almost all Internet email was stored in Mbox format [1]. In Mbox a large number of messages are stored in a single file; most users would have a single file with all their mail, and advanced users would have multiple files for storing different categories of mail. A significant problem with Mbox is that the entire file has to be read to determine how many messages it contains, and as counting the messages is the first thing done in a POP connection this caused significant performance problems for POP servers. Even more serious problems occurred when messages were deleted, as the Mbox file then needed to be compacted.

Maildir is a mail storage method developed by Dan Bernstein based around the idea of one file per message [2]. It solves the performance problems of Mbox and also addresses some reliability issues (no file locking is needed). It was invented in 1996 and has since become widely used in Unix messaging systems.

The Cyrus IMAP server [3] uses a format similar to Maildir. The most significant difference is that the Cyrus data is regarded as being private to the Cyrus system (IE you are not supposed to mess with it) while Maildir is designed to be used by any tools that you wish (EG my Maildir-Bulletin project [4]).

One down-side to such formats that many people don’t realise (except at the worst time) is the difficulty in performing backups. As a test I used an LVM volume stored on a RAID-1 array of two 20G 7200rpm IDE disks, with 343M of data used (according to “df -h”) and 39,358 inodes in use. As there were 5,000 accounts with Maildir storage that means 25,000 directories for the home directories and Maildir directories, leaving 14,358 files. Creating a tar file of that data (written to /dev/null via dd to avoid tar’s optimisation of /dev/null) took 230.6 seconds and transferred 105MB, a transfer rate of 456KB/s. It seems that tar stores the data in a more space efficient manner than the Ext3 filesystem (105MB vs 343MB). For comparison either of the two disks can deliver 40MB/s on the inner tracks. So it seems that unless the amount of used space is less than about 1% of the total disk space it will be faster to transfer a filesystem image.
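
A rough sketch of that sort of test (the mount point /mail is just an example for wherever the mail filesystem lives):

  # time a tar of the mail filesystem, piping through dd so that tar
  # can't detect /dev/null and skip reading the files; dd also reports
  # the number of bytes that passed through
  time tar cf - /mail | dd of=/dev/null bs=1M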

If you have disks that are faster than your network (EG old IDE disks can sustain 40MB/s transfer rates on machines with 100baseT networking, and RAID arrays can easily sustain hundreds of megabytes a second on machines with gigabit Ethernet networking) then compression has the potential to improve the speed. Of course the fastest way of transferring such data is to connect the disks to the new machine. This is usually possible when using IDE disks, but the vast number of combinations of SCSI bus, disk format, and RAID controller makes it almost impossible on systems with hardware RAID.

The first test I made of compression was on a 1GHz Athlon system which could compress (via gzip -1) 100M of data in four seconds of CPU time. This means that compression has the potential to reduce the overall transfer time (the machine in question has 100baseT networking and no realistic option of adding Gig-E).

The next test I made was on a 3.2GHz Pentium-4 Xeon system, which compressed 1000M of data in 77 seconds (it didn’t have the same data as the Athlon system so the results can’t be directly compared). As 1000M would take something like 10 or 12 seconds to transfer at Gig-E speeds, compression at that rate obviously isn’t a viable option.

The gzip -1 compression did however reduce the data to 57% of its original size. The fact that it compresses so well with gzip -1 suggests to me that there might be a compression method that uses less CPU time while still getting a worthwhile amount of compression. If anyone can suggest such a compression method then I would be very interested to try it out. The goal would be a program that can compress 1G of data in significantly less than 10 seconds on a 3.2GHz P4.
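
A rough sketch of the sort of benchmark involved, for anyone who wants to try a candidate compressor (testfile here is a hypothetical 1000M sample of mail data, and any program that reads stdin and writes stdout can be substituted for gzip):

  # run it once first to get the file into the page cache, then time
  # the compression and count the output bytes to get the ratio
  time gzip -1 < testfile | wc -c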

Without compression the time taken to transfer 500G of data at Gig-E speeds will probably approach two hours (at a realistic 80 to 100MB/s, 500G works out to somewhere between 85 minutes and two hours). That is not a good amount of down-time for a service that runs 24*7, particularly given that some additional time would be spent in getting the new machine to actually use the data.

As for how to design a system that doesn’t have these problems, I’ll write a future post with some ideas.

18 comments to Moving a Mail Server

  • I have done mail server moves (on a smaller scale, though I don’t think the scale matters much) twice in the past few years and did the following:
    * get new server up & running
    * stop new mail server’s mail service
    * add MX entry for it on relevant domains
    * rsync data while old mail server is running
    * when rsync is done, typically MX entries will be live
    * stop old mail services, migrate DNS for imap etc.
    * rsync data again
    * start mail services on the new host
    * remove MX entry for old host

    Having maildir here meant, in my case, that the second rsync run took hardly more than the time needed to traverse all directories (the new rsync version 3 algorithm helps.)

    If I had been able to, I would have decreased the DNS TTL on the A entries for the imap service etc. beforehand. Even so, mail delivery downtime was kept to a minimum and imap downtime to “max(DNS TTL, second rsync run)”.
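
    A rough sketch of those two rsync passes, assuming the mail lives under /var/mail/ on both hosts and that root ssh access between them is acceptable (both assumptions, and the hostname is made up):

    # first pass while the old server is still receiving mail
    rsync -aHS --delete /var/mail/ newserver:/var/mail/
    # stop mail services on the old host, then a quick second pass
    # picks up anything that arrived during the first copy
    rsync -aHS --delete /var/mail/ newserver:/var/mail/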

  • > It seems that tar stores the data in a more space efficient manner than the Ext3 filesystem (105MB vs 343MB)

    This is nothing new. Ext3 stores data in blocks (most of the time 4KB). The downside of Maildir on Ext3 is that most mail messages fit in less than a block but still occupy a full block.
    Tar doesn’t store data in filesystem blocks; it stores just the actual contents of each archived file, which is why the archive is so much smaller.
    The difference you noticed reminds me of an old story about a DOS program that was only 30MB of raw files but took 300MB of disk space due to FAT16 clusters…
    Note that Maildir also has very big cold-cache performance problems, due to the number of files you usually have to handle in a given directory.
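
    A quick way to see how much space is being lost to block rounding (a rough sketch; the Maildir path is just an example and --apparent-size needs GNU du):

    # allocated blocks vs the sum of the actual file sizes
    du -sh /home/user/Maildir
    du -sh --apparent-size /home/user/Maildir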

  • Gabor Gombas

    For Cyrus, rsync is definitely the way to go, as described in the previous reply. My personal mail spool looks like:

    # du -sh /var/spool/cyrus
    2.6G /var/spool/cyrus
    # find /var/spool/cyrus -type f | wc -l
    88089

    I routinely back it up over a GPRS/UMTS link using rsync, and the downtime is usually just a couple of seconds.

  • Rather than piping through a quite CPU intensive “gzip -1”, I’ve had good results with “lzop”.

  • s j west

    If you move imap stores (especially in Cyrus) beware of wrong backends.

    This makes moving Cyrus IMAP a bit of a pain, and unless you are moving between identical versions it is likely that accounts will need to be transferred with an IMAP sync package (there’s one in Debian) to get from one version to another cleanly.

  • DBMail is a very interesting third alternative. My preliminary tests have been quite good, and it’s only a matter of time before I start migrating from maildir.

    Maildir has been good for me, but MySQL replication has also been good, and MySQL Proxy’s routing capabilities with Lua look promising too!

  • Take an LVM snapshot and back up the mail within that (either the entire LVM image, or the files within it); then you don’t need any downtime.
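
    A rough sketch of that approach, assuming the mail is on a logical volume /dev/vg0/mail (the names, sizes, and paths are just examples):

    # create a snapshot with space reserved for changes made while the
    # backup runs, mount it read-only and back up the files
    lvcreate --snapshot --size 2G --name mailsnap /dev/vg0/mail
    mkdir -p /mnt/mailsnap
    mount -o ro /dev/vg0/mailsnap /mnt/mailsnap
    tar cf - -C /mnt/mailsnap . | gzip -1 > /backup/mail.tar.gz
    # clean up the snapshot once the backup is done
    umount /mnt/mailsnap
    lvremove -f /dev/vg0/mailsnap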

  • The above is for backups. For migration you’ll need a bit of logic to handle new incoming messages during the move. Perhaps change the mail server config to duplicate incoming mail at the routing stage, saving to a “journal” file that you can apply on the new box, or simply send copies direct to the new mail server.
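
    One hedged way of doing the “send copies direct to the new mail server” part, assuming the old box runs Postfix and that matching accounts already exist on the new host (both assumptions, and the domain names are made up):

    # BCC each recipient to the same mailbox name on the new machine
    postconf -e 'recipient_bcc_maps = pcre:/etc/postfix/recipient_bcc.pcre'
    echo '/^(.+)@example\.com$/ $1@new-mail.example.com' > /etc/postfix/recipient_bcc.pcre
    postfix reload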

  • chithanh

    Ext3 might not be the best filesystem choice for a maildir. Reiserfs supports tail packing which stores small files much more efficiently.

    Also, you may want to look into a filesystem that has replication features, such as ZFS (zfs send).

  • Indeed, I’ve heard reiserfs really murders ext3 in performance terms.

  • You can create a compressed VPN without encryption just to transfer data from the mail server to the backup server. I use OpenVPN with LZO compression to do it, and it works very well.

    I also suggest using full and differential backups for these heavy backup jobs. I used to do it this way:
    – full backup at weekends;
    – differential backups on the other days.

    With the Maildir format you can save a lot of tapes by using differential backups, and the backups run much faster.

    I use Bacula (www.bacula.org) to do this job. Bacula can also compress the transfer on the client side.
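
    A minimal sketch of the unencrypted compressed tunnel mentioned at the start of this comment, assuming OpenVPN 2.x and that an unauthenticated point-to-point link on a trusted network is acceptable (the addresses and hostname are just examples; without a key OpenVPN warns that encryption is disabled):

    # on the backup server
    openvpn --dev tun1 --ifconfig 10.9.8.1 10.9.8.2 --comp-lzo
    # on the mail server
    openvpn --dev tun1 --ifconfig 10.9.8.2 10.9.8.1 --comp-lzo --remote backup.example.com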

  • John Allen

    I’ve been doing frequent backups of my Maildirs using the “rsync snapshot” technique (http://www.mikerubel.org/computers/rsync_snapshots/) which uses hardlinking to avoid re-copying unchanged files. It’s true that the first backup is slow but the subsequent snapshots are very fast, since the system only has to copy the new or changed (eg unread->read) files.
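
    The core of that technique is rsync’s --link-dest option; a minimal sketch with made-up paths:

    # copy today's state, hardlinking any file that is unchanged since
    # yesterday's snapshot so it takes no extra space
    rsync -a --delete --link-dest=/backup/Maildir.yesterday /home/user/Maildir/ /backup/Maildir.today/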

  • I would second the suggestion of lzop. That is a compression tool very similar to gzip, but it has been designed for very fast compression with network transfers in mind. From my testing some time ago, a good computer should be able to saturate a gigabit network with lzop compression, and its compression ratio is only marginally worse than gzip’s (up to 10%).

  • etbe

    Johannes: How big was the mail server in question? I’m looking at a server with 6,000,000 inodes per filesystem, and from tests with tar it seems that about 40 files per second can be read in their entirety while the system is operational. So a first pass of rsync would take a couple of days to complete. Just stat()ing the files gets a maximum rate of ~170/s, so once the data was initially transferred a comparison of inodes would take half a day even if no further data was transferred. Of course if the filesystem was idle it would be a lot faster, so it might be viable to do an rsync while the system is running (2 days), then subsequent rsyncs (which take progressively closer to half a day) and then take the machine down for a final rsync (maybe less than 2 hours). But if the final rsync takes more than 2 hours then it would be worse than just copying the block device.

    -b, --blocking-factor N
    use record size of Nx512 bytes (default N=20)
    gladinium: The above extract from the tar man page shows that 10K is the default block size. So it’s worth noting that for efficient large transfers of uncompressed data -b1 should be used.

    Gabor: That’s interesting, however I’m looking at well over 150x that size per filesystem (and there are a number of filesystems).

    Marek and Aigars: I tested out lzop and lzop -1. lzop took 29 seconds to compress 1000M of data and lzop -1 took the same amount of time (the difference was smaller than the precision). lzop produced a file size of 722M and lzop -1 gave 723M. In both cases lzop took significantly more time than a Gig-E transfer.

  • etbe

    s j west: What do you mean? In what situations might I have problems? Do they guarantee that moves from 32bit to 64bit and moves to higher versions will work?

    http://www.dbmail.org/

    Albert: I’m not sure how a database could be better. It seems that when using a database you can either transfer records individually (like transferring files, with the same seek issues) or transfer a copy of the database files (like transferring the contents of the block device used for a filesystem). In either case for basic operations I don’t think that there is any inherent difference – apart of course from the fact that there are a heap of tools (such as rsync) to copy files and only a small number of tools to copy databases. For reference I’ve included the URL for the dbmail project above.

    For actual use there are more benefits to the database (EG imap searches).

    Jon: LVM snapshots are great. But when you have terabytes of data on filesystems that don’t use LVM you have a problem. It’s a pity that there isn’t a way to create an LVM VG which has as a PV an existing block device and a mapping such that the data is all in the same location.

    Jon: Journalling is a nice thing in theory, but when you don’t fully understand a system (I’ve just been hired to migrate a system without prior knowledge) then it’s too risky to consider.

    http://btrfs.wiki.kernel.org/index.php/Main_Page

    chithanh: ReiserFS does have benefits, and I’ve run big mail servers on ReiserFS before. However the ideas about reliability from the ReiserFS team haven’t always impressed me. Btrfs is currently under development (see the above URL) and it supports checksums on data and metadata, space efficient packing of small files, and online filesystem checks. I’m looking forward to the day that it is released. Of course as I have a bunch of clients who use RHEL, and RHEL doesn’t even support XFS, it might be a long time before I can use it on big systems.

    Jeronimo: That is a possibility, but a VPN adds more CPU time to the mix. For a user-space VPN the data gets sent to the kernel, returned to user-space via a tunnel device for processing, and then sent back to the kernel for transmission. For kernel space, I’ve had enough pain with IPSEC already (I don’t know whether it supports compression and don’t care enough to find out). In either case if the compression is packet based then the benefit will be decreased; ideally the compression would give a smaller number of 9K jumbo packets rather than packets of 6K or whatever. If I can’t find a user-space compression program that gives a benefit then I can probably give up on the idea of doing it in a VPN.

    In regard to backups, with the system I’m looking at now, there is no possibility for any type of backup other than weekly backups. A full backup would take more than a week!

    Thanks for all the suggestions, but unless someone can find a compression program faster than lzop it looks like my only option is copying the block device via netcat.
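
    For the record, the sort of netcat transfer in question looks like this (the device, port, and block size are just examples, the filesystem must be unmounted or mounted read-only while it is copied, and some netcat variants want “nc -l 5000” instead of “-l -p 5000”):

    # on the new machine
    nc -l -p 5000 | dd of=/dev/sdb1 bs=1M
    # on the old machine
    dd if=/dev/sdb1 bs=1M | nc new-mail.example.com 5000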

    The next thing of course is to consider how things can be improved for the next time this happens.

  • I’m interested in using DBMail as a way to provide high(-er) availability of mail services, specifically with the ability to route writes to the primary database and allow reads from a local replica. In terms of performance and storage efficiency, I’m not sure what the differences are. The maildir format has worked very well for me, and I’m not about to change, though I am exploring dbmail as another alternative.

    FWIW – I’ve also tested offlineimap with a lot of success. I don’t use it regularly, but in my experiments it’s worked really well. I was going down the road of trying to use something like libpam-script to trigger an offlineimap sync, which I think could be a more reliable high-availability imap service solution.

    http://mentors.debian.net/cgi-bin/sponsor-pkglist?action=details;package=libpam-script

    Unfortunately, libpam-script is still looking for a sponsor after almost 2 years. :-(

  • etbe

    Albert: Good point about replication. I guess if I had a slave database running on the new machine then I could shut down the mail server, make the secondary database server be the primary, and then restart it.

    It’s a pity that inotify doesn’t seem to be able to monitor all writes on a filesystem; otherwise we could do this in user-space.