Committing Data to Disk

I’ve just watched the video of Stewart Smith’s LCA talk Eat My Data about writing applications to store data reliably and not lose it. The reason I watched it was not to learn about how to correctly write such programs, but so that I could recommend it to other people.

Recently I have had problems with a system (that I won’t name) which used fwrite() to write data to disk and then used fflush() to commit it! Below is a section from the fflush(3) man page:

       Note that fflush() only flushes the user space buffers provided by  the
       C  library.   To  ensure that the data is physically stored on disk the
       kernel buffers must be flushed too, e.g. with sync(2) or fsync(2).

Does no-one read the man pages for library calls that they use?
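
As a minimal sketch of what the man page is saying, the following C function writes a buffer with fwrite(), flushes the C library buffer with fflush(), and then calls fsync() on the underlying file descriptor. The name write_durably() and its error handling are illustrative assumptions, not code from the system mentioned above.

```c
#define _POSIX_C_SOURCE 200809L  /* for fileno() and fsync() */
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write buf to path and return 0 only after the
 * kernel reports the data as committed to storage; -1 on any failure. */
int write_durably(const char *path, const void *buf, size_t len)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    int ok = fwrite(buf, 1, len, fp) == len  /* copy into the stdio buffer */
          && fflush(fp) == 0                 /* user space -> kernel buffers */
          && fsync(fileno(fp)) == 0;         /* kernel buffers -> disk */

    /* fclose() can also report a write error, so check it too. */
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Stopping after fflush() leaves the data in kernel buffers only, which is the situation the man page warns about.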

Then recently I discovered (after losing some data) that both dpkg and rpm do not call fsync() after writing package files to disk. The vast majority of Linux systems use either dpkg or rpm to manage their packages. All those systems are vulnerable to data loss if the power fails, a cluster STONITH event occurs, or any other unexpected reboot happens shortly after a package is installed. This means that you can use the distribution-defined interface for installing a package, be told that the package was successfully installed, have a crash or power failure, and then find that only some parts of the package were installed. So far I have agreement from Jeff Johnson that RPM 5 will use fsync(), no agreement from Debian people that this would be a good idea, and I have not yet reported it as a bug in SUSE and Red Hat (I’d rather get it fixed upstream first).

During his talk Stewart says sarcastically “everyone uses the same filesystem because it’s the one true way”. Unfortunately I’m getting this reaction from many people when reporting data consistency issues that arise on XFS. By default Ext3 delays writes by up to 5 seconds for performance (which can be changed by a mount option) while XFS delays them by up to 30 seconds, so some race conditions are more likely to occur on XFS than in the default configuration of Ext3. This doesn’t mean that they won’t occur on Ext3, and it certainly doesn’t mean that you can rely on such programs working on Ext3.

Ext3 does however have the data=ordered mount option (which seems to be the default configuration on Debian and on Red Hat systems). With this option meta-data is committed to disk after the data blocks that it refers to, so an operation of writing to a temporary file and then renaming it should give the desired result. Of course it’s bad luck for dpkg and rpm users who use Ext3 but chose data=writeback, as they get better performance but significantly less reliability.
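
For filesystems that do not give the data=ordered guarantee, the temporary-file-and-rename pattern can be made explicit. The following is a rough sketch, not the code of dpkg or rpm: replace_file() and write_all() are hypothetical names, and the caller is assumed to pass a temporary path in the same directory as the final file. The final fsync() on the directory is an addition that the fsync(2) man page describes as necessary for the directory entry itself to reach disk.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Keep calling write() until all of buf has been handed to the kernel. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: replace final_path with new contents so that after
 * a crash the file holds either the old or the new data. Returns 0 or -1. */
int replace_file(const char *dir, const char *tmp_path,
                 const char *final_path, const void *buf, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* The data must be on disk before the rename makes it visible. */
    if (write_all(fd, buf, len) != 0 || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Sync the directory so that the rename itself survives a crash. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc == 0 ? 0 : -1;
}
```

A call might look like replace_file("/etc", "/etc/example.conf.tmp", "/etc/example.conf", data, len), where the paths are made up for illustration.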

Also we have to consider the range of filesystems that may be used. Debian supports Linux and HURD kernels as main projects and there are less-supported sub-projects for the NetBSD, FreeBSD, and OpenBSD kernels as well as Solaris. Each of these kernels has different implementations of the filesystems that are in common and some have native filesystems that are not supported on Linux at all. It is not reasonable to assume that all of these filesystems have the same caching algorithms as Ext3 or that they are unlike XFS. The RPM tool is mainly known for being used on Red Hat distributions (Fedora and RHEL) and on SUSE. These distributions include support for Ext2/3, ReiserFS, and XFS as root filesystems. RPM is also used on BSD Unix and on other platforms that have different filesystems and different caching algorithms.

One objection that was made to using fsync() was that cheap and nasty hard drives have volatile write-back caches (their contents disappear on power loss). Reliable operation is impossible with such drives, so why not just give up! Pity about the people with good hard drives that don’t do such foolishness; maybe they are expected to lose data as an expression of solidarity with people who buy the cheap and nasty hardware.

Package installation would be expected to be slower if all files are sync’d. One method of mitigating this is to write a large number of files (e.g. up to a maximum of 900) and then call fsync() on each of them in a loop. After the last file has been written the first file may have been entirely committed to disk, and calling fsync() on one file may result in other files being synchronised too. Another issue is that the only time package installation speed really matters is during an initial OS install. It should not be difficult to provide an option to not call fsync() for use during the OS install (where any error would result in aborting the install anyway).
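
The batching idea could look roughly like the sketch below. It is not taken from dpkg or rpm: write_batch(), write_all(), MAX_BATCH, and the do_sync flag (standing in for the "skip fsync() during OS install" option) are all hypothetical names, and how much time the batching saves will depend on the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_BATCH 900  /* upper bound suggested in the text */

/* Keep calling write() until all of buf has been handed to the kernel. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: write contents[i] (a C string) to paths[i] for the
 * whole batch, then sync every file. Returns 0 if every step succeeded. */
int write_batch(const char *const paths[], const char *const contents[],
                size_t count, int do_sync)
{
    int fds[MAX_BATCH];
    size_t opened = 0;
    int rc = 0;

    if (count > MAX_BATCH)
        return -1;  /* the caller is expected to split larger sets */

    /* Pass 1: write everything, leaving the data in the kernel's cache. */
    for (size_t i = 0; i < count && rc == 0; i++) {
        fds[i] = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[i] < 0) {
            rc = -1;
            break;
        }
        opened++;
        if (write_all(fds[i], contents[i], strlen(contents[i])) != 0)
            rc = -1;
    }

    /* Pass 2: sync each file (unless disabled or already failed), then
     * close it. Early files may already be on disk by this point. */
    for (size_t i = 0; i < opened; i++) {
        if (rc == 0 && do_sync && fsync(fds[i]) != 0)
            rc = -1;
        if (close(fds[i]) != 0)
            rc = -1;
    }
    return rc;
}
```

Passing do_sync as 0 gives the fast path for an initial OS install; any other caller would pass 1 and treat a non-zero return as a failed installation.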

Update: If you are interested in disk performance then you might want to visit the Benchmark category of my blog, my Bonnie++ storage benchmark and my Postal mail server benchmark.

Update: This is the most popular post I’ve written so far. I would appreciate some comments about what you are interested in so I can write more posts that get such interest. Also please see the Future Posts page for any other general suggestions.

10 comments to Committing Data to Disk

  • On the other hand, an fsync causes the hard drive to spin up if it is spun down. No problem on a server with continuously spinning drives, but on my notebook on battery, I like my hard drive to be spun down as long as possible. It is applications like evolution that use fsync too much (e.g. for IMAP cache files) that prevent my hard drive from really spinning down while I’m just browsing and doing e-mails.

    So when convincing people to use fsync, please only convince them if it is really critical.

  • Jon

    RPM 5 is not upstream for Red Hat: it’s a fork (or Red Hat forked, depending on which side of the line you are standing).

  • etbe

    fsync() will make the hard drive spin-up. That’s OK if it means that the system operates correctly. However if data is cached from another source then it has less need for reliability and could in fact be stored only in RAM.

    But there is no reason for not synchronising important operations such as OS package install.

  • chithanh

    Even cheap SATA drives support write barriers nowadays. To avoid losing data on power outage, use a UPS.

  • etbe

    A UPS doesn’t help the case of a cluster STONITH event (the cause of data loss in my situation). It also doesn’t solve the many cases of user error that can cause power outages (I’ve had a couple of networks go down because one UPS was tested while the other was apparently not working).

    Write barriers are a nice feature when they work well. Unfortunately there’s a lot of RAID hardware that doesn’t support them – or even worse claims to support them but actually doesn’t.

  • chithanh

    Yeah, more often than not, that hardware RAID snafu is best left with write cache disabled and/or in JBOD mode.

    Concerning the cluster failure, I don’t think this is common enough to warrant a performance downgrade for everyone. If you know you are running a sensitive setup, you can always issue a sync command after package installation.

  • etbe

    It surprises me that you oppose calling fsync() because of a “performance downgrade” even though installing new packages is not THAT common an operation while you also suggest disabling the write-back cache on RAID hardware which significantly decreases performance for all common operations.

    The performance benefits of write-back cache in a hardware RAID are massive, while the performance issues of using fsync() or not are not very significant in most cases.

  • chithanh

    It seems there is a misunderstanding here :-)

    Putting an fsync() command in apt-get will affect all users of the software. Slightly modified quote from above: “maybe they are expected to lose performance as an expression of solidarity with people who buy the expensive yet nasty hardware”.

    Disabling write back cache in broken RAID controllers only affects users dumb enough to buy such hardware.

  • etbe

    Putting an fsync() in dpkg (apt-get won’t work) will give correct behaviour in all cases. The only other way of giving such correct behaviour is to mount all filesystems with the “sync” option. A “sync” mount of a filesystem will probably hurt performance more than disabling a RAID cache.

    Incidentally most users won’t know whether their RAID cache works properly. Even OS vendors often don’t know the implementation details of the RAID devices that they support.

  • chithanh, are you saying that fsync() is only necessary if you have enabled your faulty write-back cache? You have that backwards – with broken hardware, fsync() and mount options are totally useless – they can guarantee nothing. This discussion is about the proper way to get correct behavior for those of us with hardware that works.

    I’ll repeat that for emphasis: with broken hardware, correct behavior is hopeless. fsync() or fancy mount options are necessary to guarantee correct behavior on working hardware. I prefer fsync(). Since they don’t use it, dpkg and rpm are broken, especially considering that they’re developed for distributions which do not use the fancy mount options by default.

    etbe, there are test programs to determine if your cache is working properly. You’re right that most people don’t know, but they could and should do a fairly reliable and easy test.