I’ve just watched the video of Stewart Smith’s LCA talk Eat My Data about writing applications to store data reliably and not lose it. The reason I watched it was not to learn about how to correctly write such programs, but so that I could recommend it to other people.
Recently I have had problems with a system (that I won’t name) which used fwrite() to write data to disk and then used fflush() to commit it! Below is a section from the fflush(3) man page:
NOTES
Note that fflush() only flushes the user space buffers provided by the
C library. To ensure that the data is physically stored on disk the
kernel buffers must be flushed too, e.g. with sync(2) or fsync(2).
Does no-one read the man pages for library calls that they use?
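To make the difference concrete, here is a minimal sketch of what I would expect such a program to do. The function name write_durably() is my own invention for illustration and is not taken from the system in question. It assumes a POSIX system where fileno() and fsync() are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes to path and return 0 only after
 * fsync() has succeeded; returns -1 on any error. */
int write_durably(const char *path, const void *buf, size_t len)
{
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;

    /* fwrite() may only copy the data into the C library's buffer. */
    if (fwrite(buf, 1, len, fp) != len)
        goto fail;

    /* fflush() passes the user space buffer on to the kernel. */
    if (fflush(fp) != 0)
        goto fail;

    /* fsync() asks the kernel to flush its own buffers to the disk. */
    if (fsync(fileno(fp)) != 0)
        goto fail;

    return fclose(fp) == 0 ? 0 : -1;

fail:
    fclose(fp);
    return -1;
}
```

A caller would use it as write_durably("/var/lib/example/state", data, size) and treat any non-zero return as "the data may not be on disk". The path here is only an example.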
Then recently I discovered (after losing some data) that both dpkg and rpm do not call fsync() after writing package files to disk. The vast majority of Linux systems use either dpkg or rpm to manage their packages. All those systems are vulnerable to data loss if the power fails, a cluster STONITH event occurs, or any other unexpected reboot happens shortly after a package is installed. This means that you can use the distribution defined interface for installing a package, be told that the package was successfully installed, have a crash or power failure, and then find that only some parts of the package were installed. So far I have agreement from Jeff Johnson that RPM 5 will use fsync(), no agreement from Debian people that this would be a good idea, and I have not yet reported it as a bug in SUSE and Red Hat (I’d rather get it fixed upstream first).
During his talk Stewart says sarcastically “everyone uses the same filesystem because it’s the one true way”. Unfortunately I’m getting this reaction from many people when reporting data consistency issues that arise on XFS. The fact that Ext3 by default will delay writes by up to 5 seconds for performance (which can be changed by a mount option) and that XFS will default to delaying up to 30 seconds means that some race conditions will be more likely to occur on XFS than in the default configuration of Ext3. This doesn’t mean that they won’t occur on Ext3, and certainly doesn’t mean that you can rely on such programs working on Ext3.
Ext3 does however have the data=ordered mount option (which seems to be the default configuration on Debian and on Red Hat systems). With this option metadata is committed to disk only after the data blocks that it refers to, so an operation of writing to a temporary file and then renaming it should give the desired result. Of course it’s bad luck for dpkg and rpm users who use Ext3 but chose data=writeback, as they get better performance but significantly less reliability.
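For programs that should not depend on data=ordered, the usual approach is to fsync() the temporary file before renaming it. The following is a sketch under my own assumptions, not code from dpkg or rpm: the function name replace_file() and the ".tmp" suffix are hypothetical, and the final fsync() of the directory is what Linux expects but may be rejected on other kernels.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace dir/name with new contents so that after
 * a crash the file holds either the old data or the new data.
 * Returns 0 on success, -1 on any error. */
int replace_file(const char *dir, const char *name,
                 const void *buf, size_t len)
{
    char tmp_path[4096], final_path[4096];
    int n = snprintf(tmp_path, sizeof tmp_path, "%s/%s.tmp", dir, name);
    if (n < 0 || (size_t)n >= sizeof tmp_path)
        return -1;
    n = snprintf(final_path, sizeof final_path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof final_path)
        return -1;

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop. */
    const char *p = buf;
    while (len > 0) {
        ssize_t w = write(fd, p, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        p += w;
        len -= (size_t)w;
    }

    /* Commit the data blocks before the new name can refer to them. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Commit the directory entry created by rename(). */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc == 0 ? 0 : -1;
}
```

With this ordering the rename can never become visible before the data it points to, regardless of which journalling mode the filesystem uses.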
Also we have to consider the range of filesystems that may be used. Debian supports Linux and HURD kernels as main projects and there are less supported sub-projects for the NetBSD, FreeBSD, and OpenBSD kernels as well as Solaris. Each of these kernels has different implementations of the filesystems that are in common and some have native filesystems that are not supported on Linux at all. It is not reasonable to assume that all of these filesystems have the same caching algorithms as Ext3 or that they are unlike XFS. The RPM tool is mainly known for being used on Red Hat distributions (Fedora and RHEL) and on SuSE – these distributions include support for Ext2/3, ReiserFS, and XFS as root filesystems. RPM is also used on BSD Unix and on other platforms that have different filesystems and different caching algorithms.
One objection that was made to using fsync() was the fact that cheap and nasty hard drives have write-back caches that are volatile (their contents disappear on power loss). Reliable operation is impossible with such drives, so why not just give up! Pity about the people with good hard drives that don’t do such foolishness, maybe they are expected to lose data as an expression of solidarity with people who buy the cheap and nasty hardware.
Package installation would be expected to be slower if all files are sync’d. One method of mitigating this is to write a large number of files (e.g. up to a maximum of 900) and then call fsync() on each of them in a loop. By the time the last file has been written the first file may already have been entirely committed to disk, and calling fsync() on one file may result in other files being synchronised too. Another issue is that the only time package installation speed really matters is during an initial OS install. It should not be difficult to provide an option to not call fsync() for use during the OS install (where any error would result in aborting the install anyway).
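The batching idea could look something like the sketch below. This is only an illustration of the approach and not a patch for dpkg or rpm: write_batch() and MAX_BATCH are names I made up, the file contents are passed as C strings to keep it short, and a short write is simply treated as an error. It keeps every file descriptor open until the second pass, so the batch size has to stay below the process limit on open files.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define MAX_BATCH 900 /* assumed batch limit, matching the figure above */

/* Hypothetical helper: write count files, then fsync() them all.
 * Returns 0 only if every file was written and synced. */
int write_batch(const char *const paths[], const char *const contents[],
                size_t count)
{
    int fds[MAX_BATCH];
    size_t opened = 0;
    int result = 0;

    if (count > MAX_BATCH)
        return -1;

    /* Pass 1: write everything and let the kernel start writeback. */
    for (size_t i = 0; i < count; i++) {
        int fd = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            result = -1;
            break;
        }
        fds[opened++] = fd;

        size_t len = strlen(contents[i]);
        if (write(fd, contents[i], len) != (ssize_t)len) {
            result = -1; /* error or short write */
            break;
        }
    }

    /* Pass 2: fsync() each file (skipped after a failure), then close
     * every descriptor that was opened. */
    for (size_t i = 0; i < opened; i++) {
        if (result == 0 && fsync(fds[i]) != 0)
            result = -1;
        if (close(fds[i]) != 0)
            result = -1;
    }
    return result;
}
```

The point of the two passes is that the early fsync() calls give the kernel a chance to flush the later files at the same time, so the later calls should have little left to wait for.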
Update: If you are interested in disk performance then you might want to visit the Benchmark category of my blog, my Bonnie++ storage benchmark and my Postal mail server benchmark.
Update: This is the most popular post I’ve written so far. I would appreciate some comments about what you are interested in so I can write more posts that get such interest. Also please see the Future Posts page for any other general suggestions.