ECC RAM is more useful than RAID

A common myth in the computer industry is that ECC (Error Correcting Code – a Hamming Code [0]) RAM is only a server feature.

The difference between a server and a desktop machine (in terms of utility) is that a server performs tasks for many people while a desktop machine only performs tasks for one person. Therefore when purchasing a desktop machine you can decide how much you are willing to spend for the safety and continuity of your work. For a server it’s more difficult as everyone has a different idea of how reliable a server should be in terms of uptime and in terms of data security. When running a server for a business there is the additional issue of customer confidence. If a server goes down occasionally customers start wondering what else might be wrong and considering whether they should trust their credit card details to the online ordering system.

So it is apparent that servers need a greater degree of reliability – and it’s easy to justify spending the money.

Desktop machines also need reliability, more so than most people expect. In a business when a desktop machine crashes it wastes employee time. If a crash wastes an hour (which is not unlikely given that previously saved work may need to be re-checked) then it can easily cost the business $100 (the value of the other work that the employee might have done). Two such crashes per week could cost the business as much as $8000 per year. The price difference between a typical desktop machine and a low-end workstation (or deskside server) is considerably less than that (when I investigated the prices almost a year ago desktop machines with server features ranged in price from $800 to $2400 [1]).

Some machines in a home environment need significant reliability. For example, when students are completing high school their assignments have a lot of time invested in them. Losing an assignment to a computer problem shortly before it’s due in could affect their chances of getting a place in the university course that they most desire! Then there is also data which is irreplaceable; one example I heard of was a woman whose computer had a factory pre-load of Windows – during a storm the machine rebooted and reinstalled itself to the factory defaults, wiping several years of baby photos… In both cases better backups would mostly solve the problem.

For business use the common scenario is to have file servers storing all data and very little data stored on the PC (ideally none). In this case a disk error would not lose any data (unless the swap space was corrupted and something important was paged out when the disk failed). For home use the backup requirements are quite small. If a student is working on an important assignment then they can back it up to removable media whenever they reach a milestone. Probably the best protection against disk errors destroying assignments would be a bulk purchase of USB flash storage sticks.

Disk errors are usually easy to detect. Most errors are in the form of data which cannot be read back; when that happens the OS gives the user an error message explaining what happened. Then if you have good backups you revert to them and hope that you didn’t lose too much work in the meantime (you also hope that your backups are actually readable – but that’s another issue). The less common errors are lost writes – the OS writes data to disk but the disk doesn’t store it. This is more difficult to discover because the drive will return bad data (maybe an old version of the file data or maybe data from a different file) and claim it to be good.

The general idea nowadays is that a filesystem should check the consistency of the data it returns. Two new filesystems, ZFS from Sun [2] and BTRFS from Oracle [3], implement checksums of the data stored on disk. ZFS is apparently production ready while BTRFS apparently is not nearly ready. I expect that from now on whenever anyone designs a filesystem for anything but the smallest machines (EG PDAs and phones) they will include data integrity mechanisms in the design.

I believe that once such features become commonly used the need for RAID on low-end systems will dramatically decrease. A combination of good backups and knowing when your live data is corrupted will often be a good substitute for preserving the integrity of the live data. Not that RAID will necessarily protect your data – with most RAID configurations if a hard disk returns bad data and claims it to be good (the case of lost writes) then the system will not read data from other disks for checksum validation and the bad data will be accepted.
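To illustrate the lost-write case, here is a toy model in Python – not real RAID code, just a sketch: two mirrored copies of a block where one disk silently kept stale data. A plain mirror read accepts whichever copy it happens to read, while a checksumming layer in the style of ZFS/BTRFS can reject the stale copy and use the good one.

```python
import hashlib

# Toy model of a two-disk mirror where disk 0 suffered a lost write
# and still holds stale data; the drive reports it as good.
good = b"current version of the data"
stale = b"old version of the data"
mirror = [stale, good]

def raid1_read(disks):
    # A plain mirror normally reads from one disk and has no way to
    # tell that the copy is stale: the bad data is accepted silently.
    return disks[0]

def checksummed_read(disks, expected_digest):
    # A filesystem that stores checksums of its data can reject a copy
    # that fails validation and fall back to the other side of the mirror.
    for block in disks:
        if hashlib.sha256(block).hexdigest() == expected_digest:
            return block
    raise IOError("no copy matches the stored checksum")

expected = hashlib.sha256(good).hexdigest()
print(raid1_read(mirror) == good)                  # stale data goes undetected
print(checksummed_read(mirror, expected) == good)  # stale copy is skipped
```

This is of course a simplification – real checksumming filesystems store the checksum separately from the data block so that a lost write can’t corrupt both together.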

It’s easy to compute checksums of important files and verify them later. One simple way of doing so is to compress the files, every file compression program that I’ve seen has some degree of error detection.
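As a sketch of this approach (assuming Python is available – the file names and directory here are purely illustrative), the following records SHA-256 checksums for files at backup time and verifies them later:

```python
import hashlib
import os
import tempfile

def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative setup: a directory with one important file.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "assignment.txt")
with open(path, "w") as f:
    f.write("important essay text\n")

# At backup time, record a checksum for each file...
manifest = {path: file_sha256(path)}

# ...and at any later time, verify the files against the manifest.
for p, digest in manifest.items():
    if file_sha256(p) != digest:
        print(p, "is corrupted")
    else:
        print(p, "verified OK")
```

The same effect can be had from the command line with tools like sha256sum, or simply by running the test/verify option of whatever compression program was used for the backup.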

Now the real problem with RAM which lacks ECC is that it can lose data without the user knowing. There is no possibility of software checks, because any software which checks for data integrity could itself be misled by memory errors. I once had a machine which experienced occasional filesystem corruption; eventually I discovered that it had a memory error (memtest86+ reported a problem). I will never know whether some data was corrupted on disk because of this. Sifting through a large amount of stored data for files which may have been corrupted by memory errors is almost impossible – especially when the machine in question operated unreliably for a period of weeks.

Checking the integrity of file data by using the verify option of a file compression utility, fsck on a filesystem that stores checksums on data, or any of the other methods is not difficult.

I have a lot of important data on machines that don’t have ECC. One reason is that machines which have ECC cost more and have other trade-offs (more expensive parts, more noise, more electricity use, and the small supply makes it difficult to get good deals). Another is that there appear to be no laptops which support ECC (I use a laptop for most of my work). On the other hand RAID is very cheap and simple to implement, just buy a second hard disk and install software RAID – I think that all modern OSs support RAID as a standard installation option. So in spite of the fact that RAID does less good than a combination of ECC RAM and good backups (which are necessary even if you have RAID), it’s going to remain more popular in high-end desktop systems for a long time.

An interesting development is the large portion of the PC market designed without space for more than one hard disk. Such compact machines (known as Small Form Factor or SFF) could easily be designed to support ECC RAM. Hopefully the PC companies will add reliability features in one area (ECC RAM) while removing them in another (space for a second disk).

7 comments to ECC RAM is more useful than RAID

  • Carsten Aulbert

    Hi Russel,

    good article (since we lost quite a bit of data on server hardware with faulty memory – even though it was ECC but many bit errors can simply not be corrected).

    One thing you seem to miss is the path in-between. Usually, people at home still use the dated PCI bus which does not do any error correction at all and usually only CRC32(?) checks which is *just* better than nothing. Even PCI-X does not give you anything except V3.0 which is rarely found anywhere. PCIe is now becoming better in that respect, but in principle you can have the perfect ECC memory and the perfect file system on a perfect RAID controller, but still there is rubbish on the disk in the end.

  • Timo Lindfors

    Very thoughtful article. While debugging mysterious data corruption on an nfs server I put together a loop device that does crc checks, in case you are interested it is available at

  • Jon

One disadvantage of compressing files for checksum verification is that if you have a corrupted text file (for example), often you can rescue part of the file (or the corruption is evident and correctable by a human). A corrupted compressed file might not be decompressible at all, depending on the algorithm and the tools used, even if only a small fraction of the bits have been corrupted.

  • Ron

    Ha ha a womans PC magically re-installed an operating system. Not even windows is that bad.

  • etbe

    Carsten: There are some potential failure conditions that can defeat ECC RAM. One is if a write is entirely lost (the previous data would be there with ECC intact). I have no idea what the probability of this might be. Also errors with more than one bit can be unrecoverable – but this is not a failure of ECC as it’s better to lose data entirely (EG a SEGV of an application) than to get silent corruption.

    According to the above Wikipedia page PCIe only uses a CRC. As for filesystem corruption, if you use ZFS or BTRFS then the filesystem will checksum all data. It seems likely that the combination of CRC on PCIe transfers and whatever checksum ZFS and BTRFS use will be less likely to break than a single CRC.

    Timo: Interesting, thanks for the link.

    Jon: Good point. However if you have copies of a compressed file on two different media then it seems likely that the chance of getting good data at the end is better than a single uncompressed copy.

    Ron: It’s not just windows but the Windows pre-load system. Many machines used to be sold (not sure if they still are) with an option to recover the OS install of a pre-load. The idea is that if you messed up your machine you could go to the BIOS setup and ask the machine to reinstall Windows – among other things that saves you paying another license fee to MS.

  • MValdez

    Hi. Good article. I have just read elsewhere that Mac OS X Server will use ZFS as its file system; maybe that will push other vendors to also provide a checksummed/self-healing file system. (Too bad about its license though – we would already have ZFS in Linux.)

    I usually urge my clients to install ECC memory on every server and workstation. ECC memory is not really that expensive and prices keep dropping. As for motherboards, I usually get ASUS motherboards because of their support for ECC RAM (highly customizable in the BIOS), number of RAM slots and Linux compatibility. (I don’t know, however, if their laptops can use ECC memory; at least the EEE PC doesn’t.)



  • etbe

    MValdez: I am not aware of there ever being a mass-market laptop which supported ECC RAM. I would not be surprised if there was some outrageously expensive mil-spec laptop with ECC RAM, but I would be utterly shocked if there was a laptop I could reasonably afford which has it.

    BTRFS is a good thing and we can only expect other filesystems to add similar features. It would not surprise me if checksums in filesystems became a standard feature in as little as 10 years. :-#

    ECC memory is not that expensive. Computers supporting it often are. If I was prepared to buy white-box machines then I would get one of those ASUS ones to which you refer. But as I prefer name-brand machines that means Dell PowerEdge and HP XW machines – both of which are a little expensive (the PowerEdge is cheap as a base unit but extra bits are unreasonably expensive).