Improving Computer Reliability

In a comment on my post about Taxing Inferior Products [1] Ben pointed out that most crashes are due to software bugs. Both Ben and I work on the Debian project and have had significant experience of software causing system crashes for Debian users.

But I still think that the widespread adoption of ECC RAM is a good first step towards improving the reliability of the computing infrastructure.

Currently when software developers receive bug reports they always wonder whether the bug was caused by defective hardware. So when bugs can’t be reproduced (or can’t be reproduced in a way that matches the bug report) they often get put in a list of random crash reports and no further attention is paid to them.

When a system has ECC RAM and a filesystem that uses checksums for all data and metadata we can have greater confidence that random bugs aren’t due to hardware problems. For example, if a user reports a file corruption bug that they can’t reproduce, and it occurred on the Ext3 filesystem on a typical desktop PC, I’ll wonder about the reliability of the storage and RAM in their system. If, however, the same bug report came from someone who had ECC RAM and used the ZFS filesystem, then I would be more likely to consider it a software bug.
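As a rough illustration of why checksums help here, the following sketch stores a checksum next to each block and verifies it on every read. It is not how ZFS or BTRFS are implemented: `zlib.crc32` is only a stand-in for whatever checksum a real filesystem uses, and `BLOCK_SIZE`, `write_block`, `read_block` and `flip_bit` are hypothetical names made up for this example.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def write_block(data: bytes) -> tuple[bytes, int]:
    """Return the block together with the checksum that would be stored with it."""
    return data, zlib.crc32(data)


def read_block(data: bytes, stored_checksum: int) -> bytes:
    """Return the block if it still matches its checksum, otherwise raise IOError."""
    if zlib.crc32(data) != stored_checksum:
        raise IOError("checksum mismatch: block was corrupted after it was written")
    return data


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a hardware fault by flipping a single bit in the block."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    block, checksum = write_block(b"x" * BLOCK_SIZE)
    read_block(block, checksum)  # an intact block is returned without complaint
    try:
        read_block(flip_bit(block, 12345), checksum)
    except IOError as err:
        print("detected:", err)
```

A filesystem without checksums would have returned the flipped block silently, and the resulting bug report would look like a software fault.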

The current situation is that every part of a typical PC is unreliable. When a bug can be attributed to one of several pieces of hardware, the OS kernel, or even malware (in the case of MS Windows) it’s hard to know where to start in tracking it down. Most users have given up and accepted that crashing periodically is just what computers do. Even experienced Linux users sometimes give up on tracking down bugs properly because it can be very difficult to file a good bug report. For the typical computer user (who doesn’t have the power that a skilled Linux user has) it’s much worse: filing a bug report seems about as useful as praying.

One of the features of ECC RAM is that the motherboard can inform the user of memory errors (either at boot time, after an NMI reboot, or through system diagnostics) so the faulty hardware can be replaced. A feature of filesystems such as ZFS and BTRFS is that they can inform the user of drive corruption problems, sometimes before any data is lost.
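On Linux, one form of such system diagnostics is the kernel's EDAC subsystem, which exposes per-memory-controller error counters in sysfs. The sketch below reads them, assuming the EDAC driver for the machine's memory controller is loaded and that the counters are in the usual `/sys/devices/system/edac/mc/mcN/` location. The function name `read_edac_counts` is made up for this example.

```python
from pathlib import Path

# Usual sysfs location of the EDAC memory controller counters (assumed here).
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict[str, dict[str, int]]:
    """Return corrected (ce) and uncorrected (ue) error counts per memory controller."""
    counts: dict[str, dict[str, int]] = {}
    for controller in sorted(root.glob("mc[0-9]*")):
        entry: dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = controller / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[controller.name] = entry
    return counts
```

Calling `read_edac_counts()` returns an empty dictionary when no EDAC memory controller is present, for example on a system without ECC RAM or without the driver loaded. A `ce_count` that keeps rising points to a DIMM that should be replaced before it produces errors that can't be corrected.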

My recommendation of BTRFS in regard to system integrity does have a significant caveat: at this time the decrease in reliability due to BTRFS crashes outweighs the increase in reliability due to checksums. This isn’t all bad because at least when BTRFS crashes you know what the problem is, and BTRFS is rapidly improving in this regard. When I discuss BTRFS in posts like this one I’m considering the theoretical issues related to the design, not the practical issues of software bugs. That said, I’ve twice had a BTRFS filesystem seriously corrupted by a faulty DIMM on a system without ECC RAM.

7 comments to Improving Computer Reliability

  • shirish

    Dear Russel,
    The only way I know to diagnose DIMM/memory issues is via memtest86 or its derivatives. I’m curious to know whether that’s how you found out, or was it something else?

  • shirish: On the system where a bad DIMM corrupted a BTRFS filesystem, errors were reported by memtest86+. But memtest86+ doesn’t always catch errors. Sometimes systems just crash when doing intensive tasks and stop crashing when you replace the RAM.

  • sam

    Is a possible solution to have widespread automated stack trace uploads? For example, here is a recent crash I saw in Firefox:
    https://crash-stats.mozilla.com/report/list?product=Firefox&signature=XPCWrappedNative%3A%3AFlatJSObjectFinalized%28%29
    With hundreds of similar traces, you can be fairly sure that it’s real. As a user I did not have to do any more than click OK in the ‘report to developers’ dialog.

  • sam: That looks like a really good solution for the major projects that have significant resources to manage such software and a significant user base to report bugs.

    But if you have few users and each bug is only reported once then it won’t work so well.

    It would be good if we had something like this in the Debian project (and other Linux distributions) to support such bug reports for many applications.

  • sam

    Fedora ( https://retrace.fedoraproject.org/faf/summary/ ) and Ubuntu ( https://wiki.ubuntu.com/Apport ) have crash reporting systems too. Maybe Debian could use one of those.

  • Kelly Clowers

    I agree ECC should be way more of a thing than it is. As far as I can tell, Athlon 64 and later CPUs supported ECC for a long time, but recently it seems only Opterons do. Intel only ever supports it in Xeons. My next machine in a few years will have ECC, but it may need a server MB to do it, which is a pain, because server MBs are meant to address very different problems. I’ll need a lot of PCIe add-in cards…

    Shirish: Memtest86 et al can be useful but they don’t catch things that only happen under heavy load. stress(1) can work pretty well for that, though it could be better. At my last job I saw lots of old Opterons with DDR2 have memory issues that showed up under heavy Hadoop load (EDAC errors), but were clean under memtest86 even overnight, whereas stress(1) would find them in a few to maybe tens of minutes. Figure out which stick, change it, no more EDAC errors while Hadoop was running.

  • I have probably said this before, but BTRFS has very rarely given me any trouble since Squeeze, although since then I have only used it on hard disks, not on SSDs. It works reasonably well on a USB-connected SD card, but not on every USB storage key. Some of those seem to have odd controller logic.
    My FSC Futro S500 thin client has only crashed once, when I tried to use it while installing system updates. This is not recommended at all. Software upgrades block the system. It is possible to write some text during the process, but one might end up with a broken, non-bootable root filesystem (on a CompactFlash card). Apart from that the box is very reliable and hardly ever crashes, although it runs the less stable testing branch.
    See, for example:
    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=745240