Memory Errors and Memtest86+

Early this year I deployed a server. As part of my normal procedure I ran the Memtest86+ memory test program before deploying it, and it showed no errors. After running for some time the machine became unreliable; yesterday it crashed twice and I had to replace it. I ran Memtest86+ before removing it from where it was installed and found several memory errors. When a server crashes I highly recommend running Memtest86+ before removing it, so that you at least know whether memory errors are involved.

As I want to use the machine elsewhere I want to discover the cause of the problem. The machine has two DIMM sockets (I’ll call them A and B) and two DIMMs (again I’ll call them A and B). After getting the machine home I first tested it with DIMM A in socket A (and DIMM B removed), which passed. Then I tested it with DIMM A in socket B, which also passed. I removed DIMM A and tested DIMM B in each socket, and those tests passed. Then I installed both DIMMs and again the test passed!

I now realise that I made a mistake in removing a DIMM when I got the machine home. I should have tested it again with the DIMMs in place. If the problem was due to heat or a poor contact made worse by vibration then the problem might have gone away during the trip home – it would have been handy to know that I would be unable to reproduce the problem! My mistake here was to change multiple factors at the same time. When diagnosing faults you should try to change one thing at a time so that you will know what fixes it!
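The "one thing at a time" rule can be checked mechanically before touching the hardware. Below is a minimal sketch in Python, not anything I actually ran: it treats each test as a set of named factors and reports which factors changed between consecutive tests. The factor names ("location", "socket A", "socket B") and the helper functions are hypothetical, chosen only to mirror the situation described above.

```python
from typing import Dict, List

# A test configuration: factor name -> value. The factor names used
# below are illustrative assumptions, not part of any real tool.
Config = Dict[str, str]


def changed_factors(before: Config, after: Config) -> List[str]:
    """Return the sorted names of the factors whose values differ."""
    keys = before.keys() | after.keys()
    return sorted(k for k in keys if before.get(k) != after.get(k))


def check_plan(steps: List[Config]) -> List[List[str]]:
    """For each consecutive pair of tests, list the factors that changed."""
    return [changed_factors(a, b) for a, b in zip(steps, steps[1:])]


as_found = {"location": "site", "socket A": "DIMM A", "socket B": "DIMM B"}

# What happened in the post: the trip home and the removal of DIMM B
# both took place before the next test was run.
actual = [
    as_found,
    {"location": "home", "socket A": "DIMM A", "socket B": "empty"},
]

# A more careful ordering: retest unchanged at home, then remove one DIMM.
careful = [
    as_found,
    {"location": "home", "socket A": "DIMM A", "socket B": "DIMM B"},
    {"location": "home", "socket A": "DIMM A", "socket B": "empty"},
]

for name, plan in (("actual", actual), ("careful", careful)):
    for i, changes in enumerate(check_plan(plan), start=1):
        flag = "OK" if len(changes) <= 1 else "AMBIGUOUS"
        print(f"{name} step {i}: changed {changes} -> {flag}")
```

With these inputs the "actual" sequence is flagged as ambiguous because two factors changed at once, while every step in the "careful" sequence changes exactly one factor.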

Now I am wondering what I should do next. Should I assume that it was just a bad contact and put the machine back in service? Suggestions appreciated.

3 comments to Memory Errors and Memtest86+

  • One of the folks on the Beowulf mailing list wrote today (hasn’t hit the archives yet, sorry):

    We do on clusters we ship/build. I specifically run tests to flesh out the memory errors. Sadly, memtest86 only gets the “obvious” errors, you will catch errors with that in most cases fairly quickly. I run several heavy duty (electronic structure) codes that pound on memory and CPU. Using that, we have found many mce errors that memtest86 misses. Most of the mce errors are single bit ecc errors, more often due to timing and access patterns than simple sequential walk through memory (memtest86). Nothing stresses memory like real applications.

  • etbe

    Interesting point, it seems obvious in retrospect but I never considered that before.

    I’m not sure to what degree that applies to me because my servers are generally fairly lightly loaded in terms of CPU use (it’s all IO and network bottlenecks). Although I guess that there probably are categories of motherboard problems that show up as memory errors when under DMA load from disk and network.

  • Here’s the link to Joe Landman’s posting now the nightly archiving run has happened.

    http://www.beowulf.org/archive/2007-July/018762.html

    I guess there’s a chance of the kernel tripping over a problem through an access pattern when doing buffer cache for I/O that isn’t found by memtest86.