Early this year I deployed a server. As part of my normal procedure I ran the Memtest86+ memory test program (which showed no errors) before deploying it. After running for some time the machine started to become unreliable; yesterday it crashed twice and I had to replace it. I ran Memtest86+ before removing it from where it was installed and found several memory errors. When a server crashes I highly recommend running Memtest86+ before removing it, so that you at least know the cause of the problem.
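Memtest86+ boots from its own media, so it can only be run once the machine is out of service. If you want a rough first check from within a running Linux system, something along the lines of the sketch below can catch gross faults; it's a crude userspace pattern test with an arbitrarily chosen 256MB buffer, not a substitute for a proper Memtest86+ pass (it can only touch memory the kernel hands to one process).

```c
#include <stdio.h>
#include <stdlib.h>

/* A crude userspace memory check: fill a large buffer with a few bit
 * patterns and read each one back.  It only sees memory given to this
 * process, so it is no replacement for booting Memtest86+. */
int main(void)
{
    const size_t len = 256UL * 1024 * 1024;        /* 256MB - adjust to suit */
    const size_t words = len / sizeof(unsigned long);
    const unsigned long patterns[] = { 0xAAAAAAAAUL, 0x55555555UL, 0UL, ~0UL };
    volatile unsigned long *buf = malloc(len);
    int errors = 0;

    if (buf == NULL) {
        perror("malloc");
        return 1;
    }
    for (size_t p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
        for (size_t i = 0; i < words; i++)
            buf[i] = patterns[p];
        for (size_t i = 0; i < words; i++) {
            if (buf[i] != patterns[p]) {
                printf("mismatch at word %zu: wrote %lx read %lx\n",
                       i, patterns[p], (unsigned long)buf[i]);
                errors++;
            }
        }
    }
    free((void *)buf);
    return errors ? 1 : 0;
}
```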
As I want to use the machine elsewhere I need to discover the cause of the problem. The machine has two DIMM sockets (I’ll call them A and B) and two DIMM modules (which I’ll also call A and B). After getting the machine home I first tested it with DIMM A in socket A (and DIMM B removed), which passed, then with DIMM A in socket B, which also passed. I removed DIMM A and tested DIMM B in each socket, and those tests passed too. Then I installed both DIMMs and again the test passed!
I now realise that I made a mistake in removing a DIMM when I got the machine home; I should have tested it again first with both DIMMs in place. If the problem was due to heat, or to a poor contact made worse by vibration, then it might have gone away during the trip home, and it would have been handy to know that I would be unable to reproduce it at all. My mistake here was to change multiple factors at the same time. When diagnosing faults you should try to change one thing at a time so that you know which change fixed it!
Now I am wondering what I should do next. Should I assume that it was just a bad contact and put the machine back in service? Suggestions appreciated.
One of the folks on the Beowulf mailing list (Joe Landman) made a relevant point today; his message hasn’t hit the archives yet, sorry.
It’s an interesting point; it seems obvious in retrospect, but I had never considered it before.
I’m not sure to what degree it applies to me, because my servers are generally fairly lightly loaded in terms of CPU use (it’s all IO and network bottlenecks), although I guess there probably are categories of motherboard problems that show up as memory errors under DMA load from the disk and network.
Here’s the link to Joe Landman’s posting now that the nightly archiving run has happened.
http://www.beowulf.org/archive/2007-July/018762.html
I guess there’s a chance of the kernel tripping over a problem through an access pattern it uses when managing the buffer cache for I/O, which wouldn’t be found by memtest86.
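If that’s the worry, one rough way to exercise those paths is to push a known pattern through the buffer cache and verify it on the way back. Below is a minimal sketch of the idea; the file path and sizes are arbitrary assumptions, and for a meaningful run you would drop the page cache between the write and the read so the data really travels to disk and back.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLOCK   (1024 * 1024)    /* 1MB per write */
#define BLOCKS  1024             /* 1GB of data in total - adjust to suit */

/* Push a repeatable pattern through the kernel's buffered I/O path and
 * verify it on read-back.  This exercises the buffer cache and DMA paths
 * that a standalone memory tester never touches. */
int main(void)
{
    static unsigned char wbuf[BLOCK], rbuf[BLOCK];
    const char *path = "/tmp/iopattern.tmp";     /* arbitrary test file */
    FILE *f;
    int errors = 0;

    srand(12345);                                /* deterministic pattern */
    for (size_t i = 0; i < BLOCK; i++)
        wbuf[i] = (unsigned char)rand();

    if ((f = fopen(path, "wb")) == NULL) { perror(path); return 1; }
    for (int b = 0; b < BLOCKS; b++) {
        if (fwrite(wbuf, 1, BLOCK, f) != BLOCK) { perror("fwrite"); return 1; }
    }
    fclose(f);

    /* Ideally drop the page cache here (as root: sysctl vm.drop_caches=3)
     * so the read-back really comes from disk rather than cached pages. */

    if ((f = fopen(path, "rb")) == NULL) { perror(path); return 1; }
    for (int b = 0; b < BLOCKS; b++) {
        if (fread(rbuf, 1, BLOCK, f) != BLOCK) { perror("fread"); return 1; }
        if (memcmp(wbuf, rbuf, BLOCK) != 0) {
            printf("data mismatch in block %d\n", b);
            errors++;
        }
    }
    fclose(f);
    remove(path);
    return errors ? 1 : 0;
}
```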