Early this year I deployed a server. As part of my normal procedure I ran the Memtest86+ memory test program before deploying it, and it showed no errors. After some time in service the machine became unreliable; yesterday it crashed twice and I had to replace it. I ran Memtest86+ before removing it from where it was installed and found several memory errors. When a server crashes I highly recommend running Memtest86+ before removing it, so that you at least know whether bad memory is the cause of the problems.
As I want to use the machine elsewhere, I want to discover the cause of the problem. The machine has two DIMM sockets (I’ll call them A and B) and two DIMMs (again I’ll call them A and B). After getting the machine home I first tested it with DIMM A in socket A (and DIMM B removed), which passed. Then I tested it with DIMM A in socket B, which also passed. I removed DIMM A and tested DIMM B in each socket, and those tests passed too. Then I installed both DIMMs and again the test passed!
I now realise that I made a mistake in removing a DIMM when I got the machine home. I should have first tested it again with both DIMMs in place, exactly as they were when it failed. If the problem was due to heat, or to a poor contact made worse by vibration, then it might have gone away during the trip home, and it would have been handy to know that I could no longer reproduce it! My mistake was to change multiple factors at the same time (moving the machine and then reseating the DIMMs). When diagnosing faults you should try to change one thing at a time so that you will know which change fixed it!
Now I am wondering what I should do next. Should I assume that it was just a bad contact and put the machine back in service? Suggestions appreciated.