In a comment on my post about memory errors, Chris Samuel referred me to an interesting post on the Beowulf mailing list on the same topic. In that posting Joe Landman says “it is pretty easy to deduce which chip is problematic (assuming it is ram) based upon the address” and then describes how to do so using Machine Check Exception (MCE) data from an error detected or corrected by the ECC system.
Damn the vendors of motherboards for switching to 8-bit RAM just when it was about to be useful to have 9-bit RAM!
286-class machines had 9 bits of RAM per byte, with one bit used for parity. Parity errors were extremely rare, largely because memory errors could affect more than one bit at a time and would therefore often give a correct parity: if multiple-bit errors were totally random then parity might be expected to pass about 50% of the time! The Pentium was the first commonly used CPU to operate with a 64-bit memory bus. If it had kept 9 bits per byte it would have had a 72-bit-wide memory bus, and an extended Hamming code could use the 8 extra bits to detect and correct single-bit errors, detect all double-bit errors, and detect some errors involving more bits. This would mean that some errors would be recoverable and would report the location of the memory problem instead of being fatal and giving no information.
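To make the comparison concrete, here is a minimal Python sketch of both schemes. It first counts how many of the possible error patterns on a 9-bit parity word go unnoticed, then builds a 72-bit extended Hamming codeword from 64 data bits. The function names and the bit layout (overall parity at position 0, check bits at the power-of-two positions) are my own textbook-style assumptions for illustration; real memory controllers use their own check-bit arrangements.

```python
# Illustrative sketch only: the layout below is a textbook extended Hamming
# code, not the check-bit arrangement of any particular ECC chipset.

def parity_ok(word9: int) -> bool:
    """Even parity over a 9-bit word (8 data bits plus 1 parity bit)."""
    return bin(word9 & 0x1FF).count("1") % 2 == 0

# An error pattern slips past parity exactly when it flips an even number
# of bits, so count the even-weight patterns among all non-zero ones.
missed = sum(1 for e in range(1, 512) if parity_ok(e))
missed_fraction = missed / 511

# Position 0 holds the overall parity bit, positions 1, 2, 4, ..., 64 hold
# check bits, and the remaining 64 positions (up to 71) hold the data.
CHECK_POS = [1 << k for k in range(7)]
DATA_POS = [p for p in range(1, 72) if p & (p - 1)]

def syndrome(word: int) -> int:
    """XOR of the positions (1..71) of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s

def encode(data: int) -> int:
    """Pack 64 data bits into a 72-bit codeword."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            word |= 1 << pos
    s = syndrome(word)
    for p in CHECK_POS:
        if s & p:
            word |= 1 << p      # set check bits so the syndrome becomes 0
    if bin(word).count("1") % 2:
        word |= 1               # overall parity keeps the total weight even
    return word

def decode(word: int):
    """Return (status, data, flipped_position) for a 72-bit word."""
    s = syndrome(word)
    odd_weight = bin(word).count("1") % 2 == 1
    flipped = None
    if odd_weight:
        if s >= 72:             # points outside the word: not a single flip
            return "uncorrectable", None, None
        word ^= 1 << s          # odd weight: treat as a single flip at s
        flipped, status = s, "corrected"
    elif s != 0:                # even weight but bad syndrome: double error
        return "uncorrectable", None, None
    else:
        status = "ok"
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (word >> pos) & 1:
            data |= 1 << i
    return status, data, flipped

if __name__ == "__main__":
    cw = encode(0x0123456789ABCDEF)
    print(round(missed_fraction, 3))             # share parity would miss
    print(decode(cw ^ (1 << 37)))                # one flipped bit
    print(decode(cw ^ (1 << 5) ^ (1 << 60))[0])  # two flipped bits
```

The third value returned by `decode` is the point of the exercise: a corrected error comes with the position of the bad bit, which is the kind of information a simple parity check can never give.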
ECC memory has now become a standard feature in servers (at significantly greater cost), while most desktop machines don’t have ECC support. I wonder whether this is aimed at price-gouging people who need reliable servers (they can’t use cheap RAM from desktop machines).
Unfortunately, due to issues of electricity use, noise, and price, I have to run all the servers that are most important to me on desktop PC hardware. Is anyone selling desktop systems with ECC RAM? I am particularly interested in machines that are a couple of years old so I can get them cheap at auction…