In a comment on my post about memory errors, Chris Samuel referred me to an interesting post on the Beowulf mailing list on the same topic. In that list posting Joe Landman says “it is pretty easy to deduce which chip is problematic (assuming it is ram) based upon the address” and then describes how to do so using Machine Check Exception (MCE) data from an error detected or corrected by the ECC system.
Damn the vendors of motherboards for switching to 8-bit RAM just when it was about to be useful to have 9-bit RAM!
286-class machines had 9 bits of RAM per byte, with one bit used for parity. Reported parity errors were extremely rare, largely because memory errors could affect more than one bit at a time and would therefore often give a correct parity; if multiple-bit errors were totally random then parity might be expected to pass 50% of the time! The Pentium was the first commonly used CPU to operate with a 64-bit memory bus. If it had 9 bits per byte it would have had a 72-bit-wide memory bus, which a Hamming code could use to detect and correct single-bit errors, detect all double-bit errors, and detect some errors involving more bits. This would mean that some errors would be recoverable and the system would report the location of the memory problem, instead of the error being fatal and giving no information.
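To make the difference between parity and a 72-bit Hamming code concrete, here is a minimal Python sketch. It is a textbook extended Hamming (SECDED) layout with 64 data bits, 7 check bits, and one overall parity bit; the function names and the bit ordering are my own assumptions for illustration, and real memory controllers arrange their check bits differently.

```python
CHECK_POSITIONS = (1, 2, 4, 8, 16, 32, 64)  # power-of-two positions hold check bits


def parity_bit(byte):
    """Even-parity bit for one byte: 1 if the byte has an odd number of 1 bits."""
    return bin(byte).count("1") % 2


def is_data_position(pos):
    """Positions 1..71 that are not powers of two carry data bits."""
    return (pos & (pos - 1)) != 0


def secded_encode(data_bits):
    """Encode 64 data bits (list of 0/1) into a 72-bit codeword (list of 0/1)."""
    if len(data_bits) != 64:
        raise ValueError("expected exactly 64 data bits")
    code = [0] * 72
    bits = iter(data_bits)
    for pos in range(1, 72):
        if is_data_position(pos):
            code[pos] = next(bits)
    # Each check bit makes the parity of its covered positions even.
    for p in CHECK_POSITIONS:
        code[p] = sum(code[pos] for pos in range(1, 72) if pos & p) % 2
    # Position 0 holds parity over the whole word, enabling double-error detection.
    code[0] = sum(code) % 2
    return code


def secded_decode(code):
    """Return (status, flipped_position, data_bits) for a 72-bit codeword."""
    syndrome = 0
    for pos in range(1, 72):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_odd = sum(code) % 2 == 1
    position = None
    if overall_odd:
        if syndrome >= 72:
            # Odd number of flips, but no single position explains them.
            return ("uncorrectable", None, None)
        # Single-bit error: the syndrome is the flipped position
        # (0 means the overall parity bit itself was hit).
        code = code[:]
        code[syndrome] ^= 1
        status, position = "corrected", syndrome
    elif syndrome != 0:
        # Overall parity looks fine but the syndrome does not: two bits flipped.
        return ("double", None, None)
    else:
        status = "ok"
    data = [code[pos] for pos in range(1, 72) if is_data_position(pos)]
    return (status, position, data)
```

Flipping any two bits of a byte leaves `parity_bit` unchanged, which is why parity misses such errors, whereas `secded_decode` reports them as `"double"` and repairs any single flipped bit while naming its position in the codeword.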
Now it’s become a standard feature in servers to have ECC memory (at significantly greater cost) and most desktop machines don’t have ECC support – I wonder whether this is aimed at price-gouging people who need reliable servers (they can’t use cheap RAM from desktop machines).
Unfortunately, due to issues of electricity use, noise, and price, I have to run all the servers that are most important to me on desktop PC hardware. Is anyone selling ECC RAM in desktop systems? I am particularly interested in machines that are a couple of years old so I can get them cheap at auction…
You can get a cheap Dell Poweredge server. They are built with ECC memory.
http://www1.ap.dell.com/content/products/compare.aspx/tower_servers?c=au&cs=aubsd1&l=en&s=bsd
Unbuffered ECC memory is cheap. 2GB costs about 90 euros. For example, the Intel D975XBX2 motherboard (975X chipset) and the integrated memory controller of the Athlon64 support unbuffered ECC with error correction. Sadly, not all Athlon64 motherboards support enabling the ECC features of the memory controller. Many Asus boards support enabling ECC, but some of those have broken ACPI.
The expensive buffered ECC memory is only needed if you need to install more than 4 DIMMs (Opteron/Xeon).
Forgot to say: Kingston sells Unbuffered ECC in its ValueRAM series. Look for Kingston part numbers KVR667D2E5* for 667MHz DDR2 unbuffered ECC or KVR400X72C3* for 400MHz DDR unbuffered ECC. (* is a wildcard)
Buy your servers on eBay. Businesses don’t tend to buy 2nd hand servers, so you can pick up very nice machines very cheaply. My Myth box is a dual 1.7 gig Xeon! Be aware though, rack-mount servers are NOISY!
Philippe: You are correct that those Dell servers are cheap. But they do have down-sides of having Xeon CPUs (large electricity use) and being significantly more expensive than some of the servers I deploy. A $300 server with RAID is a common requirement for my clients.
Simon: Sure, servers are cheap by server standards, but that doesn’t make them cheap by desktop PC standards. Servers also typically stay in service for 5 years, and for a minimum of 3, while desktop machines are typically used for 3 years in an office environment and 1.5 years in a home environment.
As for a Myth machine being a dual-CPU Xeon: why do you need that? If you have a digital TV card then you shouldn’t need much CPU power. I have talked to people who run multiple channels at the same time on a ~500MHz P3 system!