Date: Fri, 25 Jul 2008 16:04:16 +0200
From: Erik Trulsson <ertr1013@student.uu.se>
To: Michael Powell <nightrecon@verizon.net>
Cc: freebsd-questions@freebsd.org
Subject: Re: FreeBSD and ECC memory?
Message-ID: <20080725140416.GA70841@owl.midgard.homeip.net>
In-Reply-To: <g6ck9v$b1b$1@ger.gmane.org>
References: <4889BAE0.6030308@skoberne.net> <g6chl3$22s$1@ger.gmane.org> <20080725130052.GA70571@owl.midgard.homeip.net> <g6ck9v$b1b$1@ger.gmane.org>

On Fri, Jul 25, 2008 at 09:28:11AM -0400, Michael Powell wrote:
> Erik Trulsson wrote:
> [snip]
> >
> > No, non-ECC RAM cannot detect or correct any errors at all.  (Old
> > parity-RAM could detect, but not correct, single-bit errors.)
>
> Actually quite true. The old parity bit functionality that was removed from
> RAM and then called "non-ECC" actually migrated to the memory controller.
> So yes, it isn't the RAM that does it. Poor choice of wording on my part.

Not quite.  Old parity-RAM usually had an extra parity bit for every 8 data
bits.  By computing the parity (odd or even number of 1s) of the data bits
and comparing it with the value of the parity bit (which got set when you
wrote to memory) you could see if any single bit had been flipped.
(ECC also uses these extra bits, but uses them in a smarter way.)

Non-ECC RAM (as well as older non-parity RAM) does not have these extra bits,
and therefore you cannot detect any spontaneous bit-flips inside the RAM,
since you have nothing to compare the data read against.

(The reason non-ECC RAM is more common than ECC RAM is simply that these
extra bits require extra chips on the memory module and therefore cost more
money - money which most people are not prepared to pay.)

(If you count the chips on a non-ECC memory module you will find that their
number is usually a multiple of 8, while on ECC- or parity-RAM it is usually
a multiple of 9.)

Many modern memory controllers do have parity checking (or even ECC) on the
buses between controller and RAM and between controller and CPU.  This lets
you detect (or even fix) any errors that may happen as data is transferred
from RAM to CPU.  It does not let you detect random errors inside the RAM,
which parity-RAM or ECC RAM can let you do.

>
> > ECC is generally capable of detecting multi-bit errors and fixing
> > single-bit errors. (There are different ways of implementing ECC.  Some of
> > them might well be able to fix multi-bit errors too.)
>
> These cost lots of money. Common on "Big Iron". In fact, non-ECC as an
> option isn't even offered on "B.I".
>
> [snip]
> >> The purpose of these schemes is to compensate for the fact that in every
> >> so many (some large number) of memory transactions there may be a bit
> >> that gets flipped. If this is happening more often than (some large
> >> number) then there is a defect present. ECC just buys you "uptime" in the
> >> event there are more errors than there should be.
> >
> > Note that random, spontaneous bit flips can happen (infrequently) even in
> > perfectly good RAM.  (Due to cosmic rays, radioactive decay in surrounding
> > material, and similar stuff.  (No, I am not joking.))  ECC will handle
> > such errors just fine, and that is the main reason why I would want ECC.
>
> Especially true in satellites. The RAM in a satellite, or other spacecraft,
> must be radiation hardened to be usable at all. And yes, it is no joke but
> the truth what you say.
>
> For me the dividing line is when lots of people depend on a box 24/7 it must
> be ECC. A storage server in someone's basement doesn't necessarily fit into
> this category.

It depends also on what kind of data is stored on the server.

One of the really nasty problems that can occur with random bit-flips in
non-ECC RAM is that important data can get silently corrupted.  You can get
an error in your database or spreadsheet or payroll data or whatever without
noticing until it is too late (by which time all your backups will probably
have this wrong data too).  Depending on the data this can be VERY bad, even
if it is a system that is only used occasionally by a few people.

Memory errors which cause the computer to crash can be quite disruptive, but
they are at least easily noticed, and can then be handled.

>
> > You can also get defective memory modules, but such can usually be
> > detected by running memtest86 or similar.  ECC can usually handle memory
> > modules that have some bits more or less permanently wrong, but such
> > modules should be replaced as soon as possible.
> >
>
> I agree - I was kind of harping on the "defective" idea. If it's defective
> the manufacturer owes me a replacement, as in yesterday.

Yes, and in the (luckily fairly uncommon) case that one of the chips on a
memory module suddenly decides to stop working, ECC can serve the same
purpose as RAID does for disks - it allows the system to keep going until
you have time to replace the broken part.  (Which should be done ASAP, since
if you get random bit-flips in addition to a broken chip, ECC will not be
able to correct those bits.)

-- 
<Insert your favourite quote here.>
Erik Trulsson
ertr1013@student.uu.se
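
To make the parity scheme described in the message concrete, here is a
minimal Python sketch.  It models one parity bit per 8 data bits and assumes
even parity; the function names (parity_bit, write_byte, read_ok) and the
sample byte are purely illustrative and do not correspond to any real memory
hardware interface.  It shows that a single flipped bit is detected, while
two flipped bits cancel out and go unnoticed, which is one reason ECC uses
the extra bits in a smarter way.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the number of 1 bits is odd."""
    return bin(byte & 0xFF).count("1") % 2


def write_byte(byte):
    """Model a write to parity RAM: keep the 8 data bits plus one parity bit."""
    data = byte & 0xFF
    return data, parity_bit(data)


def read_ok(stored):
    """Model a read: recompute the parity and compare it with the stored bit."""
    data, parity = stored
    return parity_bit(data) == parity


stored = write_byte(0b10110010)

# Simulate spontaneous bit flips in the data bits after the write.
flipped_once = (stored[0] ^ 0b00000100, stored[1])   # one bit flipped
flipped_twice = (stored[0] ^ 0b00000110, stored[1])  # two bits flipped

print(read_ok(stored))         # True:  data is intact
print(read_ok(flipped_once))   # False: single-bit error is detected
print(read_ok(flipped_twice))  # True:  double-bit error slips through
```

Note that the check can only say that something is wrong, not which bit is
wrong, so plain parity can detect but not correct the error.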