Date:      Tue, 28 Aug 2001 10:14:41 -0500
From:      Carroll Kong <damascus@home.com>
To:        freebsd-stable@FreeBSD.ORG
Subject:   Re: 4.4-rc instability
Message-ID:  <5.1.0.14.2.20010828100929.03f6c0c0@netmail.home.com>
In-Reply-To: <200108281011.MAA27326@lurza.secnetix.de>
References:  <20010828111949.B95570@freebie.xs4all.nl>

At 12:11 PM 8/28/01 +0200, Oliver Fromme wrote:
>Wilko Bulte <wkb@freebie.xs4all.nl> wrote:
>  > On Tue, Aug 28, 2001 at 11:00:37AM +0200, Oliver Fromme wrote:
>  > > Unfortunately, FreeBSD does not support ECC RAM (or did
>  > > that change recently?).
>  >
>  > ? ECC is a hardware function. All of my Alpha machines have ECC memory
>  > (for example).
>
>Sure, I once did have a PC with ECC memory, too.

Right, this is why you need a board that logs ECC events to the BIOS (not
as good, but better than nothing) or to the OS, which, yes, FreeBSD does
not support yet.  The point of ECC memory is not just that the memory is
more resilient to errors, but that it can report them.  So WHEN you see
your ECC memory correcting errors in the logs, it's time to replace it.
When the logs stay clean, it is doing fine.  I would say its best feature
is the error detection, since it can detect double-bit errors and correct
single-bit errors.
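
To make "correct single-bit, detect double-bit" concrete, here is a toy
sketch in Python.  It is only an illustration: it uses an extended
Hamming code over 4 data bits, whereas real ECC DIMMs protect 64-bit
words with 8 check bits, and the function names (encode, decode, flip)
are made up for this example.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # codeword slots that carry the 4 data bits
CHECK_POSITIONS = (1, 2, 4)     # Hamming check bits; slot 0 is overall parity

def encode(nibble):
    """Build an 8-bit SECDED codeword (as a list of bits) from 4 data bits."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    for p in CHECK_POSITIONS:
        # Check bit p covers every position whose index has bit p set.
        bits[p] = sum(bits[j] for j in range(1, 8) if j & p and j != p) % 2
    bits[0] = sum(bits[1:]) % 2  # overall parity over slots 1..7
    return bits

def decode(bits):
    """Return (status, data); data is None when the word is uncorrectable."""
    bits = list(bits)
    syndrome = 0
    for j in range(1, 8):
        if bits[j]:
            syndrome ^= j        # XOR of the indices of all set bits
    overall = sum(bits) % 2      # 0 when overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one and repair that slot
        # (syndrome 0 means the overall parity bit itself was hit).
        bits[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity holds but the syndrome is nonzero: two bits flipped.
        return "uncorrectable", None
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return status, data

def flip(bits, *positions):
    """Return a copy of the codeword with the given slots inverted."""
    out = list(bits)
    for pos in positions:
        out[pos] ^= 1
    return out

word = encode(0b1011)
print(decode(word))                    # clean word
print(decode(flip(word, 5)))           # one flip: repaired
print(decode(flip(word, 2, 6)))        # two flips: flagged, not repaired
print(decode(flip(word, 3, 5, 6)))     # three flips: "repaired" to wrong data
```

The "corrected" status is exactly the event you would want logged.  The
last call shows the limit of the scheme: three flips look like a single
error to the decoder, so it reports a correction while handing back bad
data.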

>When that ECC stick started dying, at first I didn't notice
>at all, because the chipset (i.e. the memory controller in
>the northbridge, I think) corrected the errors silently.
>When the errors grew so that they weren't ECC-correctable
>anymore, processes started dying on sigsegv, and it got
>worse at a fast pace.  Soon I couldn't even boot into
>single-user anymore, because the /bin/sh sigsegved
>instantly.
>
>So, the bottom line is, ECC memory is good as long as there
>are few enough errors that they can be corrected by the
>chipset.  If there are more of them, you're doomed just as
>if you had no ECC in the first place.  At least that's the
>experience of mine with i386 P2/Athlon mainboards.  Alpha
>might be a different story.
>
>Frankly, I expected the machine to halt or freeze with
>something like an NMI or "parity check error", like the old
>PCs with parity SIMMs did.  Would have been better than
>just randomly dying.

But ECC cannot reliably detect triple-bit errors, which is probably what
you had, and NEITHER could the parity SIMMs, which already fail on
double-bit errors.  Parity would just silently say "yahoo we are ok!"
since the second flipped bit "undoes" the parity change caused by the
first.  Quite primitive.  You got really unlucky and probably got nailed
with the rare triple-bit error.  ZOUNDS.  Either that or you got
double-bit errors, but your machine did not log them.  (If your BIOS or
OS does not support logging them, you are out of luck.)
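
A minimal sketch of why plain parity misses double-bit errors, in Python.
The helper names (parity_bit, parity_ok) and the sample byte are
assumptions for illustration only; this models one even-parity bit per
byte, the way parity SIMMs stored it.

```python
def parity_bit(byte):
    """Even-parity bit for an 8-bit value: 1 if the count of set bits is odd."""
    return bin(byte & 0xFF).count("1") % 2

def parity_ok(byte, stored_parity):
    """True when the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity

stored = 0b10110010
p = parity_bit(stored)

one_flip = stored ^ 0b00000100    # a single bit inverted
two_flips = stored ^ 0b00010100   # two bits inverted

print(parity_ok(one_flip, p))     # False: the error is caught
print(parity_ok(two_flips, p))    # True: the second flip hides the first
```

Any even number of flips leaves the parity bit looking valid, so the
check passes on corrupted data.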

>Even better would be if the operating system recognized the
>correctable errors and log them somewhere, and (_even_
>better!) offer the possibility to disable memory pages with
>known errors.  Tru64 on Alpha supports exactly this.
>Solaris on Sparc does, too.  FreeBSD does not.  That's what
>I meant when I wrote that FreeBSD does not support ECC RAM.
>(I'm sorry, I should have been more elaborate on this.
>Please excuse me.)

Yes, this is correct, this is the RIGHT way to work with ECC.
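
As a rough sketch of the bookkeeping being described (log correctable
errors, then stop using pages that keep producing them), something like
the following Python would do.  Everything here is hypothetical: the
class name, the 4 KiB page size and the threshold of 3 are assumptions
for illustration, and this is not how Tru64 or Solaris actually
implement it.

```python
from collections import Counter

PAGE_SIZE = 4096          # assumption: 4 KiB pages
RETIRE_THRESHOLD = 3      # assumption: arbitrary policy value

class EccErrorLog:
    """Counts corrected errors per physical page and retires repeat offenders."""

    def __init__(self):
        self.corrected = Counter()   # page number -> corrected-error count
        self.retired = set()         # pages the allocator should avoid

    def report_corrected(self, phys_addr):
        """Record one corrected error at phys_addr; return the affected page."""
        page = phys_addr // PAGE_SIZE
        self.corrected[page] += 1
        if self.corrected[page] >= RETIRE_THRESHOLD:
            self.retired.add(page)
        return page

    def usable(self, phys_addr):
        """False once the page holding phys_addr has been retired."""
        return phys_addr // PAGE_SIZE not in self.retired

log = EccErrorLog()
for _ in range(3):
    log.report_corrected(0x1234)

print(log.usable(0x1000))   # same page as 0x1234, now retired
print(log.usable(0x2000))   # neighbouring page, still in use
```

The hard part in a real kernel is everything this leaves out: getting
the error address from the chipset, and moving live data off a page
before retiring it.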

>I think I still have that broken DIMM somewhere in a
>drawer, and I'm willing to send it to anyone who wants to
>look at it and improve FreeBSD's handling of this (I
>already offered this a few months ago, but got no reply).
>On the other hand, this particular one is probably too
>broken to be even useful for this kind of stuff.
>
>Anyhow, that's my story about ECC memory.
>
>Regards
>    Oliver

I asked about this a while ago too.  The real killer is just that it is
very, very hard to code up.  :(



-Carroll Kong

