From owner-freebsd-stable Mon Sep 3 23:42:56 2001
Delivered-To: freebsd-stable@freebsd.org
Received: from cc415903-b.ebnsk1.nj.home.com (cc415903-b.ebnsk1.nj.home.com [24.180.16.158])
    by hub.freebsd.org (Postfix) with SMTP id 32BF437B409
    for ; Mon, 3 Sep 2001 23:42:46 -0700 (PDT)
Received: (qmail 74393 invoked from network); 28 Aug 2001 14:15:18 -0000
Received: from athena.faerunhome.com (HELO athena.home.com) (192.168.0.2)
    by cc415903-b.ebnsk1.nj.home.com with SMTP; 28 Aug 2001 14:15:18 -0000
Message-Id: <5.1.0.14.2.20010828100929.03f6c0c0@netmail.home.com>
X-Sender: damascus@netmail.home.com
X-Mailer: QUALCOMM Windows Eudora Version 5.1
Date: Tue, 28 Aug 2001 10:14:41 -0500
To: freebsd-stable@FreeBSD.ORG
From: Carroll Kong
Subject: Re: 4.4-rc instability
In-Reply-To: <200108281011.MAA27326@lurza.secnetix.de>
References: <20010828111949.B95570@freebie.xs4all.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

At 12:11 PM 8/28/01 +0200, Oliver Fromme wrote:
>Wilko Bulte wrote:
> > On Tue, Aug 28, 2001 at 11:00:37AM +0200, Oliver Fromme wrote:
> > > Unfortunately, FreeBSD does not support ECC RAM (or did
> > > that change recently?).
> >
> > ? ECC is a hardware function. All of my Alpha machines have ECC memory
> > (for example).
>
>Sure, I once did have a PC with ECC memory, too.

Right, this is why you need one with logging to the BIOS (not as good,
but better than nothing), or logging to the OS, which yes, FreeBSD does
not support yet. The point of ECC memory is not just that the memory
itself is necessarily more resilient to errors, but that it can report
them. So, WHEN you see your ECC memory correcting errors in the logs,
it's time to replace it. When it does not, it is doing fine.
I would say its best feature is the error detection, since it can detect
double bit errors and correct single bit errors.

>When that ECC stick started dying, at first I didn't notice
>at all, because the chipset (i.e. the memory controller in
>the northbridge, I think) corrected the errors silently.
>When the errors grew so that they weren't ECC-correctable
>anymore, processes started dying on sigsegv, and it got
>worse at a fast pace. Soon I couldn't even boot into
>single-user anymore, because the /bin/sh sigsegved
>instantly.
>
>So, the bottom line is, ECC memory is good as long as there
>are few enough errors that they can be corrected by the
>chipset. If there are more of them, you're doomed just as
>if you had no ECC in the first place. At least that's the
>experience of mine with i386 P2/Athlon mainboards. Alpha
>might be a different story.
>
>Frankly, I expected the machine to halt or freeze with
>something like an NMI or "parity check error", like the old
>PCs with parity SIMMs did. Would have been better than
>just randomly dying.

But it cannot reliably detect triple bit errors, which is probably what
you had, and NEITHER could the parity SIMMs, which already fail on
double bit errors. Parity would just silently say "yahoo we are ok!"
since the second flipped bit "undoes" the first one in the parity sum.
Quite primitive. You got really unlucky and were probably nailed by the
rare triple bit error. ZOUNDS. Either that, or you got double bit
errors but your machine did not log them. (If your BIOS or OS does not
support logging them, you are out of luck.)

>Even better would be if the operating system recognized the
>correctable errors and log them somewhere, and (_even_
>better!) offer the possibility to disable memory pages with
>known errors. Tru64 on Alpha supports exactly this.
>Solaris on Sparc does, too. FreeBSD does not. That's what
>I meant when I wrote that FreeBSD does not support ECC RAM.
>(I'm sorry, I should have been more elaborate on this.
>Please excuse me.)

Yes, this is correct. This is the RIGHT way to work with ECC.

>I think I still have that broken DIMM somewhere in a
>drawer, and I'm willing to send it to anyone who wants to
>look at it and improve FreeBSD's handling of this (I
>already offered this a few months ago, but got no reply).
>On the other hand, this particular one is probably too
>broken to be even useful for this kind of stuff.
>
>Anyhow, that's my story about ECC memory.
>
>Regards
>   Oliver

I asked about this a while ago too. The real killer is just that it is
very, very hard to code up. :(

-Carroll Kong

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
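
To make the single/double/triple bit distinction from the message
concrete, here is a minimal sketch in Python. It is an illustration
under assumptions, not how any particular chipset implements ECC: it
uses a toy extended Hamming(8,4) code on 4 data bits, whereas real ECC
DIMMs typically protect 64 data bits with 8 check bits. All function
names (with_parity, parity_check, secded_encode, secded_decode, flip)
are made up for this example.

```python
from itertools import combinations


def with_parity(bits):
    """Append an even-parity bit to a list of 0/1 values (parity SIMM style)."""
    return list(bits) + [sum(bits) % 2]


def parity_check(word):
    """True if the word (data + parity bit) LOOKS clean to a parity checker."""
    return sum(word) % 2 == 0


def secded_encode(data):
    """Toy extended Hamming(8,4): 4 data bits -> 8-bit codeword.

    Index 0 holds the overall parity bit; indices 1..7 hold Hamming(7,4)
    with check bits at positions 1, 2 and 4 (an assumed textbook layout).
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2
    return c


def secded_decode(word):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall_odd = sum(c) % 2 == 1    # overall parity violated?
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flips: the decoder ASSUMES exactly one and repairs
        # it (syndrome 0 means the overall parity bit itself flipped).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: an even number of
        # flips, which can be detected but not repaired.
        return "uncorrectable", None
    return status, [c[3], c[5], c[6], c[7]]


def flip(word, *positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    cw = secded_encode(data)
    for n in (1, 2, 3):
        results = set()
        for pos in combinations(range(8), n):
            status, decoded = secded_decode(flip(cw, *pos))
            results.add((status, decoded == data))
        print(n, "flipped bit(s):", sorted(results))

    word = with_parity([1, 0, 1, 1, 0, 0, 1, 0])
    print("parity, 1 flip looks clean: ", parity_check(flip(word, 0)))
    print("parity, 2 flips look clean:", parity_check(flip(word, 0, 5)))
```

In this toy code every single-bit flip is repaired and every double-bit
flip is flagged as uncorrectable, but a triple-bit flip is reported as
"corrected" while handing back the wrong data. That is the silent
failure described in the message. Plain parity already gives up at two
flipped bits, because the word looks clean again.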