Date: Tue, 28 Aug 2001 12:11:02 +0200 (CEST)
From: Oliver Fromme <olli@secnetix.de>
To: freebsd-stable@FreeBSD.ORG
Subject: Re: 4.4-rc instability
Message-ID: <200108281011.MAA27326@lurza.secnetix.de>
In-Reply-To: <20010828111949.B95570@freebie.xs4all.nl>

Wilko Bulte <wkb@freebie.xs4all.nl> wrote:
 > On Tue, Aug 28, 2001 at 11:00:37AM +0200, Oliver Fromme wrote:
 > > Unfortunately, FreeBSD does not support ECC RAM (or did
 > > that change recently?).
 >
 > ? ECC is a hardware function. All of my Alpha machines have ECC memory
 > (for example).

Sure, I once had a PC with ECC memory, too. When that ECC stick started dying, at first I didn't notice at all, because the chipset (i.e. the memory controller in the northbridge, I think) corrected the errors silently. When the errors grew so numerous that they were no longer ECC-correctable, processes started dying with SIGSEGV, and it got worse at a fast pace. Soon I couldn't even boot into single-user mode anymore, because /bin/sh segfaulted instantly.

At first I thought the processor had gone bad, or maybe the mainboard itself (a Gigabyte dual P2 board, Intel 440BX chipset). I believed in ECC, so I did not suspect the RAM at that time. I had seen bad ECC memory in Sun SPARC workstations running Solaris at the university, which started logging "bad memory page, ECC error" or similar in the syslog, and automatically disabled that particular page once a certain number of errors had occurred on it. That was a very cool feature, I thought.

Finally I ripped my DIMM out and put it into a different board (an MSI Athlon board with an AMD chipset, i.e. different design, different processor, different BIOS). Guess what? It failed in the same ways. So it was indeed the fault of the ECC memory. I took it to a computer shop where a hardware memory tester was available, which confirmed that this DIMM had gone foobar.

Since then, I have never bought expensive ECC memory again, but have instead preferred well-known brands (such as Infineon). They're less expensive, and I've had no memory problems since.

So, the bottom line is, ECC memory is good as long as there are few enough errors that they can be corrected by the chipset. If there are more of them, you're doomed just as if you had no ECC in the first place. At least that's my experience with i386 P2/Athlon mainboards. Alpha might be a different story.

Frankly, I expected the machine to halt or freeze with something like an NMI or "parity check error", like the old PCs with parity SIMMs did. That would have been better than just randomly dying.

Even better would be if the operating system recognized the correctable errors and logged them somewhere, and (_even_ better!) offered the possibility to disable memory pages with known errors. Tru64 on Alpha supports exactly this. Solaris on SPARC does, too. FreeBSD does not.

That's what I meant when I wrote that FreeBSD does not support ECC RAM. (I'm sorry, I should have been more specific about this. Please excuse me.)

I think I still have that broken DIMM somewhere in a drawer, and I'm willing to send it to anyone who wants to look at it and improve FreeBSD's handling of this (I already offered this a few months ago, but got no reply). On the other hand, this particular one is probably too broken to be of any use for this kind of work.

Anyhow, that's my story about ECC memory.

Regards
   Oliver

-- 
Oliver Fromme, secnetix GmbH & Co KG, Oettingenstr. 2, 80538 München
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"All that we see or seem is just a dream within a dream" (E. A. Poe)

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message