From owner-freebsd-stable Sun Jun 28 23:12:06 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8) id XAA13869 for freebsd-stable-outgoing; Sun, 28 Jun 1998 23:12:06 -0700 (PDT) (envelope-from owner-freebsd-stable@FreeBSD.ORG)
Received: from pop.uniserve.com (pop.uniserve.com [204.244.156.3]) by hub.freebsd.org (8.8.8/8.8.8) with SMTP id XAA13864 for ; Sun, 28 Jun 1998 23:12:05 -0700 (PDT) (envelope-from tom@uniserve.com)
Received: from shell.uniserve.ca [204.244.186.218] by pop.uniserve.com with smtp (Exim 1.82 #4) id 0yqXAU-00026X-00; Sun, 28 Jun 1998 23:11:58 -0700
Date: Sun, 28 Jun 1998 23:11:54 -0700 (PDT)
From: Tom
X-Sender: tom@shell.uniserve.ca
To: "Louis A. Mamakos"
cc: "Michael R. Gile" , freebsd-stable@FreeBSD.ORG
Subject: Re: determining ecc errors on freebsd-stable
In-Reply-To: <199806290549.BAA02456@whizzo.transsys.com>
Message-ID:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Mon, 29 Jun 1998, Louis A. Mamakos wrote:

> > On Sun, 28 Jun 1998, Michael R. Gile wrote:
> >
> > > > There is no way to log ECC corrections as they are done
> > > >transparently in the hardware, and currently there is no mechanism for the
> > > >hardware to make available that kind of info.
> > >
> > > there must be some status register that records these errors. Otherwise what
> > > good is ECC? If it doesn't tell you that something is wrong, then it is useless
> >
> > Either ECC fixes the error, or if the error is unfixable, the hardware
> > generates a NMI which will cause a panic and reboot.
> >
> > Basically, if a fixable error occurs, you won't know about it. If an
> > unfixable error occurs, you'll know real fast.
>
> Well, geez, it would be nice to know that you had bum memory in the
> machine so you could replace it at some time of your choosing. ECC
> memory ought to be better than just having your system crash later
> rather than sooner.

Well, you could trap the NMI and kill whatever occupied the offending
location, and make sure it wasn't used again. This is an operating
system issue, not a hardware one. An NMI panic is MUCH better than
"crashing later", as you know precisely what caused it. Memory corruption
on non-ECC/non-parity systems is very difficult to track down. Plus, you
could be corrupting valuable data in the process. With existing ECC
systems, at least you get a clean reboot before anything serious is
wrecked.

> This is the kind of thing that separates toy computers from robust,
> has to be up no matter what mission critical computers.

Yeah, yeah... Sun makes a big deal about this... fact of the matter is,
if you lose some memory containing the kernel you have to reboot anyhow.
If you don't want a toy computer, you get a cluster anyhow, since there is
way more stuff that can fail than memory (and more often too).

> louie

Tom

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-stable" in the body of the message
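
For what it's worth, the "trap the NMI, kill whatever occupied the
offending location, and make sure it isn't used again" idea can be
sketched as a toy user-space model. This is not FreeBSD code and not how
any real kernel does it; FrameTable, handle_uncorrectable, KERNEL and
PAGE_SIZE are all made-up names for illustration, and the "kill" is just
releasing the victim's frames. It also shows the caveat from the reply: if
the bad frame belongs to the kernel, there is nothing to do but panic.

```python
PAGE_SIZE = 4096   # assumed page size for the model
KERNEL = 0         # hypothetical owner id for kernel-held frames (pids start at 1)


class KernelPanic(Exception):
    """Raised when the faulty frame belongs to the kernel itself."""


class FrameTable:
    """Toy physical-frame table: who owns each frame, and which are retired."""

    def __init__(self, nframes):
        self.owner = [None] * nframes   # None = free, KERNEL, or a pid
        self.retired = set()            # frames never to be handed out again

    def alloc(self, owner):
        """Hand out the lowest free frame that has not been retired."""
        for frame, current in enumerate(self.owner):
            if current is None and frame not in self.retired:
                self.owner[frame] = owner
                return frame
        raise MemoryError("no usable frames left")

    def handle_uncorrectable(self, phys_addr):
        """Model of an NMI handler for an uncorrectable memory error.

        Returns the pid that was killed, or None if the frame was free.
        """
        frame = phys_addr // PAGE_SIZE
        victim = self.owner[frame]
        self.retired.add(frame)         # make sure it is not used again
        if victim == KERNEL:
            # Kernel memory is gone: rebooting is the only option.
            raise KernelPanic(f"uncorrectable error in kernel frame {frame}")
        if victim is None:
            return None
        # "Kill" the owner by releasing every frame it held.
        for f, current in enumerate(self.owner):
            if current == victim:
                self.owner[f] = None
        return victim
```

With four frames, one held by the kernel and two by user processes, an
error inside a user frame kills only that process, and the allocator then
skips the retired frame; an error in the kernel's frame raises KernelPanic
instead.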