From owner-freebsd-stable  Sun Jun 28 23:12:06 1998
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id XAA13869
          for freebsd-stable-outgoing; Sun, 28 Jun 1998 23:12:06 -0700 (PDT)
          (envelope-from owner-freebsd-stable@FreeBSD.ORG)
Received: from pop.uniserve.com (pop.uniserve.com [204.244.156.3])
          by hub.freebsd.org (8.8.8/8.8.8) with SMTP id XAA13864
          for <freebsd-stable@freebsd.org>; Sun, 28 Jun 1998 23:12:05 -0700 (PDT)
          (envelope-from tom@uniserve.com)
Received: from shell.uniserve.ca [204.244.186.218] 
	by pop.uniserve.com with smtp (Exim 1.82 #4)
	id 0yqXAU-00026X-00; Sun, 28 Jun 1998 23:11:58 -0700
Date: Sun, 28 Jun 1998 23:11:54 -0700 (PDT)
From: Tom <tom@uniserve.com>
X-Sender: tom@shell.uniserve.ca
To: "Louis A. Mamakos" <louie@TransSys.COM>
cc: "Michael R. Gile" <gilem@wsg.net>, freebsd-stable@FreeBSD.ORG
Subject: Re: determining ecc errors on freebsd-stable 
In-Reply-To: <199806290549.BAA02456@whizzo.transsys.com>
Message-ID: <Pine.BSF.3.96.980628230424.23093A-100000@shell.uniserve.ca>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG


On Mon, 29 Jun 1998, Louis A. Mamakos wrote:

> > On Sun, 28 Jun 1998, Michael R. Gile wrote:
> > 
> > > >  There is no way to log ECC corrections as they are done
> > > >transparently in the hardware, and currently there is no mechanism for the
> > > >hardware to make available that kind of info.
> > > 
> > > there must be some status register that records these errors.  Otherwise what 
> > > good is ECC?  If it doesn't tell you that something is wrong, then it is useless 
> > 
> >   Either ECC fixes the error, or if the error is unfixable, the hardware
> > generates a NMI which will cause a panic and reboot.
> > 
> >   Basically, if a fixable error occurs, you won't know about it.  If an
> > unfixable error occurs, you'll know real fast.
> 
> Well, geez, it would be nice to know that you had bum memory in the
> machine so you could replace it at some time of your choosing.  ECC 
> memory ought to be better than just having your system crash later
> rather than sooner.

  Well, you could trap the NMI and kill whatever occupied the offending
location, and make sure it wasn't used again.  This is an operating
system issue, not a hardware one.
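
  A rough sketch of that policy, as a toy model in Python rather than
kernel code.  Everything here is hypothetical: the MemoryMap class, the
KERNEL owner id and the 4096-byte page size are made up for illustration,
and it assumes the hardware can report the physical address of the
failure, which may not be true of any given chipset.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

PAGE_SIZE = 4096  # assumed page size; hypothetical for this sketch
KERNEL = 0        # hypothetical owner id standing in for kernel memory


@dataclass
class MemoryMap:
    """Toy model of physical memory: which owner holds which page frame."""
    owners: Dict[int, int] = field(default_factory=dict)  # frame -> owner id
    retired: Set[int] = field(default_factory=set)        # frames never reused

    def allocate(self, frame: int, owner: int) -> bool:
        """Give a frame to an owner unless it is retired or already in use."""
        if frame in self.retired or frame in self.owners:
            return False
        self.owners[frame] = owner
        return True

    def handle_uncorrectable_error(self, phys_addr: int) -> str:
        """Apply the policy for an uncorrectable error at phys_addr."""
        frame = phys_addr // PAGE_SIZE
        owner: Optional[int] = self.owners.get(frame)
        self.retired.add(frame)  # never hand this frame out again
        if owner is None:
            return "retired unused frame"
        if owner == KERNEL:
            return "panic"  # kernel memory lost: a reboot is needed anyway
        # Kill the victim: release every frame it held.
        self.owners = {f: o for f, o in self.owners.items() if o != owner}
        return f"killed process {owner}"
```

  The point of the sketch is only that the decision (kill, retire, or
panic) is a software policy layered on top of whatever the hardware
reports.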

  An NMI panic is MUCH better than "crashing later", as you know precisely
what caused it.  Memory corruption on non-ECC/non-parity systems is very
difficult to track.  Plus, you could be corrupting valuable data in the
process.  With existing ECC systems, at least you get a clean reboot
before anything serious is wrecked.

> This is the kind of thing that separates toy computers from robust, 
> has to be up no matter what mission critical computers.  

  Yeah, yeah... Sun makes a big deal about this... fact of the matter is,
if you lose some memory containing the kernel you have to reboot anyhow.

  If you don't want a toy computer, you get a cluster anyhow, since there
is way more stuff that can fail than memory (and more often too).

> louie

Tom

