From owner-freebsd-hackers  Mon Sep 24 18:23:57 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from peter3.wemm.org (c1315225-a.plstn1.sfba.home.com [24.14.150.180])
	by hub.freebsd.org (Postfix) with ESMTP id 4E99537B491
	for <freebsd-hackers@FreeBSD.ORG>; Mon, 24 Sep 2001 18:23:41 -0700 (PDT)
Received: from overcee.netplex.com.au (overcee.wemm.org [10.0.0.3])
	by peter3.wemm.org (8.11.0/8.11.0) with ESMTP id f8P1NfM20153
	for <freebsd-hackers@FreeBSD.ORG>; Mon, 24 Sep 2001 18:23:41 -0700 (PDT)
	(envelope-from peter@wemm.org)
Received: from wemm.org (localhost [127.0.0.1])
	by overcee.netplex.com.au (Postfix) with ESMTP
	id E7EF63808; Mon, 24 Sep 2001 18:23:40 -0700 (PDT)
	(envelope-from peter@wemm.org)
X-Mailer: exmh version 2.3.1 01/18/2001 with nmh-1.0.4
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: Matt Dillon <dillon@earth.backplane.com>,
	freebsd-hackers@FreeBSD.ORG
Subject: Re: ecc on i386 
In-Reply-To: <15279.55878.110154.650940@grasshopper.cs.duke.edu> 
Date: Mon, 24 Sep 2001 18:23:40 -0700
From: Peter Wemm <peter@wemm.org>
Message-Id: <20010925012340.E7EF63808@overcee.netplex.com.au>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
X-Loop: FreeBSD.ORG

Andrew Gallatin wrote:
> 
> Matt Dillon writes:
>  > 
>  > :What happens on an ECC equipped PC when you have a multi-bit memory
>  > :error that hardware scrubbing can't fix?  Will there be some sort of
>  > :NMI or something that will panic the box?
>  > :
>  > :I'm used to alphas (where you'll get a fatal machine check panic) and
>  > :I am just wondering if PCs are as safe.
>  > :
>  > :Thanks,
>  > :
>  > :Drew
>  > 
>  >     ECC can typically detect and correct single bit errors and detect
>  >     double bit errors.  Anything beyond that is problematic... it may or
>  >     may not detect the problem or may mis-correct a multi-bit error. 
>  >     An NMI is generated if an uncorrectable error is detected.
>  > 
>  >     On PC's, ECC is optional.  Desktops typically do not ship with ECC
>  >     memory.  Branded servers typically do.    A year or two ago I would
>  >     have been happy to use non-ECC rams (finding bad RAM through trial
>  >     and error), but now with capacities as they are and memory prices down
>  >     ECC is definitely the way to go.
> 
> My sentiments exactly.
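
Matt's description of ECC above (correct single-bit errors, detect double-bit
errors, possibly mis-correct anything worse) can be illustrated with a toy
SECDED code. This is only a sketch under stated assumptions: it uses an
extended Hamming(8,4) code over 4 data bits, whereas real ECC memory uses much
wider codes over 64 data bits, and the names secded_encode and secded_decode
are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum secded_status { SECDED_OK, SECDED_CORRECTED, SECDED_UNCORRECTABLE };

/* Return 1 if v has an odd number of set bits, else 0. */
static uint8_t parity8(uint8_t v)
{
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return v & 1;
}

/*
 * Encode 4 data bits into an 8-bit codeword.
 * Bit positions 1, 2 and 4 hold Hamming parity bits, positions 3, 5, 6
 * and 7 hold the data bits, and bit 0 holds an overall parity bit.
 */
uint8_t secded_encode(uint8_t nibble)
{
    assert(nibble < 16);
    uint8_t d1 = nibble & 1, d2 = (nibble >> 1) & 1;
    uint8_t d3 = (nibble >> 2) & 1, d4 = (nibble >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers positions 1, 3, 5, 7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers positions 2, 3, 6, 7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* covers positions 4, 5, 6, 7 */
    uint8_t cw = (uint8_t)(p1 << 1 | p2 << 2 | d1 << 3 | p3 << 4 |
                           d2 << 5 | d3 << 6 | d4 << 7);
    return (uint8_t)(cw | parity8(cw));   /* make overall parity even */
}

/*
 * Decode a codeword.  On SECDED_OK or SECDED_CORRECTED the data bits are
 * stored in *nibble; on SECDED_UNCORRECTABLE *nibble is left untouched.
 */
enum secded_status secded_decode(uint8_t cw, uint8_t *nibble)
{
    unsigned syndrome = 0;
    enum secded_status st = SECDED_OK;

    /* The XOR of the positions of all set bits is 0 for a valid codeword. */
    for (unsigned pos = 1; pos < 8; pos++)
        if (cw & (1u << pos))
            syndrome ^= pos;

    if (parity8(cw)) {
        /* Odd number of flips: assume exactly one and repair it.
         * A syndrome of 0 means the overall parity bit itself flipped. */
        cw ^= (uint8_t)(1u << syndrome);
        st = SECDED_CORRECTED;
    } else if (syndrome != 0) {
        /* Even number of flips but a nonzero syndrome: double-bit error. */
        return SECDED_UNCORRECTABLE;
    }

    *nibble = (uint8_t)(((cw >> 3) & 1) | ((cw >> 5) & 1) << 1 |
                        ((cw >> 6) & 1) << 2 | ((cw >> 7) & 1) << 3);
    return st;
}
```

With this toy code every single-bit flip is repaired and every double-bit flip
is reported as uncorrectable.  Three flipped bits look like a single-bit error
to the decoder, so it "corrects" the word into different data, which is the
mis-correction case Matt mentions.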

I wrote a poller for picking up correction events on various ServerWorks
motherboards (Compaq, Tyan) and it was *scary* how often single-bit errors
were being corrected.
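
The general shape of such a poller might look like the sketch below.  It is
not the poller described above: the status bits, the struct names and the
callback interface are all hypothetical, and the register access is replaced
by a simulated variable, because the real ServerWorks register layout is not
given in this thread.

```c
#include <stdint.h>

/* Hypothetical status bits; a real chipset defines its own layout. */
#define ECC_STS_CORRECTED      0x1u
#define ECC_STS_UNCORRECTABLE  0x2u

struct ecc_counts {
    unsigned long corrected;
    unsigned long uncorrectable;
};

/* Callbacks standing in for chipset register access (assumed interface). */
struct ecc_regs {
    uint32_t (*read_status)(void *ctx);
    void (*clear_status)(void *ctx, uint32_t bits);
    void *ctx;
};

/*
 * One polling pass: read the status, tally what was seen, then clear
 * those bits so the next event can be latched.  Returns the bits seen.
 */
uint32_t ecc_poll_once(const struct ecc_regs *regs, struct ecc_counts *counts)
{
    uint32_t seen = regs->read_status(regs->ctx) &
                    (ECC_STS_CORRECTED | ECC_STS_UNCORRECTABLE);

    if (seen & ECC_STS_CORRECTED)
        counts->corrected++;
    if (seen & ECC_STS_UNCORRECTABLE)
        counts->uncorrectable++;
    if (seen != 0)
        regs->clear_status(regs->ctx, seen);
    return seen;
}

/* Simulated register backend for trying the loop without hardware. */
uint32_t sim_read_status(void *ctx)
{
    return *(uint32_t *)ctx;
}

void sim_clear_status(void *ctx, uint32_t bits)
{
    *(uint32_t *)ctx &= ~bits;
}
```

Because a latched status bit can only say "at least one event since the last
clear", counts gathered this way are a lower bound; polling more often
tightens it.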

>  >     Bit errors can come from many sources, memory being only one.  Bit
>  >     errors can occur inside the cpu chip, in the L1 and L2 caches, in
>  >     memory, in controller chips... all over the place.  Many modern
>  >     processors implement parity on their caches to try to cover the
>  >     problem areas.  I'm not sure how Pentium III's and IV's are setup.
>  > 
>  > 						-Matt
> 
> Hmm.. Well, it turns out that the box I'm interested in (Thunder K7)
> can be set to send an SERR on multiple bit errors.  I wonder what
> happens when a pc gets an SERR? (that's another machine check
> on alpha)

On the Thunder K7, #SERR is routed to NMI.  Trust me, you want this.
Also set the ECC mode to ECC-SCRUB instead of "off", which is the current
default.

See my other email about how #SERR is converted to NMI via the ISA part of
the south bridge.
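
For readers without that email to hand, here is a rough sketch of how an NMI
handler can tell these sources apart.  It assumes the conventional PC/AT
layout of I/O port 0x61, where bit 7 reports a memory parity or #SERR
condition and bit 6 reports an I/O channel check.  The constant and function
names are invented for this example, the port value is passed in as a
parameter rather than read with a privileged inb, and the Thunder K7
specifics are not covered here.

```c
#include <stdint.h>

/* Status bits in port 0x61, assuming the conventional PC/AT layout. */
#define NMI_STS_SERR   0x80u   /* bit 7: memory parity error / #SERR */
#define NMI_STS_IOCHK  0x40u   /* bit 6: I/O channel check */

enum nmi_source { NMI_SRC_UNKNOWN, NMI_SRC_SERR, NMI_SRC_IOCHK };

/* Classify an NMI from the value read out of port 0x61. */
enum nmi_source nmi_classify(uint8_t port61)
{
    if (port61 & NMI_STS_SERR)
        return NMI_SRC_SERR;
    if (port61 & NMI_STS_IOCHK)
        return NMI_SRC_IOCHK;
    return NMI_SRC_UNKNOWN;
}

/* Human-readable description, e.g. for a panic message. */
const char *nmi_source_name(enum nmi_source src)
{
    switch (src) {
    case NMI_SRC_SERR:
        return "memory parity error or #SERR";
    case NMI_SRC_IOCHK:
        return "I/O channel check";
    default:
        return "unknown NMI source";
    }
}
```

A kernel that sees NMI_SRC_SERR here has little choice but to panic, which is
the behaviour Drew was asking about.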

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5

