Date:      Tue, 25 Sep 2001 09:58:22 -0700
From:      Peter Wemm <peter@wemm.org>
To:        Andrew Gallatin <gallatin@cs.duke.edu>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: ecc on i386 
Message-ID:  <20010925165822.157B03809@overcee.netplex.com.au>
In-Reply-To: <15280.36694.786500.622681@grasshopper.cs.duke.edu> 

Andrew Gallatin wrote:
> 
> Peter Wemm writes:
> 
> 
> Thanks for your description of how ECC is reported on PCs.  That was
> very, very helpful.
> 
>  > The Tyan Thunder 2510 BIOS even disables ECC -> NMI routing so you have to
>  > go to quite a bit of trouble to reprogram the ServerWorks chipset to
>  > actually generate NMIs so that you can find out if something got trashed.
> 
> Is that the He-Sl or the LE-3 chipset?  Is that code available?
> I have some LE-3 based boxes which I'd like to be certain DTRT.

LE-3 is the one we've been using, and it is what I've tested my hackery
on.  The main problem is that the code currently uses magic bit arithmetic
rather than defined values.  I'll clean it up and get it out.
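To illustrate the kind of cleanup meant by "defined values" rather than
magic bit arithmetic, here is a minimal sketch.  Everything in it is
hypothetical: the SKETCH_* names, the bit positions and the register layout
are invented for illustration and are not taken from any ServerWorks or
Intel datasheet, nor from the actual driver code.

```c
#include <stdint.h>

/*
 * HYPOTHETICAL error-command register layout, for illustration only.
 * These bit positions are assumptions, not real chipset definitions.
 */
#define SKETCH_ERRCMD_ECC_ENABLE  (1u << 0)  /* enable ECC checking */
#define SKETCH_ERRCMD_NMI_ON_MBE  (1u << 2)  /* NMI on multi-bit error */
#define SKETCH_ERRCMD_NMI_ON_SBE  (1u << 3)  /* NMI on single-bit error */

/* "Magic" style: correct, but the reader has to decode 0x08 and 0x05. */
static uint8_t
sketch_errcmd_magic(uint8_t reg)
{
	return (uint8_t)((reg & ~0x08) | 0x05);
}

/* Same operation written with named bits. */
static uint8_t
sketch_errcmd_named(uint8_t reg)
{
	/* Do not raise an NMI for corrected single-bit errors. */
	reg &= (uint8_t)~SKETCH_ERRCMD_NMI_ON_SBE;
	/* Turn on ECC and route uncorrectable errors to NMI. */
	reg |= (uint8_t)(SKETCH_ERRCMD_ECC_ENABLE | SKETCH_ERRCMD_NMI_ON_MBE);
	return reg;
}
```

Both functions compute the same value for every input; the second one just
documents which bits are being touched and why.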

I am pretty sure it will work for all the ServerWorks chips, since the docs
for the various chips all describe the same interface.  Similarly, the
Intel 440BX/GX use a common interface, and I suspect the later Intel
chipsets will as well.  We have ECC/NMI drivers for at least the BX/GX too.

> Unlike my wife's Dual Athlon, these boxes have nothing in their
> BIOS pertaining to ECC error reporting. (Supermicro 370-DLE)

The ServerWorks manuals say that ECC *must* be turned on.  However,
whether the BIOSes actually do this is another matter.

>  > Our NMI / ECC handling really really sucks in FreeBSD. Consider:
>  > - i686_pagezero - reads before writing in order to minimize cache snooping
>  > traffic in SMP systems.  However, if it gets an NMI while trying to check
>  > if the cache line is already zero, it will take the entire machine down
>  > instead of just zeroing the line.
>  > - NFS / VM / bio:  when they get an NMI while trying to copy data that is
>  > clean and backed by storage, they take the machine down instead of trying
>  > to recover and re-read the page.
>  > - userland: if userland gets an NMI, the machine dies instead of killing
>  > the process (or rereading a text page etc. if possible)
>  > - our NMI handlers are a festering pile of excrement.  They don't have
>  > the code to 'ack' the NMI so it isn't possible to return after recovery.
>  > - and so on.
> 
> Well, at least we take the machine down, which is a heck of a lot
> better than ignoring the problem, which is really all that I was
> hoping for. 

I'll email you some code and start doing some cleanup.
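For anyone following along, the i686_pagezero point quoted above can be
sketched in plain C.  This is a simplified user-space illustration of the
read-before-write idea, not the kernel routine itself; the function name
and the 4096-byte page size are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for this sketch */

/*
 * Zero a page, but only write the words that are not already zero.
 * Skipping the store avoids dirtying cache lines that need no change.
 * Returns the number of words actually written.
 */
static size_t
sketch_pagezero(void *page)
{
	volatile uintptr_t *p = page;
	size_t nwords = SKETCH_PAGE_SIZE / sizeof(*p);
	size_t written = 0;

	/* The caller must hand us word-aligned memory. */
	assert(((uintptr_t)page % sizeof(*p)) == 0);

	for (size_t i = 0; i < nwords; i++) {
		/*
		 * This load is where an uncorrectable ECC error in the old
		 * contents would surface, even though the old contents are
		 * about to be discarded anyway.
		 */
		if (p[i] != 0) {
			p[i] = 0;
			written++;
		}
	}
	return written;
}
```

The load in the loop is the interesting part: the old data is worthless,
so an error reported while reading it could in principle be handled by
simply storing zero over the line rather than taking the machine down.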

> Thanks again,
> 
> Drew

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5





