From owner-freebsd-hackers  Tue Sep 25  7: 7: 3 2001
Delivered-To: freebsd-hackers@freebsd.org
Received: from duke.cs.duke.edu (duke.cs.duke.edu [152.3.140.1])
	by hub.freebsd.org (Postfix) with ESMTP id 6DC1D37B40E
	for <freebsd-hackers@FreeBSD.ORG>; Tue, 25 Sep 2001 07:06:51 -0700 (PDT)
Received: from grasshopper.cs.duke.edu (grasshopper.cs.duke.edu [152.3.145.30])
	by duke.cs.duke.edu (8.9.3/8.9.3) with ESMTP id KAA03605;
	Tue, 25 Sep 2001 10:06:41 -0400 (EDT)
Received: (from gallatin@localhost)
	by grasshopper.cs.duke.edu (8.11.3/8.9.1) id f8PE6EN72757;
	Tue, 25 Sep 2001 10:06:14 -0400 (EDT)
	(envelope-from gallatin@cs.duke.edu)
From: Andrew Gallatin <gallatin@cs.duke.edu>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15280.36694.786500.622681@grasshopper.cs.duke.edu>
Date: Tue, 25 Sep 2001 10:06:14 -0400 (EDT)
To: Peter Wemm <peter@wemm.org>
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: ecc on i386 
In-Reply-To: <20010925012041.CC9613808@overcee.netplex.com.au>
References: <15279.54029.454089.299807@grasshopper.cs.duke.edu>
	<20010925012041.CC9613808@overcee.netplex.com.au>
X-Mailer: VM 6.75 under 21.1 (patch 12) "Channel Islands" XEmacs Lucid
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-hackers.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-hackers>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-hackers>
X-Loop: FreeBSD.ORG


Peter Wemm writes:


Thanks for your description of how ECC is reported on PCs.  That was
very, very helpful.

 > The Tyan Thunder 2510 BIOS even disables ECC -> NMI routing so you have to
 > go to quite a bit of trouble to reprogram the serverworks chipset to
 > actually generate NMI's so that you can find out if something got trashed.

Is that the HE-SL or the LE-3 chipset?  Is that code available?
I have some LE-3 based boxes which I'd like to be certain DTRT.

Unlike my wife's Dual Athlon, these boxes have nothing in their
BIOS pertaining to ECC error reporting. (Supermicro 370-DLE)

 > Our NMI / ECC handling really really sucks in FreeBSD. Consider:
 > - i686_pagezero - reads before writing in order to minimize cache snooping
 > traffic in SMP systems.  However, if it gets an NMI while trying to check
 > if the cache line is already zero, it will take the entire machine down
 > instead of just zeroing the line.
 > - NFS / VM / bio:  when they get an NMI while trying to copy data that is
 > clean and backed by storage, they take the machine down instead of trying
 > to recover and re-read the page.
 > - userland.. If userland gets an NMI, the machine dies instead of killing
 > the process (or rereading a text page etc if possible)
 > - our NMI handlers are a festering pile of excrement.  They don't have
 > the code to 'ack' the NMI so it isn't possible to return after recovery.
 > - and so on.
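
For anyone following along, the read-before-write idea behind
i686_pagezero can be sketched in plain C.  This is only a userland
illustration, not the kernel routine: the name pagezero_read_first,
the fixed PAGE_SIZE of 4096, and the returned store count are all
assumptions made up for the example.  The point is that every word is
loaded before any store happens, so the load is where an uncorrectable
ECC error would surface, even though the routine is about to overwrite
the data anyway.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u	/* assumed i386 page size */

/*
 * Hypothetical userland analogue of read-before-write page zeroing.
 * Each word is read first and only stored to if it is non-zero, so
 * lines that are already zero are never dirtied.  Returns the number
 * of words that actually had to be written.
 */
static size_t
pagezero_read_first(void *page)
{
	uintptr_t *p = page;
	size_t nwords = PAGE_SIZE / sizeof(*p);
	size_t written = 0;
	size_t i;

	for (i = 0; i < nwords; i++) {
		/* This load is where a bad line would be noticed. */
		if (p[i] != 0) {
			p[i] = 0;
			written++;
		}
	}
	return (written);
}
```

Called on a page that is already zero, this performs no stores at
all, which is the cache-traffic win described above.  A recovering
handler could in principle fall back to an unconditional store for the
affected line, but that is speculation on top of the description
quoted here.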

Well, at least we take the machine down, which is a heck of a lot
better than ignoring the problem, which is really all that I was
hoping for. 

Thanks again,

Drew
