From: Andrew Gallatin
Date: Tue, 25 Sep 2001 10:06:14 -0400 (EDT)
To: Peter Wemm
Cc: freebsd-hackers@FreeBSD.ORG
Subject: Re: ecc on i386

Peter Wemm writes:

Thanks for your description of how ECC is reported on PCs. That was
very, very helpful.

> The Tyan Thunder 2510 BIOS even disables ECC -> NMI routing so you have to
> go to quite a bit of trouble to reprogram the serverworks chipset to
> actually generate NMI's so that you can find out if something got trashed.

Is that the He-Sl or the LE-3 chipset? Is that code available? I have
some LE-3 based boxes which I'd like to be certain DTRT. Unlike my wife's
Dual Athlon, these boxes have nothing in their BIOS pertaining to ECC
error reporting. (Supermicro 370-DLE)

> Our NMI / ECC handling really really sucks in FreeBSD.
> Consider:
> - i686_pagezero - reads before writing in order to minimize cache snooping
>   traffic in SMP systems. However, if it gets an NMI while trying to check
>   if the cache line is already zero, it will take the entire machine down
>   instead of just zeroing the line.
> - NFS / VM / bio: when they get an NMI while trying to copy data that is
>   clean and backed by storage, they take the machine down instead of trying
>   to recover and re-read the page.
> - userland.. If userland gets an NMI, the machine dies instead of killing
>   the process (or rereading a text page etc if possible)
> - our NMI handlers are a festering pile of excrement. They don't have
>   the code to 'ack' the NMI so it isn't possible to return after recovery.
> - and so on.

Well, at least we take the machine down, which is a heck of a lot
better than ignoring the problem, which is really all that I was
hoping for.

Thanks again,

Drew
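
For readers unfamiliar with the read-before-write zeroing mentioned in the quoted i686_pagezero item, here is a minimal C sketch of the general idea. It is not the FreeBSD routine: the name pagezero_read_first, the PAGE_BYTES constant, and the word-at-a-time granularity are assumptions made for illustration. The point it shows is that every word is loaded before any store is issued, so a line with an uncorrectable ECC error is touched by a read even though the caller only wants it overwritten.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u	/* assumed i386 page size */

/*
 * Hypothetical illustration of read-before-write page zeroing.
 * Each word is loaded first and stored to only if it is non-zero,
 * so already-zero memory is never written and generates no
 * invalidation traffic on an SMP system.
 *
 * Returns the number of words that actually had to be written.
 */
size_t
pagezero_read_first(void *page)
{
	volatile uintptr_t *w = page;
	size_t nwords = PAGE_BYTES / sizeof(uintptr_t);
	size_t stores = 0;

	/* The buffer must be word aligned for the word-sized accesses. */
	assert(((uintptr_t)page % sizeof(uintptr_t)) == 0);

	for (size_t i = 0; i < nwords; i++) {
		/*
		 * This load is the step the quoted text worries about:
		 * a read of bad memory can raise the NMI even though
		 * the data is about to be discarded.
		 */
		if (w[i] != 0) {
			w[i] = 0;
			stores++;
		}
	}
	return stores;
}
```

A plain unconditional store loop would never read the old contents, which is why the quoted text suggests that recovery in this path could amount to just zeroing the line.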