Date:      Mon, 24 Sep 2001 18:20:41 -0700
From:      Peter Wemm <peter@wemm.org>
To:        Andrew Gallatin <gallatin@cs.duke.edu>
Cc:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: ecc on i386 
Message-ID:  <20010925012041.CC9613808@overcee.netplex.com.au>
In-Reply-To: <15279.54029.454089.299807@grasshopper.cs.duke.edu> 

Andrew Gallatin wrote:
> 
> What happens on an ECC equipped PC when you have a multi-bit memory
> error that hardware scrubbing can't fix?  Will there be some sort of
> NMI or something that will panic the box?
> 
> I'm used to alphas (where you'll get a fatal machine check panic) and
> I am just wondering if PCs are as safe.

Basically it depends on how the BIOS has programmed the chipset and how
the motherboard is wired.

The usual way goes something like this:

There are two PCI signals: #PERR (parity error) and #SERR (system error).

Various devices can be programmed to assert these under various conditions.

Things like bus master FIFO underflows will be programmed to assert #PERR
and are generally not fatal.

The memory controller is usually programmed to assert #SERR on a multi-bit
error and either #SERR or some other signal (a GPIO or something like
#SALERT on a ServerWorks chip) for a single-bit (corrected) error.

The south bridge listens to #SERR and #PERR and can convert those into NMI
events.  Usually #SERR shows up as "parity error" and #PERR shows up as
"IOCHK" (if it is enabled).

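As a rough sketch of how a handler might tell the two sources apart: on
conventional PC/AT-compatible hardware the NMI status byte is read from I/O
port 0x61, with bit 7 reporting the parity/#SERR source and bit 6 reporting
IOCHK.  The bit positions are that conventional layout, not something taken
from a particular chipset datasheet, and the names below (nmi_cause,
decode_nmi_status) are made up for illustration.  The function only decodes
a byte that the caller has already read; it does no port I/O itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Conventional PC/AT layout of the port 0x61 status byte (assumption:
 * the chipset follows it; check the south bridge datasheet). */
#define NMI_STS_SERR_PARITY 0x80u /* bit 7: #SERR / "parity error" source */
#define NMI_STS_IOCHK       0x40u /* bit 6: "IOCHK" source */

/* Hypothetical result type: which NMI sources are currently flagged. */
struct nmi_cause {
    bool serr_parity;
    bool iochk;
};

/* Decode a status byte previously read from port 0x61. */
static struct nmi_cause decode_nmi_status(uint8_t port61)
{
    struct nmi_cause c;

    c.serr_parity = (port61 & NMI_STS_SERR_PARITY) != 0;
    c.iochk = (port61 & NMI_STS_IOCHK) != 0;
    return c;
}
```

A kernel would obtain the byte with something like inb(0x61) before calling
this; that part is left out because it cannot run outside ring 0.
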
The bad news is that many BIOS manufacturers **TURN OFF** ECC functionality
in order to speed things up.  The reason for this is that with ECC off, the
CPU can read/write down to byte granularity.  With ECC on, memory is
rigidly enforced as 64-bit quantities (ECC-encoded out to 72 bits).  If the
CPU reads a byte, the memory controller actually fetches all 64 (72) bits.
If the CPU writes a byte, the memory controller has to do a
read-merge-write cycle where it reads the 64-bit value, merges in the 1-byte
write and writes out the entire 64-bit value again.  This (naturally)
shows up as poor benchmark results, so they like to turn it off by default
in order to get a speed edge.  Tyan is a notable example here (e.g. the
Thunder K7, the dual-Athlon DDR-SDRAM board, has ECC turned off by
default(!!)).  I am sure that others do it too.
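
To make the cost of the byte write concrete, here is a small software model
of the read-merge-write cycle.  It is only an illustration: the function
names are invented, the "memory word" is a plain uint64_t, and the 8 check
bits a real controller would recompute on every write are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Merge one byte into a 64-bit word at byte_index (0 = least significant).
 * Hypothetical helper, not part of any real controller interface. */
static uint64_t merge_byte(uint64_t old_word, unsigned byte_index,
    uint8_t value)
{
    unsigned shift;
    uint64_t mask;

    assert(byte_index < 8);
    shift = byte_index * 8;
    mask = (uint64_t)0xff << shift;
    return (old_word & ~mask) | ((uint64_t)value << shift);
}

/* Model of a 1-byte write with ECC on: three steps instead of one. */
static void ecc_write_byte(uint64_t *mem_word, unsigned byte_index,
    uint8_t value)
{
    uint64_t w;

    w = *mem_word;                        /* read the full 64-bit word */
    w = merge_byte(w, byte_index, value); /* merge in the single byte */
    *mem_word = w;                        /* write all 64 bits back */
}
```

With ECC off the same store is a single byte-wide write, which is where the
benchmark difference comes from.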

The Tyan Thunder 2510 BIOS even disables ECC -> NMI routing, so you have to
go to quite a bit of trouble to reprogram the ServerWorks chipset to
actually generate NMIs so that you can find out if something got trashed.

Our NMI / ECC handling really really sucks in FreeBSD. Consider:
- i686_pagezero - reads before writing in order to minimize cache snooping
traffic in SMP systems.  However, if it gets an NMI while trying to check
if the cache line is already zero, it will take the entire machine down
instead of just zeroing the line.
- NFS / VM / bio:  when they get an NMI while trying to copy data that is
clean and backed by storage, they take the machine down instead of trying
to recover and re-read the page.
- userland: if userland gets an NMI, the machine dies instead of killing
the process (or rereading a text page etc. if possible).
- our NMI handlers are a festering pile of excrement.  They don't have
the code to 'ack' the NMI, so it isn't possible to return after recovery.
- and so on.
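
The recovery choices listed above could be written down as a small decision
function.  This is a sketch of the policy the list asks for, not of anything
FreeBSD does today; every name in it (nmi_context, nmi_action,
choose_nmi_action) is hypothetical, and it assumes the handler can already
tell where the fault happened and whether the page is clean and backed by
storage.

```c
#include <stdbool.h>

/* Where the CPU was when the memory-error NMI arrived (hypothetical). */
enum nmi_context {
    CTX_PAGEZERO_CHECK, /* i686_pagezero reading a line to test for zero */
    CTX_KERNEL_COPY,    /* NFS / VM / bio copying page data */
    CTX_USERLAND,       /* fault hit while running a user process */
    CTX_OTHER_KERNEL    /* anything else in the kernel */
};

/* What the handler would like to do instead of always panicking. */
enum nmi_action {
    ACT_ZERO_LINE,    /* just overwrite the line with zeroes */
    ACT_REREAD_PAGE,  /* discard the page and re-read it from storage */
    ACT_KILL_PROCESS, /* kill only the affected process */
    ACT_PANIC         /* no safe recovery: take the machine down */
};

static enum nmi_action choose_nmi_action(enum nmi_context ctx,
    bool clean_and_backed)
{
    switch (ctx) {
    case CTX_PAGEZERO_CHECK:
        /* The old contents are about to be discarded anyway. */
        return ACT_ZERO_LINE;
    case CTX_KERNEL_COPY:
        return clean_and_backed ? ACT_REREAD_PAGE : ACT_PANIC;
    case CTX_USERLAND:
        /* e.g. a text page can be re-read; dirty data cannot. */
        return clean_and_backed ? ACT_REREAD_PAGE : ACT_KILL_PROCESS;
    default:
        return ACT_PANIC;
    }
}
```

None of this helps until the handler can also 'ack' the NMI and return,
which is the missing piece called out in the list.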

Cheers,
-Peter
--
Peter Wemm - peter@FreeBSD.org; peter@yahoo-inc.com; peter@netplex.com.au
"All of this is for nothing if we don't go to the stars" - JMS/B5