From owner-freebsd-stable  Mon Jun 29 08:44:03 1998
Date: Mon, 29 Jun 1998 08:42:31 -0700 (PDT)
From: Tom <tom@uniserve.com>
X-Sender: tom@shell.uniserve.ca
To: Peter Jeremy <peter.jeremy@alcatel.com.au>
cc: freebsd-stable@FreeBSD.ORG
Subject: Re: determining ecc errors on freebsd-stable
In-Reply-To: <199806290401.OAA02134@gsms01.alcatel.com.au>
Message-ID: <Pine.BSF.3.96.980629083014.25711A-100000@shell.uniserve.ca>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII


On Mon, 29 Jun 1998, Peter Jeremy wrote:

> On Sun, 28 Jun 1998 19:57:26 -0700 (PDT), Tom <tom@uniserve.com> wrote:
> >  Basically, if a fixable error occurs, you won't know about it.  If an
> >unfixable error occurs, you'll know real fast.
> 
> Which substantially reduces the usefulness of ECC.  It may increase
> the MTBF (since a single-bit failure is now hidden), but it no longer

  It will increase MTBF a lot, because you know you have a memory problem
after one crash and reboot (outage: < 5 minutes), as compared to non-ECC
where you will probably have to go through a dozen application crashes and
a few system hangs and/or panics.  System hangs are the worst (outage:
until someone notices and gets down to power cycle the machine).

> provides fault tolerance since you can't detect a memory module that
> is getting flaky (or has a hard error).

  Huh?  "no longer provides fault tolerance"?  How does it provide fault
tolerance?  If memory fails, something has to die.  The FreeBSD approach
of simply rebooting is a bit drastic, but at minimum you have to kill
whatever process (assuming it isn't the kernel) is occupying that memory.

  Also, using single-bit errors to detect "flaky but still working
modules" doesn't hold much weight with me.  Why?  Either memory works or it
doesn't.  "Flaky" memory typically is heat-triggered, not random.  I have
a bunch of 24x7 servers with parity memory (which will crash on even a
single-bit error), and memory failure is rare.
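
  To make the parity vs. ECC distinction concrete, here is a rough sketch
of a single-error-correct, double-error-detect (SECDED) code.  It is a toy:
it protects 4 data bits with an extended Hamming(8,4) code, whereas real ECC
memory typically protects 64 data bits with 8 check bits.  The function
names (encode, decode, parity_ok) and the status strings are made up for
illustration and are not taken from any chipset or from FreeBSD.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1, 2 and 4 hold the
    Hamming check bits; indices 3, 5, 6 and 7 hold the data bits.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    # Check bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # Overall parity over the other seven bits (this is all plain parity has).
    code[0] = sum(code[1:]) % 2
    return code


def parity_ok(code):
    """Plain parity check: detects an odd number of flips, corrects nothing."""
    return sum(code) % 2 == 0


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    if syndrome == 0 and parity_ok(code):
        status = "ok"
    elif not parity_ok(code):
        # Odd number of flips: treat as a single-bit error at index `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1
    two_flips = list(one_flip)
    two_flips[6] ^= 1
    print(decode(word), decode(one_flip), decode(two_flips))
```

  With parity alone, parity_ok() can only say "something is wrong", so the
machine has to stop.  With the SECDED code, a single flip comes back as
"corrected" and the data is intact; whether that status ever reaches the
operator is exactly what this thread is arguing about.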

  Summary:

- ECC is MUCH better than non-ECC
- Memory failure is rare.  FreeBSD still doesn't have multi-path IO to
recover from controller card failure, which occurs much more often, or
clustering, which can protect against software failures (which are still
much more common than any kind of hardware failure).  So putting so much
emphasis on ECC is unnecessary.

> Yet another design engineer for the firing squad...

  Still waiting for someone's patches to FreeBSD...

> Peter
> --
> Peter Jeremy (VK2PJ)                    peter.jeremy@alcatel.com.au
> Alcatel Australia Limited
> 41 Mandible St                          Phone: +61 2 9690 5019
> ALEXANDRIA  NSW  2015                   Fax:   +61 2 9690 5247

Tom

