From owner-freebsd-hardware  Tue May 21 23:37:05 1996
Return-Path: owner-hardware
Received: (from root@localhost)
          by freefall.freebsd.org (8.7.3/8.7.3) id XAA21336
          for hardware-outgoing; Tue, 21 May 1996 23:37:05 -0700 (PDT)
Received: from GndRsh.aac.dev.com (GndRsh.aac.dev.com [198.145.92.241])
          by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id XAA21322
          for <hardware@FreeBSD.org>; Tue, 21 May 1996 23:37:02 -0700 (PDT)
Received: (from rgrimes@localhost) by GndRsh.aac.dev.com (8.6.12/8.6.12) id XAA13917; Tue, 21 May 1996 23:36:35 -0700
From: "Rodney W. Grimes" <rgrimes@GndRsh.aac.dev.com>
Message-Id: <199605220636.XAA13917@GndRsh.aac.dev.com>
Subject: Re: Triton chipset with 256k cache caches 32M only?
To: barney@databus.com (Barney Wolff)
Date: Tue, 21 May 1996 23:36:34 -0700 (PDT)
Cc: hardware@FreeBSD.org
In-Reply-To: <31a23f350.da6@databus.databus.com> from Barney Wolff at "May 21, 96 05:59:00 pm"
X-Mailer: ELM [version 2.4ME+ PL11 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hardware@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> The figure of "once in 10 years" was given without any indication of
> what it applies to.  0.1/year/bit? per MB? per SIMM? per 64MB?

The current SER (Soft Error Rate) on 16MBytes of memory using 16Mbit chips
is on the order of 0.1 per year.  (That would be typical of a pair of
8MB 72-pin SIMMs.)

I left the specification fairly ambiguous because it is derived from
several charts, one of which plots ``FIT per Bit'' rates of DRAMs vs
technologies.  A FIT is a ``Failure In Time'', that is, one failure per
billion hours of operation.  Another chart is MTBF due to soft errors vs
system hours vs DRAM density.
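
As a rough illustration of how a FIT-per-bit figure turns into a yearly
error count, here is a minimal Python sketch.  The function names are
made up for this example, and it assumes 8760 hours per year and
1MB = 2**20 bytes.  It backs out the per-bit rate implied by the
0.1/year on 16MB figure above; scaling that same rate linearly to 64MB
is an assumption that only holds for the same chip technology.

```python
HOURS_PER_YEAR = 24 * 365   # assumption: 8760 hours, leap days ignored
BITS_PER_MB = 8 * 2**20     # assumption: 1MB = 2**20 bytes


def errors_per_year(fit_per_bit, megabytes):
    """Expected soft errors per year for a memory of the given size.

    One FIT is one failure per billion (1e9) hours of operation.
    """
    bits = megabytes * BITS_PER_MB
    failures_per_hour = fit_per_bit * bits / 1e9
    return failures_per_hour * HOURS_PER_YEAR


def implied_fit_per_bit(yearly_errors, megabytes):
    """Inverse of errors_per_year: the FIT per bit behind a yearly rate."""
    bits = megabytes * BITS_PER_MB
    return yearly_errors / HOURS_PER_YEAR * 1e9 / bits


# Per-bit rate implied by "0.1 per year on 16MBytes".
fit = implied_fit_per_bit(0.1, 16)
print(f"implied FIT per bit: {fit:.2e}")

# Same per-bit rate applied to 64MB (linear scaling is an assumption).
print(f"64MB at that rate: {errors_per_year(fit, 64):.2f} errors/year")
```

The point is only that the per-bit numbers on such charts are tiny, and
the system-level rate depends entirely on how many bits you multiply by.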

Depending on how you want to interpret all this data and what memory
densities you are looking at, you can come up with a whole lot of different
numbers.  But since I build systems, I know that most FreeBSD Pentium
systems being built today use either 4Mbit or 16Mbit DRAM technology, and
typical memory sizes are between 16 and 64MB.

Given those criteria you're going to see a memory error about once in 10
years; that's all the data allows you to state with significant accuracy
(that's 10 years, not 10.0 years; significant digits apply here, thus
anything between 1 in 5 and 1 in 15 years).

> I am familiar with a network of 100 64MB machines, and it sees at least
> a few corrected ECC errors a week, so I suspect the raw error rate
> is much more like 1 a year, if not higher, not 1 a decade.

And how old are these machines, and what density/technology is the
memory?  I suspect we are talking about 1Mbit DRAM technology (SER is
about 1.2bit/year/2MB).  I also suspect you have some memory in there
that is in pretty bad condition.  A cluster of 50 HP9000/J200's with
384MB to 512MB each is seeing an ECC error once in a blue moon; I
can't remember the last one it had, in fact.
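
To put the quoted fleet observation on the same footing as the
0.1/year/16MB figure, a rough Python sketch follows.  The function names
are hypothetical, and "a few" is taken to mean 3 errors per week, which
is purely an assumption for illustration.

```python
WEEKS_PER_YEAR = 52


def per_machine_errors_per_year(fleet_errors_per_week, machines):
    """Average yearly error count per machine from a fleet-wide weekly count."""
    return fleet_errors_per_week * WEEKS_PER_YEAR / machines


def per_16mb_errors_per_year(machine_rate, machine_mb):
    """Normalize a per-machine rate to 16MB, assuming errors scale with size."""
    return machine_rate * 16 / machine_mb


ERRORS_PER_WEEK = 3   # assumption: "a few" corrected errors a week
MACHINES = 100        # from the quoted message
MB_PER_MACHINE = 64   # from the quoted message

machine_rate = per_machine_errors_per_year(ERRORS_PER_WEEK, MACHINES)
rate_16mb = per_16mb_errors_per_year(machine_rate, MB_PER_MACHINE)

print(f"per machine: {machine_rate:.2f} errors/year")
print(f"per 16MB:    {rate_16mb:.2f} errors/year")
print(f"ratio to 0.1/year/16MB: {rate_16mb / 0.1:.1f}x")
```

Under that assumption the fleet comes out a few times worse than the
16Mbit figure, which is the kind of gap that older, lower-density parts or
a handful of marginal SIMMs could account for.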

Memory FIT rates have improved 2 orders of magnitude between 1Mbit and
16Mbit technologies.

> For almost any purpose, a crash a year is acceptable, if recovery is
> reasonable.  Data corruption is not acceptable.  My net of all this
> is that I'll run with parity if it's faster than ECC, but not run
> with nothing at all.

That's pretty much what I am telling folks.  If you have something
mission critical enough that you can't withstand 1 crash attributable to
a memory error sometime over the useful life of the system (I consider the
useful life of current technology to be <3 years), then run with ECC on;
but then anyone with those types of requirements is going to be doing a
lot more than just ECC memory.
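
For a feel of what "about once in 10 years" means over a 3 year service
life, here is a small Python sketch.  It assumes soft errors arrive as a
Poisson process at a constant rate, which is a modelling assumption and
not something the charts state, and the function name is made up for the
example.

```python
import math


def p_at_least_one(mean_years_between_errors, service_years):
    """Chance of one or more errors during the service life.

    Assumes a Poisson process with a constant rate (an assumption).
    """
    rate = 1.0 / mean_years_between_errors
    return 1.0 - math.exp(-rate * service_years)


SERVICE_YEARS = 3  # the "<3 years" useful life, taken at its upper bound

# Evaluate at both ends of the 1-in-5 to 1-in-15 year range and the middle.
for years_between in (5, 10, 15):
    p = p_at_least_one(years_between, SERVICE_YEARS)
    print(f"1 error per {years_between:2d} years: {p:.0%} chance in {SERVICE_YEARS} years")
```

Whether that chance of a single memory-related crash is acceptable is
exactly the mission-critical question above.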


-- 
Rod Grimes                                      rgrimes@gndrsh.aac.dev.com
Accurate Automation Company                 Reliable computers for FreeBSD