From owner-freebsd-hardware Tue May 21 23:37:05 1996
Return-Path: owner-hardware
Received: (from root@localhost) by freefall.freebsd.org (8.7.3/8.7.3)
        id XAA21336 for hardware-outgoing; Tue, 21 May 1996 23:37:05 -0700 (PDT)
Received: from GndRsh.aac.dev.com (GndRsh.aac.dev.com [198.145.92.241])
        by freefall.freebsd.org (8.7.3/8.7.3) with SMTP id XAA21322
        for ; Tue, 21 May 1996 23:37:02 -0700 (PDT)
Received: (from rgrimes@localhost) by GndRsh.aac.dev.com (8.6.12/8.6.12)
        id XAA13917; Tue, 21 May 1996 23:36:35 -0700
From: "Rodney W. Grimes"
Message-Id: <199605220636.XAA13917@GndRsh.aac.dev.com>
Subject: Re: Triton chipset with 256k cache caches 32M only?
To: barney@databus.com (Barney Wolff)
Date: Tue, 21 May 1996 23:36:34 -0700 (PDT)
Cc: hardware@FreeBSD.org
In-Reply-To: <31a23f350.da6@databus.databus.com> from Barney Wolff at "May 21, 96 05:59:00 pm"
X-Mailer: ELM [version 2.4ME+ PL11 (25)]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-hardware@FreeBSD.org
X-Loop: FreeBSD.org
Precedence: bulk

> The figure of "once in 10 years" was given without any indication of
> what it applies to.  0.1/year/bit?  per MB?  per SIMM?  per 64MB?

The current SER (Soft Error Rate) on 16MBytes of memory using 16Mbit
chips is on the order of 0.1 per year.  (That would be typical of a
pair of 8MB 72 pin SIMMs.)

I left the specification fairly ambiguous because it is derived from
several charts, one of which is a chart of ``FIT per Bit'' rates of
DRAMs vs technologies.  A FIT is a ``Failure In Time'', that is, one
failure per billion hours of operation.  Another is a chart of MTBF
due to soft errors vs system hours vs DRAM density.  Depending on how
you want to interpret all this data and what memory densities you are
looking at, you can come up with a whole lot of different numbers.

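[Editorial illustration, not part of the original message.]  The
relation between a per-bit FIT rate and a per-system soft error rate
can be sketched as below.  This is a minimal sketch under stated
assumptions: the function names are hypothetical, 1MB is taken as 2^20
bytes, a year as 8760 hours, and the only data point used is the
0.1 errors/year on 16MBytes quoted above.  The per-bit FIT value it
prints is backed out from that one figure, not read from the charts
the message refers to.

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours; leap years ignored (assumption)
FIT_HOURS = 1e9             # 1 FIT = 1 failure per billion hours of operation


def bits_in_megabytes(megabytes):
    """Number of bits in `megabytes`, taking 1MB as 2**20 bytes."""
    return megabytes * 1024 * 1024 * 8


def fit_per_bit_from_ser(ser_per_year, megabytes):
    """Back out a per-bit FIT rate from a soft error rate (errors/year)
    observed on `megabytes` of memory."""
    errors_per_hour = ser_per_year / HOURS_PER_YEAR
    return errors_per_hour * FIT_HOURS / bits_in_megabytes(megabytes)


def errors_per_year(fit_per_bit, megabytes):
    """Expected soft errors per year for `megabytes` of memory.

    Assumes every bit fails independently at the same rate, so the
    system rate is simply (bits * per-bit rate)."""
    return fit_per_bit * bits_in_megabytes(megabytes) * HOURS_PER_YEAR / FIT_HOURS


def years_between_errors(fit_per_bit, megabytes):
    """Mean time between soft errors, in years, under the same assumption."""
    return 1.0 / errors_per_year(fit_per_bit, megabytes)


if __name__ == "__main__":
    # The one figure taken from the message: 0.1 errors/year on 16MBytes.
    fit = fit_per_bit_from_ser(0.1, 16)
    print("implied FIT per bit: %.3g" % fit)
    print("errors/year on 16MB: %.3g" % errors_per_year(fit, 16))
    print("years between errors on 16MB: %.3g" % years_between_errors(fit, 16))
```

Passing other memory sizes to errors_per_year() scales the rate
linearly with bit count.  That only holds for a fixed DRAM technology
with a constant per-bit rate, which is an assumption of this sketch
and not something the message states.
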
But since I build systems, I know that most FreeBSD Pentium systems
being built today use either 4Mbit or 16Mbit DRAM technology, and
typical memory sizes are between 16 and 64MB.  Given those criteria
you're going to see a memory error about once in 10 years; that's all
the data allows you to state with significant accuracy (that's 10
years, not 10.0 years; significant digits apply here, thus anything
between 1 in 5.0 and 1 in 15.0 years).

> I am familiar with a network of 100 64MB machines, and it sees at least
> a few corrected ECC errors a week, so I suspect the raw error rate
> is much more like 1 a year, if not higher, not 1 a decade.

And how old are these machines, and what density/technology is the
memory?  I suspect we are talking about 1Mbit DRAM technology (SER is
about 1.2bit/year/2MB).  I also suspect you have some memory in there
that is in pretty bad condition.  A cluster of 50 HP9000/J200's with
384MB to 512MB each is seeing an ECC error once in a blue moon; in
fact, I can't remember the last one it had.

Memory FIT rates have improved 2 orders of magnitude between 1Mbit and
16Mbit technologies.

> For almost any purpose, a crash a year is acceptable, if recovery is
> reasonable.  Data corruption is not acceptable.  My net of all this
> is that I'll run with parity if it's faster than ECC, but not run
> with nothing at all.

That's pretty much what I am telling folks.  If you have something
mission critical enough that you can't withstand 1 crash attributable
to a memory error sometime over the useful life of the system (I
consider the useful life of current technology to be <3 years), then
run with ECC on; but then anyone with those types of requirements is
going to be doing a lot more than just ECC memory.

-- 
Rod Grimes                                      rgrimes@gndrsh.aac.dev.com
Accurate Automation Company                 Reliable computers for FreeBSD