Date: Mon, 29 Jun 1998 08:42:31 -0700 (PDT)
From: Tom
To: Peter Jeremy
cc: freebsd-stable@FreeBSD.ORG
Subject: Re: determining ecc errors on freebsd-stable
In-Reply-To: <199806290401.OAA02134@gsms01.alcatel.com.au>

On Mon, 29 Jun 1998, Peter Jeremy wrote:

> On Sun, 28 Jun 1998 19:57:26 -0700 (PDT), Tom wrote:
> >  Basically, if a fixable error occurs, you won't know about it.  If an
> >unfixable error occurs, you'll know real fast.
>
> Which substantially reduces the usefulness of ECC.  It may increase
> the MTBF (since a single-bit failure is now hidden), but it no longer

  It will increase MTBF a lot, because you know you have a memory problem
after one crash and reboot (outage: < 5 minutes), as compared to non-ECC
where you will probably have to go through a dozen application crashes and
a few system hangs and/or panics.  System hangs are the worst (outage:
until someone notices and gets down to power cycle the machine).

> provides fault tolerance since you can't detect a memory module that
> is getting flaky (or has a hard error).

  Huh?  "no longer provides fault tolerance"?  How does it provide fault
tolerance?  If memory fails, something has to die.
The FreeBSD approach of simply rebooting is a bit drastic, but at minimum
you have to kill whatever process (assuming it isn't the kernel) is
occupying that memory.

  Also, using single-bit errors to detect "flaky but still working
modules" doesn't hold much weight with me.  Why?  Either memory works or
it doesn't.  "Flaky" memory typically is heat triggered, not random.  I
have a bunch of 24x7 servers with parity memory (which will crash on even
a single-bit error), and memory failure is rare.

  Summary:

  - ECC is MUCH better than non-ECC.
  - Memory failure is rare.  FreeBSD still doesn't have multi-path IO to
    recover from controller card failure, which occurs much more often.
    Nor does it have clustering, which can protect against software
    failures (which are still much more common than any kind of hardware
    failure).  So putting so much emphasis on ECC is unnecessary.

> Yet another design engineer for the firing squad...

  Still waiting for someone's patches to FreeBSD...

> Peter
> --
> Peter Jeremy (VK2PJ)            peter.jeremy@alcatel.com.au
> Alcatel Australia Limited
> 41 Mandible St                  Phone: +61 2 9690 5019
> ALEXANDRIA  NSW  2015           Fax:   +61 2 9690 5247

Tom