From owner-freebsd-stable@FreeBSD.ORG  Tue Jun 27 00:42:03 2006
Date: Tue, 27 Jun 2006 10:41:55 +1000
From: Andrew Reilly <andrew-freebsd@areilly.bpc-users.org>
To: "M.Hirsch" <webmaster@hirsch.it>
Message-ID: <20060627004155.GG92989@duncan.reilly.home>
References: <20060626081029.L1114@ganymede.hub.org>
	<20060626140333.M38418@fledge.watson.org>
	<20060626235355.Q95667@atlantis.atlantis.dp.ua>
	<44A04FD2.1030001@hirsch.it>
	<20060627011512.N95667@atlantis.atlantis.dp.ua>
	<44A06233.1090704@hirsch.it>
	<20060627014335.E87535@atlantis.atlantis.dp.ua>
	<44A068A7.3090403@hirsch.it>
	<20060627020819.L3403@atlantis.atlantis.dp.ua>
	<44A06FFB.40104@hirsch.it>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <44A06FFB.40104@hirsch.it>
User-Agent: Mutt/1.4.2.1i
Cc: Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>, freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.x CVSUP today crashes with zero load ...

On Tue, Jun 27, 2006 at 01:38:35AM +0200, M.Hirsch wrote:
> I just would like you (not specifically you, Dmitry) to acknowledge that 
> broken RAM is worth a "panic" in "standard situations"- if I may call it 
> like that.

Well, ideally, if broken RAM could be isolated with something
like IBM's Chipkill, then that would be better than
panicking.  Sort of like enabling hot-swap of failing disk
drives.

The point that's been made, though, is that "soft" errors aren't
necessarily (or even usually) hardware failures at all.  Hardware
failures can look like persistent soft errors, but genuine soft
errors are real: radiation-induced bit flips happen.  ECC
turns what would otherwise be a panic-inducing error state into
a total non-event, improving the uptime of very large memory
systems to useful levels.  This is exactly analogous to the forward
error correction used on disk drives and communications channels.
In all of these systems, the technology has been pushed so close to
the limits that the difference between "signal" and "noise" can
only be determined by sophisticated statistical analysis and
systematic redundancy.
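
To make the "non-event" concrete, here is a rough sketch of
single-bit correction using a toy Hamming(7,4) code.  This is an
illustration only: real ECC memory typically protects 64-bit words
with 8 check bits (SECDED), and the function names below are made up
for the example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused; 1, 2, 4 hold parity
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]      # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]      # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]      # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, flipped_position); position is 0 if no error was seen."""
    code = [0] + list(bits)
    # XOR of the positions of all set bits is zero for a valid codeword,
    # and equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1                    # silently repair the bad bit
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome
```

Flip any one of the seven stored bits and the decoder still hands
back the original nibble, plus the position it repaired; the caller
never has to notice unless it wants to log the event.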

> If the RAM is broken for some bits, chances are great that there are 
> more following soon.
> ... from the replies I got via PM, I feel some people don't agree with 
> that....

A single corrected error just isn't an indication that the
hardware is broken.  If the ECC scrubber can't flip the bit to
the right state, *then* the hardware is broken, and you do need
to panic.
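
As a sketch of that distinction (a toy model, not how any real
memory controller or the FreeBSD kernel does it): write the
corrected word back, read it again, and only escalate if the bit
refuses to take the right state.  The Cell class, its stuck_mask
field, and scrub() are all hypothetical names for this example.

```python
class Cell:
    """Hypothetical model of one memory word; stuck_mask marks bits that cannot change."""

    def __init__(self, value, stuck_mask=0):
        self.value = value
        self.stuck_mask = stuck_mask

    def write(self, new):
        # Stuck bits keep their old state; all other bits take the new value.
        self.value = (self.value & self.stuck_mask) | (new & ~self.stuck_mask)

    def read(self):
        return self.value


def scrub(cell, corrected):
    """Write the ECC-corrected word back and classify what happened."""
    if cell.read() == corrected:
        return "clean"
    cell.write(corrected)
    if cell.read() == corrected:
        return "soft"   # transient flip, now repaired: log it and carry on
    return "hard"       # bit will not hold the right state: time to escalate
```

With a healthy cell, scrub(Cell(0b1011), 0b1010) returns "soft" and
the cell now holds 0b1010.  With bit 0 stuck,
scrub(Cell(0b1011, stuck_mask=0b0001), 0b1010) returns "hard".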

-- 
Andrew