From: Andrew Reilly
Date: Tue, 27 Jun 2006 10:41:55 +1000
To: "M.Hirsch"
Cc: Dmitry Pryanishnikov, freebsd-stable@freebsd.org
Subject: Re: FreeBSD 6.x CVSUP today crashes with zero load ...
Message-ID: <20060627004155.GG92989@duncan.reilly.home>
In-Reply-To: <44A06FFB.40104@hirsch.it>

On Tue, Jun 27, 2006 at 01:38:35AM +0200, M.Hirsch wrote:
> I just would like you (not specifically you, Dmitry) to acknowledge that
> broken RAM is worth a "panic" in "standard situations"- if I may call it
> like that.

Well, ideally, if broken RAM could be isolated with something like
IBM's Chipkill, then that would be better than panicking.  Sort of
like enabling hot-swap of failing disk drives.

The point that's been made, though, is that "soft" errors aren't
necessarily (or even usually) hardware failures at all.  Hardware
failures can look like persistent soft errors, but soft errors are
real: radiation-induced bit flips happen.  ECC turns what would
otherwise be a panic-inducing error state into a total non-event,
improving the uptime of very large memory systems to useful levels.
This is exactly analogous to the forward error correction used on
disk drives and communications channels.  In all of these systems,
the technology has been pushed so close to the limits that the
difference between "signal" and "noise" can only be determined by
sophisticated statistical analysis and systematic redundancy.

> If the RAM is broken for some bits, chances are great that there are
> more following soon.
> ... from the replies I got via PM, I feel some people don't agree with
> that....

A single corrected error just isn't an indication that the hardware
is broken.  If the ECC scrubber can't flip the bit back to the right
state, *then* the hardware is broken, and you do need to panic.

-- 
Andrew
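
To make the corrected-versus-uncorrectable distinction concrete, here is
a toy sketch of a single-error-correcting, double-error-detecting
(SECDED) code.  It uses an extended Hamming (8,4) code over four data
bits purely for illustration; real ECC memory typically protects 64 data
bits with 8 check bits, and the names encode() and decode() are made up
for this example, not taken from any memory controller or kernel.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                     # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos            # XOR of set positions is 0 for a valid word
    parity_error = sum(code) % 2       # 1 means an odd number of bits flipped
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        # Assume a single flip: the syndrome names the bad position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: cannot be repaired.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    soft = list(word)
    soft[6] ^= 1                       # one bit flip: the "soft error" case
    print(decode(soft))
    hard = list(soft)
    hard[2] ^= 1                       # a second flip in the same word
    print(decode(hard))
```

The single flip comes back as a corrected word with the original data,
which is the non-event case.  Only the second flip in the same word is
reported as uncorrectable, which is the case that would justify a panic.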