Date:      Tue, 27 Jun 2006 10:41:55 +1000
From:      Andrew Reilly <andrew-freebsd@areilly.bpc-users.org>
To:        "M.Hirsch" <webmaster@hirsch.it>
Cc:        Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>, freebsd-stable@freebsd.org
Subject:   Re: FreeBSD 6.x CVSUP today crashes with zero load ...
Message-ID:  <20060627004155.GG92989@duncan.reilly.home>
In-Reply-To: <44A06FFB.40104@hirsch.it>
References:  <20060626081029.L1114@ganymede.hub.org> <20060626140333.M38418@fledge.watson.org> <20060626235355.Q95667@atlantis.atlantis.dp.ua> <44A04FD2.1030001@hirsch.it> <20060627011512.N95667@atlantis.atlantis.dp.ua> <44A06233.1090704@hirsch.it> <20060627014335.E87535@atlantis.atlantis.dp.ua> <44A068A7.3090403@hirsch.it> <20060627020819.L3403@atlantis.atlantis.dp.ua> <44A06FFB.40104@hirsch.it>

On Tue, Jun 27, 2006 at 01:38:35AM +0200, M.Hirsch wrote:
> I just would like you (not specifically you, Dmitry) to acknowledge that
> broken RAM is worth a "panic" in "standard situations", if I may call it
> that.

Well, ideally, if broken RAM could be isolated with something
like IBM's Chipkill, then that would be better than
panicking.  Sort of like enabling hot-swap of failing disk
drives.

The point that's been made, though, is that "soft" errors aren't
necessarily (or even usually) hardware failures at all.  Hardware
failures can look like persistent soft errors, but soft errors
are real: radiation-induced bit flips do happen.  ECC
turns what would otherwise be a panic-inducing error state into
a total non-event, improving the uptime of very large memory
systems to useful levels.  This is exactly analogous to the forward
error correction used on disk drives and communications channels.
In all of these systems, the technology has been pushed so close to
the limits that the difference between "signal" and "noise" can
only be determined by sophisticated statistical analysis and
systematic redundancy.
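
To make "systematic redundancy" concrete, here is a toy sketch of
single-error correction using a Hamming(7,4) code.  It is only an
illustration: real ECC memory typically uses a wider SECDED code
(72 bits protecting 64), and the function names encode() and
decode() are my own, not anything from a memory controller or
from FreeBSD.

```python
def encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword (bit positions 1..7)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8                             # index 0 unused so indices match positions
    code[3], code[5], code[6], code[7] = d
    # Each parity bit covers the positions whose index has that bit set.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    return code[1:]


def decode(bits):
    """Return (nibble, corrected_position); position is 0 if no error was seen."""
    code = [0] + list(bits)
    # The XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single-bit error.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1                    # silently repair the flipped bit
    d = [code[3], code[5], code[6], code[7]]
    nibble = sum(bit << i for i, bit in enumerate(d))
    return nibble, syndrome
```

Flip any one of the seven stored bits and decode() still hands back
the original nibble, along with the position it repaired.  That
repair is the "total non-event": the reader of the data never sees
the flip.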

> If the RAM is broken for some bits, chances are great that there are 
> more following soon.
> ... from the replies I got via PM, I feel some people don't agree with 
> that....

A single corrected error just isn't an indication that the
hardware is broken.  If the ECC scrubber can't flip the bit to
the right state, *then* the hardware is broken, and you do need
to panic.
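
A minimal sketch of that distinction, under stated assumptions:
Cell is a made-up model of a memory word whose stuck_mask bits
always read back as 0, and scrub() is a hypothetical helper that
receives the value the ECC logic has already reconstructed.
Neither corresponds to real kernel or hardware interfaces.

```python
class Cell:
    """Hypothetical memory word; bits set in stuck_mask always read as 0."""

    def __init__(self, value, stuck_mask=0):
        self.stuck_mask = stuck_mask
        self.value = 0
        self.write(value)

    def write(self, value):
        # A stuck-at-0 bit ignores whatever is written to it.
        self.value = value & ~self.stuck_mask

    def read(self):
        return self.value


def scrub(cell, corrected_value):
    """Write the corrected value back and classify the error.

    Returns "soft" if the cell now holds the right value (a transient
    flip, nothing to worry about) or "hard" if the bit will not go back
    to the right state (the case that would justify a panic).
    """
    cell.write(corrected_value)
    if cell.read() == corrected_value:
        return "soft"
    return "hard"
```

A cell that took a one-off flip scrubs clean and reports "soft"; a
cell constructed with a nonzero stuck_mask over a bit that should
be 1 reports "hard" no matter how often it is rewritten.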

-- 
Andrew