Date:      Tue, 27 Jun 2006 01:07:19 +0200
From:      "M.Hirsch" <webmaster@hirsch.it>
To:        Dmitry Pryanishnikov <dmitry@atlantis.dp.ua>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: FreeBSD 6.x CVSUP today crashes with zero load ...
Message-ID:  <44A068A7.3090403@hirsch.it>
In-Reply-To: <20060627014335.E87535@atlantis.atlantis.dp.ua>
References:  <E1FuYsL-000HT3-H2@dilbert.firstcallgroup.co.uk> <20060626100949.G24406@fledge.watson.org> <20060626081029.L1114@ganymede.hub.org> <20060626140333.M38418@fledge.watson.org> <20060626235355.Q95667@atlantis.atlantis.dp.ua> <44A04FD2.1030001@hirsch.it> <20060627011512.N95667@atlantis.atlantis.dp.ua> <44A06233.1090704@hirsch.it> <20060627014335.E87535@atlantis.atlantis.dp.ua>

Dmitry Pryanishnikov wrote:

> When you wrote "ECC is a way to mask broken hardware", you were plain
> wrong.
> If you're using hardware w/o ECC, it just can't tell whether error
> present or absent. So ECC _is_ the way to detect (not mask) broken
> hardware.
>
Ok, thanks. I think I understand the meaning of ECC now.
So, contrary to what my supplier claims, ECC is not supposed to protect
against hardware failures.
But it is the way to detect them, right?
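
To make the detect-versus-mask distinction concrete, here is a minimal sketch of a SEC-DED (single-error-correct, double-error-detect) code in Python. It is a toy extended Hamming(8,4) code over a 4-bit value, not the code any particular chipset uses (real ECC memory typically protects much wider words), and the function names `encode` and `decode` are made up for illustration. The point is that the decoder always reports what it saw: "ok", "corrected" or "uncorrectable".

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SEC-DED) word."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    # List index equals bit position; index 0 holds the overall parity bit.
    word = [0, p1, p2, d1, p4, d2, d3, d4]
    word[0] = sum(word) % 2  # make the parity of all 8 bits even
    return word


def decode(word):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(word) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd overall parity: treat as a single flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        word = list(word)
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    return status, [word[3], word[5], word[6], word[7]]


if __name__ == "__main__":
    data = [1, 0, 1, 1]
    stored = encode(data)

    one_flip = list(stored)
    one_flip[5] ^= 1  # simulate a single-bit memory error

    two_flips = list(stored)
    two_flips[2] ^= 1
    two_flips[6] ^= 1  # simulate a double-bit memory error

    for label, w in (("clean", stored), ("1 flip", one_flip), ("2 flips", two_flips)):
        print(label, decode(w))
```

Without the check bits, all three words would simply be read back as data and nobody would know whether they were right.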

> If you want ECC corrector to raise NMI on corrected error (as well as
> uncorrectable), just set appropriate bit in control register - every
> Intel's ECC-capable chipset allows it. But if we're speaking about
> production environment, such behaviour (abnormal termination on
> _corrected_ error) is unacceptable.

"Abnormal termination" is not only acceptable to me, it is what I am
looking for.
Make the node crash completely, so one of the others can take over its
task(s).

>> Don't get me wrong, but tracking bugs in FreeBSD is quite more of an
>> effort than "just" acquiring a new box...
>
> I don't see connection between this sentence and ECC (which is
> hardware option).

What I wanted to say:
Looking for errors in the logs takes only a few seconds.
Finding out what caused them takes hours...
Acquiring a new box is only $29.95 ;) - that's like 30 minutes, if you
look at it from the business side. ... I would rather rent 100 boxes to
do the task of ten than employ 100 admins to find the "real" problem.

Thanks, Dmitry. I think I know what to look for now...

M.


