Date:      Wed, 25 Aug 2010 14:01:19 +0300
From:      Andriy Gapon <avg@icyb.net.ua>
To:        Dan Langille <dan@langille.org>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject:   Re: kernel MCA messages
Message-ID:  <4C74F7FF.8000704@icyb.net.ua>
In-Reply-To: <4C74F36B.2060200@langille.org>
References:  <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org>	<20100824233849.GA35100@icarus.home.lan> <4C74C221.5020702@icyb.net.ua> <4C74F36B.2060200@langille.org>

on 25/08/2010 13:41 Dan Langille said the following:
> On 8/25/2010 3:11 AM, Andriy Gapon wrote:
> 
>> Have you read the decoded message?
>> Please re-read it.
>>
>> I still recommend reading at least the summary of the RAM ECC research article
>> to make your own judgment about need to replace DRAM.
> 
> Andriy: What is your interpretation of the decoded message?  What is your view on
> replacing DRAM?  What do you conclude from the summary?

Most likely you have a small defect in one of your memory modules, perhaps a
"stuck" bit.  You will be getting correctable ECC errors for that module.
Eventually you might get an uncorrectable error.  That may happen soon, or it may
never happen during the lifetime of that module.

As that study has demonstrated, a significant percentage of systems and modules
report at least one correctable ECC error.  Correctable ECC errors at present
correlate with correctable ECC errors in the future.  They also correlate with
uncorrectable errors in the future.  But the percentage of systems developing
uncorrectable errors is significantly smaller, so the chance of a false positive
(a module that reports correctable errors but never develops an uncorrectable
one) is substantial.
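
A rough sketch of that false-positive reasoning in Python, using Bayes' rule.
All three probabilities are made-up placeholders, not figures from the study,
and the function name is only for illustration:

```python
def prob_ue_given_ce(p_ue, p_ce_given_ue, p_ce_given_no_ue):
    """P(uncorrectable error later | correctable error seen now), by Bayes' rule.

    p_ue              -- fraction of systems that ever develop an uncorrectable error
    p_ce_given_ue     -- fraction of those that showed correctable errors first
    p_ce_given_no_ue  -- fraction of healthy-enough systems that also show correctable errors
    """
    # Total probability of seeing a correctable error at all.
    p_ce = p_ce_given_ue * p_ue + p_ce_given_no_ue * (1.0 - p_ue)
    return p_ce_given_ue * p_ue / p_ce


# HYPOTHETICAL inputs, chosen only to show the shape of the argument:
# uncorrectable errors are rare, correctable errors are common.
ppv = prob_ue_given_ce(p_ue=0.02, p_ce_given_ue=0.8, p_ce_given_no_ue=0.3)
false_positive_share = 1.0 - ppv

print(f"P(uncorrectable later | correctable now) = {ppv:.3f}")
print(f"share of 'warnings' that never turn uncorrectable = {false_positive_share:.3f}")
```

With inputs of that shape, the correlation is real but most systems that report
correctable errors still never see an uncorrectable one.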

You should decide whether you want to replace the module (if you can pinpoint it)
or all modules, depending on your resources (money, etc.), the importance of the
service that the server in question provides (allowable downtime and its cost), and
the fault tolerance of a larger system of which the server may be a part (e.g. it
may have a standby server for failover).

I think that most of what I've just said was fairly obvious from the start.
The important bit from that study is that ECC errors are not as random or as rare
as was previously thought, and that they can be attributed to a number of factors
such as manufacturing defects, layout of memory lanes on the motherboard, etc.

-- 
Andriy Gapon


