Date: Wed, 25 Aug 2010 14:01:19 +0300
From: Andriy Gapon <avg@icyb.net.ua>
To: Dan Langille <dan@langille.org>
Cc: freebsd-stable <freebsd-stable@freebsd.org>, Jeremy Chadwick <freebsd@jdc.parodius.com>
Subject: Re: kernel MCA messages
Message-ID: <4C74F7FF.8000704@icyb.net.ua>
In-Reply-To: <4C74F36B.2060200@langille.org>
References: <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org> <20100824233849.GA35100@icarus.home.lan> <4C74C221.5020702@icyb.net.ua> <4C74F36B.2060200@langille.org>

on 25/08/2010 13:41 Dan Langille said the following:
> On 8/25/2010 3:11 AM, Andriy Gapon wrote:
>
>> Have you read the decoded message?
>> Please re-read it.
>>
>> I still recommend reading at least the summary of the RAM ECC research article
>> to make your own judgment about need to replace DRAM.
>
> Andriy: What is your interpretation of the decoded message? What is your view on
> replacing DRAM? What do you conclude from the summary?

Most likely you have a small defect in one of your memory modules, perhaps a
"stuck" bit. You will be getting correctable ECC errors for that module.
Eventually you might get an uncorrectable error. That may happen soon, or it may
never happen during the lifetime of that module.

As that study has demonstrated, a significant percentage of systems and modules
report at least one correctable ECC error. Correctable ECC errors at present
correlate with correctable ECC errors in the future. They also correlate with
uncorrectable errors in the future. But the percentage of systems developing
uncorrectable errors is significantly smaller, so the chances of false positives
are substantial.

You should decide whether you want to replace the module (if you can pinpoint
it) or all modules depending on your resources (money, etc.) and the importance
of the service that the server in question provides (allowable downtime and its
cost, and the fault tolerance of a larger system of which the server may be a
part, e.g. it may have a standby server for failover).

I think that most of what I've just said was kind of obvious from the start.
The important bit from that study is that ECC errors are not as random and as
rare as was previously thought, and they can be attributed to a number of
factors like manufacturing defects, layout of memory lanes on the motherboard,
etc.

--
Andriy Gapon
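
The false-positive argument above can be made concrete with a little arithmetic. What follows is only a sketch: the function name and every number passed to it are hypothetical, chosen for illustration, and are not figures from the study. It counts how many systems that report a correctable error would go on to an uncorrectable one, and how many would have had a module replaced for nothing.

```python
def replacement_summary(n_systems: int, p_ce: float, p_ue_given_ce: float) -> dict:
    """Split systems reporting correctable errors (CE) into those that later
    develop an uncorrectable error (UE) and those that never do.

    All inputs are illustrative assumptions, not measured values:
      n_systems      size of the fleet
      p_ce           fraction of systems reporting at least one CE
      p_ue_given_ce  fraction of CE-reporting systems that later see a UE
    """
    for name, p in (("p_ce", p_ce), ("p_ue_given_ce", p_ue_given_ce)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    flagged = n_systems * p_ce              # systems with a CE "warning"
    true_pos = flagged * p_ue_given_ce      # warnings followed by a UE
    false_pos = flagged - true_pos          # warnings never followed by a UE
    share = false_pos / flagged if flagged else 0.0
    return {
        "flagged": flagged,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_positive_share": share,
    }


if __name__ == "__main__":
    # Hypothetical fleet: 1000 machines, 30% report a CE, 5% of those see a UE.
    for key, value in replacement_summary(1000, 0.30, 0.05).items():
        print(f"{key}: {value:.2f}")
```

With inputs like these, most of the flagged systems never develop an uncorrectable error, which is why the decision comes down to the cost of a replacement against the cost of downtime rather than to the warning alone.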