Date: Wed, 25 Aug 2010 08:25:34 -0400
From: John Baldwin <jhb@freebsd.org>
To: freebsd-stable@freebsd.org
Cc: Andriy Gapon <avg@icyb.net.ua>, Jeremy Chadwick <freebsd@jdc.parodius.com>, Dan Langille <dan@langille.org>
Subject: Re: kernel MCA messages
Message-ID: <201008250825.34903.jhb@freebsd.org>
In-Reply-To: <4C74F7FF.8000704@icyb.net.ua>
References: <4C71CC62.6060803@langille.org> <4C74F36B.2060200@langille.org> <4C74F7FF.8000704@icyb.net.ua>

On Wednesday, August 25, 2010 7:01:19 am Andriy Gapon wrote:
> on 25/08/2010 13:41 Dan Langille said the following:
> > On 8/25/2010 3:11 AM, Andriy Gapon wrote:
> >
> >> Have you read the decoded message?
> >> Please re-read it.
> >>
> >> I still recommend reading at least the summary of the RAM ECC research article
> >> to make your own judgment about the need to replace DRAM.
> >
> > Andriy: What is your interpretation of the decoded message? What is your view on
> > replacing DRAM? What do you conclude from the summary?
>
> Most likely you have a small defect in one of your memory modules, perhaps a
> "stuck" bit. You will be getting correctable ECC errors for that module.
> Eventually you might get an uncorrectable error. That may happen soon or it may
> never happen during the lifetime of that module.
>
> As that study has demonstrated, a significant percentage of systems and modules
> report at least one correctable ECC error. ECC correctable errors at present
> correlate with correctable ECC errors in the future. They also correlate with
> uncorrectable errors in the future. But the percentage of systems developing
> uncorrectable errors is significantly smaller, so chances of false positives are
> substantial.
>
> You should decide whether you want to replace the module (if you can pinpoint it)
> or all modules depending on your resources (money, etc), importance of service
> that the server in question provides (allowable downtime and its cost), and
> fault-tolerance of a larger system, of which the server may be a part (e.g. it may
> have a standby server for failover).
>
> I think that most of what I've just said was kind of obvious from the start.
> The important bit from that study is that ECC errors are not as random and as rare
> as was previously thought, and they can be attributed to a number of factors like
> manufacturing defects, layout of memory lanes on motherboard, etc.

A while back I found a slide deck from some Intel presentation that claimed
that a modern 4GB DIMM should average 18 corrected errors a month. Your rate
is a bit higher than that, but corrected ECC errors are not that unexpected.

-- 
John Baldwin
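
For anyone wanting to compare their own logs against the figure quoted above, here is a rough sketch of the arithmetic. It takes the 18 corrected errors per month for a 4GB DIMM at face value and assumes, purely for illustration, that the rate scales linearly with capacity and that a month is 30 days. Neither assumption comes from the slide deck, and the function names are made up for this example.

```python
# Baseline quoted in this thread: ~18 corrected errors/month for a 4GB DIMM.
BASELINE_ERRORS_PER_MONTH = 18.0
BASELINE_DIMM_GB = 4.0
DAYS_PER_MONTH = 30.0  # assumption: a "month" is treated as 30 days


def expected_corrected_errors(dimm_gb, days):
    """Expected corrected-error count for one DIMM over `days` days.

    Assumes the rate scales linearly with capacity (an illustrative
    assumption, not something stated in the thread).
    """
    per_gb_per_day = BASELINE_ERRORS_PER_MONTH / BASELINE_DIMM_GB / DAYS_PER_MONTH
    return per_gb_per_day * dimm_gb * days


def ratio_to_baseline(observed, dimm_gb, days):
    """How many times the baseline rate the observed count represents."""
    return observed / expected_corrected_errors(dimm_gb, days)


if __name__ == "__main__":
    # Hypothetical observation: 12 corrected errors on a 4GB DIMM in 7 days.
    print(round(ratio_to_baseline(12, 4.0, 7), 2))
```

A ratio modestly above 1 would match "a bit higher than that"; whether any given ratio justifies replacing a module is still the judgment call Andriy describes.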
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?201008250825.34903.jhb>