From owner-freebsd-stable@FreeBSD.ORG Wed Aug 25 11:01:26 2010
Message-ID: <4C74F7FF.8000704@icyb.net.ua>
Date: Wed, 25 Aug 2010 14:01:19 +0300
From: Andriy Gapon
To: Dan Langille
Cc: freebsd-stable, Jeremy Chadwick
Subject: Re: kernel MCA messages
References: <4C71CC62.6060803@langille.org> <4C745213.3050004@langille.org>
 <20100824233849.GA35100@icarus.home.lan> <4C74C221.5020702@icyb.net.ua>
 <4C74F36B.2060200@langille.org>
In-Reply-To: <4C74F36B.2060200@langille.org>
List-Id: Production branch of FreeBSD source code

on 25/08/2010 13:41 Dan Langille said the following:
> On 8/25/2010 3:11 AM, Andriy Gapon wrote:
>
>> Have you read the decoded message?
>> Please re-read it.
>>
>> I still recommend reading at least the summary of the RAM ECC research article
>> to make your own judgment about need to replace DRAM.
>
> Andriy: What is your interpretation of the decoded message?
> What is your view on replacing DRAM? What do you conclude from the summary?

Most likely you have a small defect in one of your memory modules, perhaps a
"stuck" bit. You will be getting correctable ECC errors for that module.
Eventually you might get an uncorrectable error. That may happen soon, or it
may never happen during the lifetime of that module.

As that study has demonstrated, a significant percentage of systems and
modules report at least one correctable ECC error. Correctable ECC errors at
present correlate with correctable ECC errors in the future. They also
correlate with uncorrectable errors in the future. But the percentage of
systems developing uncorrectable errors is significantly smaller, so the
chances of false positives are substantial.

You should decide whether you want to replace the module (if you can pinpoint
it) or all modules depending on your resources (money, etc.), the importance
of the service that the server in question provides (allowable downtime and
its cost), and the fault tolerance of a larger system of which the server may
be a part (e.g. it may have a standby server for failover).

I think that most of what I've just said was kind of obvious from the start.
The important bit from that study is that ECC errors are not as random and as
rare as was previously thought, and they can be attributed to a number of
factors like manufacturing defects, layout of memory lanes on the motherboard,
etc.

-- 
Andriy Gapon