From: John Baldwin
To: freebsd-stable@freebsd.org
Cc: Andriy Gapon, Jeremy Chadwick, Dan Langille
Date: Wed, 25 Aug 2010 08:25:34 -0400
Subject: Re: kernel MCA messages
Message-Id: <201008250825.34903.jhb@freebsd.org>
In-Reply-To: <4C74F7FF.8000704@icyb.net.ua>

On Wednesday, August 25, 2010 7:01:19 am Andriy Gapon wrote:
> on 25/08/2010 13:41 Dan Langille said the following:
> > On 8/25/2010 3:11 AM, Andriy Gapon wrote:
> >
> >> Have you read the decoded message?
> >> Please re-read it.
> >>
> >> I still recommend reading at least the summary of the RAM ECC research article
> >> to make your own judgment about need to replace DRAM.
> >
> > Andriy: What is your interpretation of the decoded message? What is your view on
> > replacing DRAM? What do you conclude from the summary?
>
> Most likely you have a small defect in one of your memory modules, perhaps a
> "stuck" bit. You will be getting correctable ECC errors for that module.
> Eventually you might get uncorrectable error. That may happen soon or it may
> never happen during lifetime of that modules.
>
> As that study has demonstrated a significant percentage of systems and modules
> report at least one correctable ECC error. ECC correctable errors at present
> correlate with correctable ECC errors in the future. They also correlate with
> uncorrectable errors in the future. But percentage of systems developing
> uncorrectable errors is significantly smaller, so chances of false positives are
> substantial.
>
> You should decide whether you want to replace the module (if you can pinpoint it)
> or all modules depending on your resources (money, etc), importance of service
> that the server in question provides (allowable downtime and cost of it and
> fault-tolerance of a larger system, of which the server may be a part (e.g. it may
> have a standby server for failover).
>
> I think that most of what I've just said was kind of obvious from the start.
> The important bit from that study is that ECC errors are not as random and as rare
> as was previously thought, and they can be attributed to a number of factors like
> manufacturing defects, layout of memory lanes on motherboard, etc.

A while back I found a slide deck from some Intel presentation that claimed
that a modern 4GB DIMM should average 18 corrected errors a month. Your rate
is a bit higher than that, but corrected ECC errors are not that unexpected.

-- 
John Baldwin