From: John Baldwin
To: freebsd-stable@freebsd.org
Cc: Miroslav Lachman <000.fbsd@quip.cz>, "Matthew D. Fuller"
Date: Tue, 28 Dec 2010 11:42:32 -0500
Subject: Re: MCA messages after upgrade to 8.2-BEAT1

On Friday, December 24, 2010 3:47:16 am Matthew D. Fuller wrote:
> On Wed, Dec 22, 2010 at 09:57:26AM -0500 I heard the voice of
> John Baldwin, and lo! it spake thus:
> >
> > You are getting corrected ECC errors in your RAM.
>
> Actually, don't
>
> > CPU 0 0 data cache
> > ADDR 236493c0
> >   Data cache ECC error (syndrome 1c)
>
> > CPU 0 1 instruction cache
> > ADDR 2a1c9440
> >   Instruction cache ECC error
>
> > CPU 0 2 bus unit
> >   L2 cache ECC error
>
> > CPU 1 0 data cache
> > ADDR 23649640
> >   Data cache ECC error (syndrome 1c)
>
> > CPU 1 1 instruction cache
> > ADDR 2a1c9440
> >   Instruction cache ECC error
>
> > CPU 1 2 bus unit
> >   L2 cache ECC error
>
> suggest CPU cache, not RAM?
>
> (that's actually a question; I don't know, but that's what a naive
> reading suggests...)

Hmm, I don't know for certain.  My interpretation is that the CPU
errors were just secondary errors from a memory error like this one
that was in the middle of his reported errors.  It was also only
reported on CPU 0 and not CPU 1:

STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is NOT a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
MISC e00d0fff00000000 ADDR 2cac9678
  Northbridge RAM ECC error
  ECC syndrome = 1c
       bit33 = err cpu1
       bit46 = corrected ecc error
       bit59 = misc error valid
       bit62 = error overflow (multiple errors)
  bus error 'local node origin, request didn't time out
      generic read mem transaction
      memory access, level generic'

On Intel systems (which I am much more familiar with as far as machine
checks go), corrected ECC errors did not result in additional events in
the CPU caches themselves, but I don't know if AMD is different in this
regard.

It could be that both CPUs and a DIMM are failing, but replacing a DIMM
is cheaper and simpler and you can always replace the CPUs later if CPU
errors continue.  Of course, I can't tell you which DIMM to replace from
these messages, but in this case since they are so easily reproducible,
you could probably swap them out one at a time to test.

-- 
John Baldwin
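
For reference, here is a minimal sketch of how the flag bits in a STATUS word like the one quoted above can be picked apart by hand. The names for bit46, bit59 and bit62 are taken from the mcelog output itself; the names for bits 63, 61, 60 and 58 are my assumption of the usual x86 machine-check status layout, and decode_status is a hypothetical helper, not part of mcelog or FreeBSD. The value decoded is the STATUS word that appears in the excerpt; the excerpt does not make clear which record it belongs to.

```python
# Flag bits of a machine-check bank STATUS register.
# Bits 46, 59 and 62 use the names printed in the quoted mcelog output.
# Bits 63, 61, 60 and 58 are an assumption (usual MCi_STATUS layout).
STATUS_BITS = {
    63: "valid",
    62: "error overflow (multiple errors)",
    61: "uncorrected error",
    60: "error reporting enabled",
    59: "misc error valid",
    58: "address valid",
    46: "corrected ecc error",
}


def decode_status(status):
    """Return 'bitN = name' strings for every known flag set in status."""
    lines = []
    for bit in sorted(STATUS_BITS, reverse=True):
        if (status >> bit) & 1:
            lines.append("bit%d = %s" % (bit, STATUS_BITS[bit]))
    return lines


if __name__ == "__main__":
    # STATUS value copied from the quoted output.
    for line in decode_status(0xD000400000000863):
        print(line)
```

Only the flag bits are decoded here; the low-order error code and the ECC syndrome field are left alone, since their layout is model specific.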