Date: Tue, 28 Dec 2010 11:42:32 -0500
From: John Baldwin <jhb@freebsd.org>
To: freebsd-stable@freebsd.org
Cc: Miroslav Lachman <000.fbsd@quip.cz>, "Matthew D. Fuller" <fullermd@over-yonder.net>
Subject: Re: MCA messages after upgrade to 8.2-BEAT1
Message-ID: <201012281142.32654.jhb@freebsd.org>
In-Reply-To: <20101224084716.GM94020@over-yonder.net>
References: <4D11F1F5.7050902@quip.cz> <201012220957.26854.jhb@freebsd.org> <20101224084716.GM94020@over-yonder.net>

On Friday, December 24, 2010 3:47:16 am Matthew D. Fuller wrote:
> On Wed, Dec 22, 2010 at 09:57:26AM -0500 I heard the voice of
> John Baldwin, and lo! it spake thus:
> >
> > You are getting corrected ECC errors in your RAM.
>
> Actually, don't
>
> > CPU 0 0 data cache
> > ADDR 236493c0
> > Data cache ECC error (syndrome 1c)
>
> > CPU 0 1 instruction cache
> > ADDR 2a1c9440
> > Instruction cache ECC error
>
> > CPU 0 2 bus unit
> > L2 cache ECC error
>
> > CPU 1 0 data cache
> > ADDR 23649640
> > Data cache ECC error (syndrome 1c)
>
> > CPU 1 1 instruction cache
> > ADDR 2a1c9440
> > Instruction cache ECC error
>
> > CPU 1 2 bus unit
> > L2 cache ECC error
>
> suggest CPU cache, not RAM?
>
> (that's actually a question; I don't know, but that's what a naive
> reading suggests...)

Hmm, I don't know for certain. My interpretation is that the CPU errors were
just secondary errors from a memory error like this one that was in the middle
of his reported errors. It was also only reported on CPU 0 and not CPU 1:

STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is NOT a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
MISC e00d0fff00000000 ADDR 2cac9678
Northbridge RAM ECC error
ECC syndrome = 1c
bit33 = err cpu1
bit46 = corrected ecc error
bit59 = misc error valid
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'

On Intel systems (which I am much more familiar with as far as machine checks
go), corrected ECC errors did not result in additional events in the CPU
caches themselves, but I don't know if AMD is different in this regard. It
could be that both CPUs and a DIMM are failing, but replacing a DIMM is
cheaper and simpler and you can always replace the CPUs later if CPU errors
continue. Of course, I can't tell you which DIMM to replace from these
messages, but in this case since they are so easily reproducible, you could
probably swap them out one at a time to test.

-- 
John Baldwin
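
For readers unfamiliar with how the "bitNN = ..." lines in the decoded northbridge record come about, here is a minimal Python sketch of that kind of flag test. The bit numbers and descriptions are copied from the record quoted above. The function name, the table name, and the example status word are hypothetical: the status word is built from those four bits purely for illustration and is not a value from the log. It is not how mcelog itself is implemented.

```python
# Flag bits exactly as named in the decoded northbridge record quoted above.
NB_STATUS_FLAGS = {
    33: "err cpu1",
    46: "corrected ecc error",
    59: "misc error valid",
    62: "error overflow (multiple errors)",
}


def decode_flags(status: int) -> list[str]:
    """Return 'bitN = description' for each known flag that is set in status."""
    return [
        f"bit{bit} = {name}"
        for bit, name in sorted(NB_STATUS_FLAGS.items())
        if (status >> bit) & 1
    ]


# Hypothetical status word with only those four bits set (illustration only,
# not taken from the reported errors).
example_status = sum(1 << bit for bit in NB_STATUS_FLAGS)

for line in decode_flags(example_status):
    print(line)
```

A set "corrected ecc error" flag is what supports reading the event as a corrected RAM error rather than an uncorrected one. The flags alone do not identify which DIMM is involved.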