Date: Tue, 28 Dec 2010 11:42:32 -0500
From: John Baldwin <jhb@freebsd.org>
To: freebsd-stable@freebsd.org
Cc: Miroslav Lachman <000.fbsd@quip.cz>, "Matthew D. Fuller" <fullermd@over-yonder.net>
Subject: Re: MCA messages after upgrade to 8.2-BEAT1
Message-ID: <201012281142.32654.jhb@freebsd.org>
In-Reply-To: <20101224084716.GM94020@over-yonder.net>
References: <4D11F1F5.7050902@quip.cz> <201012220957.26854.jhb@freebsd.org> <20101224084716.GM94020@over-yonder.net>

On Friday, December 24, 2010 3:47:16 am Matthew D. Fuller wrote:
> On Wed, Dec 22, 2010 at 09:57:26AM -0500 I heard the voice of
> John Baldwin, and lo! it spake thus:
> >
> > You are getting corrected ECC errors in your RAM.
>
> Actually, don't
>
> > CPU 0 0 data cache
> > ADDR 236493c0
> > Data cache ECC error (syndrome 1c)
>
> > CPU 0 1 instruction cache
> > ADDR 2a1c9440
> > Instruction cache ECC error
>
> > CPU 0 2 bus unit
> > L2 cache ECC error
>
> > CPU 1 0 data cache
> > ADDR 23649640
> > Data cache ECC error (syndrome 1c)
>
> > CPU 1 1 instruction cache
> > ADDR 2a1c9440
> > Instruction cache ECC error
>
> > CPU 1 2 bus unit
> > L2 cache ECC error
>
> suggest CPU cache, not RAM?
>
> (that's actually a question; I don't know, but that's what a naive
> reading suggests...)

Hmm, I don't know for certain. My interpretation is that the CPU cache errors
were just secondary effects of a memory error like the one below, which
appeared in the middle of his reported errors. It was also reported only on
CPU 0 and not on CPU 1:

STATUS d000400000000863 MCGSTATUS 0
MCGCAP 105 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 67
HARDWARE ERROR. This is NOT a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
MISC e00d0fff00000000 ADDR 2cac9678
Northbridge RAM ECC error
ECC syndrome = 1c
bit33 = err cpu1
bit46 = corrected ecc error
bit59 = misc error valid
bit62 = error overflow (multiple errors)
bus error 'local node origin, request didn't time out
generic read mem transaction
memory access, level generic'
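
The "bitNN" lines in a decode like this name bit positions in a 64-bit machine-check status word. As a minimal sketch of how such a word can be tested bit by bit, the snippet below uses the STATUS value quoted above. The helper name set_bits is made up for illustration, and the four positions checked are simply the ones the decode lists; the excerpt does not establish that this STATUS line belongs to the same record as those "bitNN" lines, so the sketch only reports which positions are set and draws no conclusion from them.

```python
# STATUS value copied from the excerpt above.
STATUS = 0xD000400000000863

# Bit positions named in the decode above, with the labels it prints.
DECODE_LABELS = {
    33: "err cpu1",
    46: "corrected ecc error",
    59: "misc error valid",
    62: "error overflow (multiple errors)",
}


def set_bits(value, width=64):
    """Return the positions of all bits that are 1 in value (hypothetical helper)."""
    return [bit for bit in range(width) if (value >> bit) & 1]


active = set_bits(STATUS)
for bit, label in sorted(DECODE_LABELS.items()):
    state = "set" if bit in active else "clear"
    print(f"bit{bit} ({label}): {state}")
```

The same shift-and-mask test works for any other status word pasted in place of STATUS.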

On Intel systems (which I am much more familiar with as far as machine checks
go), corrected ECC errors did not result in additional events in the CPU
caches themselves, but I don't know if AMD is different in this regard. It
could be that both CPUs and a DIMM are failing, but replacing a DIMM is
cheaper and simpler, and you can always replace the CPUs later if the CPU
errors continue. Of course, I can't tell you which DIMM to replace from these
messages, but since the errors are so easily reproducible, you could probably
swap the DIMMs out one at a time to test.
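
When comparing logs between DIMM swaps, it can help to tally which units each CPU reported. The following is a rough sketch only: it assumes record headers of the form "CPU <cpu> <bank> <unit>" as in the lines quoted in this thread, and the function name units_by_cpu is invented for the example.

```python
import re
from collections import defaultdict

# Record header lines copied from the reports quoted in this thread.
LOG = """\
CPU 0 0 data cache
CPU 0 1 instruction cache
CPU 0 2 bus unit
CPU 1 0 data cache
CPU 1 1 instruction cache
CPU 1 2 bus unit
CPU 0 4 northbridge
"""

# Matches "CPU <cpu> <bank> <unit>"; lines such as "CPUID Vendor ..." do not match.
HEADER = re.compile(r"^CPU (\d+) (\d+) (.+)$")


def units_by_cpu(text):
    """Map each CPU number to the set of unit names it reported (hypothetical helper)."""
    seen = defaultdict(set)
    for line in text.splitlines():
        match = HEADER.match(line.strip())
        if match:
            cpu, _bank, unit = match.groups()
            seen[int(cpu)].add(unit)
    return seen


units = units_by_cpu(LOG)
# Units reported by CPU 0 that CPU 1 never reported.
print(sorted(units[0] - units[1]))
```

Running the same tally on the log captured after each swap shows whether the northbridge record has stopped appearing.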
--
John Baldwin
