Date:      Sat, 1 Oct 2011 03:23:27 -0700
From:      Jeremy Chadwick <freebsd@jdc.parodius.com>
To:        Thomas Zander <thomas.e.zander@googlemail.com>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: Interpreting MCA error output
Message-ID:  <20111001102327.GA37434@icarus.home.lan>
In-Reply-To: <CAFU734y3WsVFTpnGoGfbPH4vVBnoz8f=qGvYS4c%2BLya8PFQP_A@mail.gmail.com>
References:  <CAFU734y3WsVFTpnGoGfbPH4vVBnoz8f=qGvYS4c%2BLya8PFQP_A@mail.gmail.com>

On Sat, Oct 01, 2011 at 11:08:12AM +0200, Thomas Zander wrote:
> just spotted this MCA event (and subsequently kernel panic):
> 
> Oct  1 10:43:42 marvin kernel: MCA: Bank 4, Status 0xf41b210030080a13
> Oct  1 10:43:42 marvin kernel: MCA: Global Cap 0x0000000000000105,
> Status 0x0000000000000007
> Oct  1 10:43:42 marvin kernel: MCA: Vendor "AuthenticAMD", ID 0x40fb2, APIC ID 0
> Oct  1 10:43:42 marvin kernel: MCA: CPU 0 UNCOR OVER BUSLG Responder RD Memory
> Oct  1 10:43:42 marvin kernel: MCA: Address 0xbfa478a0
> 
> I'd appreciate if somebody helped in interpreting these messages.

Decoded (more on how to do that later):

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
ADDR bfa478a0
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = 3036
       bit40 = error found by scrub
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
             generic read mem transaction
             memory access, level generic'
STATUS f41b210030080a13 MCGSTATUS 0
APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 75
(Fields were incomplete)
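
For anyone who wants to check the bit-level reading by hand, here is a
minimal Python sketch that pulls the same flags out of the Bank 4 status
value.  Bits 57-63 and the bus error code format (0000 1PPT RRRR IILL)
are the architectural machine-check layout; bits 40 and 45 are taken
from the decoded output above.  The syndrome layout (high byte in status
bits 31:24, low byte in bits 54:47) is my assumption about this
northbridge; it does reproduce the 3036 shown above.  All function and
table names are made up for illustration and are not part of mcelog.

```python
# Sketch only: decode a few fields of the northbridge MCA status value.
STATUS = 0xF41B210030080A13  # "MCA: Bank 4, Status ..." from the report

# Bit number -> meaning. 57-63 are architectural MCi_STATUS flags;
# 40 and 45 are the AMD northbridge bits named in the decoded output.
FLAGS = {
    63: "VAL (status valid)",
    62: "OVER (error overflow)",
    61: "UC (error uncorrected)",
    60: "EN (error reporting enabled)",
    59: "MISCV (MISC register valid)",
    58: "ADDRV (address register valid)",
    57: "PCC (processor context corrupt)",
    45: "UECC (uncorrected ECC error)",
    40: "SCRUB (error found by scrub)",
}

PARTICIPATION = ["source", "responder", "observer", "generic"]
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
MEMORY_IO = ["memory", "reserved", "I/O", "generic"]
LEVEL = ["L0", "L1", "L2", "LG"]


def set_flags(status):
    """Return the names of all known flag bits that are set, high bit first."""
    return [name for bit, name in sorted(FLAGS.items(), reverse=True)
            if (status >> bit) & 1]


def chipkill_syndrome(status):
    """Assemble the 16-bit syndrome (assumed layout, see the text above)."""
    high = (status >> 24) & 0xFF  # syndrome bits 15:8
    low = (status >> 47) & 0xFF   # syndrome bits 7:0
    return (high << 8) | low


def bus_error(status):
    """Decode the low 16 bits if they match the bus error pattern, else None."""
    code = status & 0xFFFF
    if code & 0xF800 != 0x0800:   # bus errors look like 0000 1PPT RRRR IILL
        return None
    request = (code >> 4) & 0xF
    return {
        "participation": PARTICIPATION[(code >> 9) & 0x3],
        "timeout": bool((code >> 8) & 0x1),
        "request": REQUEST[request] if request < len(REQUEST) else "reserved",
        "memory_io": MEMORY_IO[(code >> 2) & 0x3],
        "level": LEVEL[code & 0x3],
    }


if __name__ == "__main__":
    for name in set_flags(STATUS):
        print(name)
    print("syndrome", hex(chipkill_syndrome(STATUS)))  # prints: syndrome 0x3036
    print(bus_error(STATUS))
```

For this status value the bus error fields come out as responder, no
timeout, RD, memory, LG, which lines up with the "BUSLG Responder RD
Memory" text in the FreeBSD log line.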

Here is the explanation as I understand it.  First, a technical ref:

http://www.amd.com/us/Documents/47644A_ecc_embedded.pdf

This is probably the only feature of AMD northbridges and systems I'm
remotely familiar with (I mainly do Intel): your RAM exhibited multiple
multi-bit errors.  For how many bit errors can be corrected versus
merely detected, you'll need to read the technical document (re:
ChipKill + Hamming code combination).

I do not believe FreeBSD has the code to handle ChipKill MCEs
gracefully, so FreeBSD simply panics.  This is normal: all OSes will
panic on an MCE they do not know how to handle.  I deal with this on
Solaris 10 at work on a weekly basis, usually due to thermal problems
caused by bad datacenter cooling.  I know that on Solaris, some forms of
ChipKill error can be handled gracefully, which results in the system
ceasing to use certain pages (address ranges) of RAM going forward.  In
summary, don't be too surprised by the panic.

The "Fields were incomplete" part I'm not sure about; maybe the ASCII
parser expected more data than FreeBSD provides.  Not sure.

So what should you do?  Replace the RAM.  Which DIMM?  Sadly I don't
know how to determine that.  Some system BIOSes (particularly on AMD
systems I've used) let you do memory tests (similar to memtest86) within
the BIOS which can then tell you which DIMM slot experienced a problem.
If yours doesn't have that, I would have to say purchase all new RAM
(yes, all of it) and test the individual DIMMs later so you can
determine which is bad.

Decoding the MCE can be done using Linux's mcelog program, but it
requires a lot of patching to work on FreeBSD: you'll need to download
the source, apply the patch by hand, *and* put in place a heavily
modified version of memstream.c.  Even then it can only be used to
decode ASCII-provided MCEs; DMI support does not work.  So, once the
patches are applied, use "mcelog --no-dmi --ascii" and provide the MCE
text via stdin (or use --file).
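
As a rough illustration of the stdin route, here is a small Python
sketch that picks the "MCA:" lines out of syslog-style text and hands
them to mcelog.  It assumes a FreeBSD-patched mcelog binary is on your
PATH; the function names are hypothetical, and whether mcelog wants the
syslog prefix left on each line is something I have not verified, so
the lines are passed through unmodified.

```python
import subprocess


def mca_lines(log_text):
    """Return only the kernel MCA lines from syslog-style text."""
    return [line for line in log_text.splitlines() if "MCA:" in line]


def decode_with_mcelog(log_text, mcelog="mcelog"):
    """Feed the MCA lines to mcelog on stdin and return its output.

    Assumption: `mcelog` names a FreeBSD-patched build; the two flags
    are the ones given in the text above.
    """
    payload = "\n".join(mca_lines(log_text)) + "\n"
    result = subprocess.run(
        [mcelog, "--no-dmi", "--ascii"],
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if mcelog exits non-zero
    )
    return result.stdout
```

Typical use would be something like
decode_with_mcelog(open("/var/log/messages").read()), adjusting the
path to wherever your kernel messages are logged.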

John Baldwin tends to keep up-to-date patches for mcelog here:

http://people.freebsd.org/~jhb/mcelog/

The last build of mcelog I did on FreeBSD was for mcelog-1.0pre2, which
John's patch (at the time) did not work with.  I made my own patch
(dated 2011/02/11), but it looks like John has since updated his patch.
If you need/want mine, I can put it up on the web.

A few moments ago I tried to download mcelog from the official site, but
ftp.kernel.org is presently returning NXDOMAIN for me (i.e. A record not
found).  The same goes for git.kernel.org.  Great.....

I should really work with John to make mcelog a FreeBSD port and just
regularly update it with patches, etc. to work on FreeBSD.  DMI support
and so on I don't think can be added (at least not by me), but simple
ASCII decoding?  Very possible.

An alternative would be for me to make a CGI version where you could
just go to my site and paste in the FreeBSD MCE, and it would pipe it
through mcelog and give you the output.

Anyway now I'm rambling, but there ya go.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |



