Date: Sat, 1 Oct 2011 03:23:27 -0700
From: Jeremy Chadwick <freebsd@jdc.parodius.com>
To: Thomas Zander <thomas.e.zander@googlemail.com>
Cc: freebsd-stable <freebsd-stable@freebsd.org>
Subject: Re: Interpreting MCA error output
Message-ID: <20111001102327.GA37434@icarus.home.lan>
In-Reply-To: <CAFU734y3WsVFTpnGoGfbPH4vVBnoz8f=qGvYS4c%2BLya8PFQP_A@mail.gmail.com>
References: <CAFU734y3WsVFTpnGoGfbPH4vVBnoz8f=qGvYS4c%2BLya8PFQP_A@mail.gmail.com>

On Sat, Oct 01, 2011 at 11:08:12AM +0200, Thomas Zander wrote:
> just spotted this MCA event (and subsequently kernel panic):
>
> Oct 1 10:43:42 marvin kernel: MCA: Bank 4, Status 0xf41b210030080a13
> Oct 1 10:43:42 marvin kernel: MCA: Global Cap 0x0000000000000105,
> Status 0x0000000000000007
> Oct 1 10:43:42 marvin kernel: MCA: Vendor "AuthenticAMD", ID 0x40fb2, APIC ID 0
> Oct 1 10:43:42 marvin kernel: MCA: CPU 0 UNCOR OVER BUSLG Responder RD Memory
> Oct 1 10:43:42 marvin kernel: MCA: Address 0xbfa478a0
>
> I'd appreciate if somebody helped in interpreting these messages.

Decoded (more on how to do that later):

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 4 northbridge
ADDR bfa478a0
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = 3036
       bit40 = error found by scrub
       bit45 = uncorrected ecc error
       bit61 = error uncorrected
       bit62 = error overflow (multiple errors)
  bus error 'local node response, request didn't time out
      generic read mem transaction
      memory access, level generic'
STATUS f41b210030080a13 MCGSTATUS 0
APICID 0 SOCKETID 0
CPUID Vendor AMD Family 15 Model 75
(Fields were incomplete)

Explanation, as I understand it.  First, here's a technical reference:

http://www.amd.com/us/Documents/47644A_ecc_embedded.pdf

This is probably the only feature of AMD northbridges and systems I'm
remotely familiar with (I mainly do Intel): your RAM exhibited multiple
multi-bit errors.  For how much can be corrected vs. merely detected,
you'll need to read the technical document (re: ChipKill + Hamming code
combination).

I do not believe FreeBSD has the code to handle ChipKill MCEs
gracefully, so as a result FreeBSD simply panics.  This is normal; all
OSes will panic on an MCE they do not know how to handle.  I deal with
this on Solaris 10 at work on a weekly basis, usually due to overheating
caused by bad datacenter cooling.
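For anyone who wants to cross-check the decode by hand: the FreeBSD summary line (UNCOR OVER BUSLG Responder RD Memory) can be reproduced from the raw Status value. Below is a minimal Python sketch, assuming the architectural MCi_STATUS layout (VAL/OVER/UC/EN/MISCV/ADDRV/PCC in bits 63-57, and a bus error code of the form 0000 1PPT RRRR IILL in the low 16 bits). The meanings given for bits 40 and 45 are taken from the mcelog output in this message, not from a datasheet, and the function and table names are made up for illustration.

```python
# Sketch only: decode the commonly documented bits of an MCi_STATUS value.

# Architectural flag bits (bit position, short name).
STATUS_FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # overflow: another error arrived before this one was read
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting was enabled
    (59, "MISCV"),  # MISC register valid
    (58, "ADDRV"),  # ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# Assumption: labels copied from the mcelog output quoted in this message.
NB_BITS = [
    (45, "uncorrected ecc error"),
    (40, "error found by scrub"),
]

# Fields of a bus error code, 0000 1PPT RRRR IILL.
PP = ["SRC", "RES", "OBS", "GEN"]      # source, responder, observer, generic
RRRR = ["ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP"]
II = ["M", "reserved", "IO", "GEN"]    # memory, reserved, I/O, generic
LL = ["L0", "L1", "L2", "LG"]          # cache level, LG = generic


def decode_status(status):
    """Return a dict describing the flag bits and bus error fields."""
    result = {
        "flags": [name for bit, name in STATUS_FLAGS if (status >> bit) & 1],
        "nb_bits": [text for bit, text in NB_BITS if (status >> bit) & 1],
        "error_code": status & 0xFFFF,
    }
    code = result["error_code"]
    # Bits 15-11 equal to 00001 mark the bus error format.
    if (code & 0xF800) == 0x0800:
        request = (code >> 4) & 0xF
        result["bus"] = {
            "participation": PP[(code >> 9) & 0x3],
            "timeout": bool((code >> 8) & 0x1),
            "request": RRRR[request] if request < len(RRRR) else "reserved",
            "io": II[(code >> 2) & 0x3],
            "level": LL[code & 0x3],
        }
    return result


if __name__ == "__main__":
    print(decode_status(0xF41B210030080A13))
```

For the Status value from the panic this reports VAL, OVER, UC, EN and ADDRV set, and a bus error with a responder (RES), no timeout, a generic read (RD), memory access (M), and generic level (LG), which agrees with both the FreeBSD summary and the mcelog text. It makes no attempt to decode the ChipKill syndrome.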
I know that on Solaris, some forms of ChipKill can be handled
gracefully, which results in the system ceasing to use certain pages
(addressing ranges) of RAM going forward.

In summary, don't be too surprised by the panic.  The "Fields were
incomplete" part I'm not sure about; maybe the ASCII parser expected
more data than FreeBSD provides.  Not sure.

So what should you do?  Replace the RAM.  Which DIMM?  Sadly I don't
know how to determine that.  Some system BIOSes (particularly on AMD
systems I've used) let you do memory tests (similar to memtest86) within
the BIOS, which can then tell you which DIMM slot experienced a problem.
If yours doesn't have that, I would have to say purchase all new RAM
(yes, all of it) and test the individual DIMMs later so you can
determine which is bad.

Decoding the MCE can be done using Linux's mcelog program, which
requires a lot of patching to work on FreeBSD: you'll need to download
the source, apply the patch by hand, *and* put in place a heavily
modified version of memstream.c.  Even then it can only be used to
decode ASCII-provided MCEs; DMI support does not work.  So, once
patched, you use "mcelog --no-dmi --ascii" and provide the MCE text via
stdin (or use --file).

John Baldwin tends to keep up-to-date patches for mcelog here:

http://people.freebsd.org/~jhb/mcelog/

The last build of mcelog I did on FreeBSD was for mcelog-1.0pre2, which
John's patch (at the time) did not work with.  I made my own patch
(dated 2011/02/11), but it looks like John has since updated his patch.
If you need/want mine, I can put it up on the web.

A few moments ago I tried to download mcelog from the official site, but
ftp.kernel.org is presently returning NXDOMAIN for me (i.e. A record not
found).  The same goes for git.kernel.org.  Great.....

I should really work with John to make mcelog a FreeBSD port and just
regularly update it with patches, etc. to work on FreeBSD.
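As a rough illustration of the "provide the MCE text via stdin" step, here is a hedged Python sketch that pulls the MCA records out of syslog text and pipes them to mcelog with the two flags mentioned above. The helper names are hypothetical. Stripping the syslog prefix is an assumption made for tidiness; nothing here says whether the patched mcelog needs it. The second function only works where a FreeBSD-patched mcelog binary is on the PATH.

```python
import subprocess


def extract_mca_lines(log_text):
    """Keep only the kernel MCA records, dropping the syslog prefix."""
    records = []
    for line in log_text.splitlines():
        start = line.find("MCA: ")
        if start != -1:
            records.append(line[start:].rstrip())
    if not records:
        return ""
    return "\n".join(records) + "\n"


def decode_with_mcelog(mca_text, mcelog="mcelog"):
    """Feed MCA text to mcelog on stdin and return its decoded output.

    Assumes a patched mcelog is installed; raises CalledProcessError
    if mcelog exits with a non-zero status.
    """
    proc = subprocess.run(
        [mcelog, "--no-dmi", "--ascii"],
        input=mca_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return proc.stdout


if __name__ == "__main__":
    import sys

    # Usage: python decode_mca.py < /var/log/messages
    text = extract_mca_lines(sys.stdin.read())
    if text:
        print(decode_with_mcelog(text))
```

The extraction step is kept separate from the mcelog call so it can be used on its own, for example to produce clean text to paste somewhere else.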
DMI support and so on I don't think can be added (at least not by me),
but simple ASCII decoding?  Very possible.  An alternative would be for
me to make a CGI version where you could just go to my site and paste in
the FreeBSD MCE, and it would pipe it through mcelog and give you the
output.

Anyway, now I'm rambling, but there ya go.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |