Date: Tue, 18 Oct 2011 11:11:34 +0200
From: John Hay <jhay@meraka.org.za>
To: freebsd-stable@freebsd.org
Subject: Re: MCA: CPU 0 UNCOR PCC DTLB L1 error
Message-ID: <20111018091134.GA8700@zibbi.meraka.csir.co.za>
In-Reply-To: <20110516165123.GA30171@icarus.home.lan>
References: <20110510125220.GA88338@zibbi.meraka.csir.co.za> <BANLkTik79gjQKsdrz_8mQdLc3e9KGiGzzQ@mail.gmail.com> <20110516162319.GA58581@zibbi.meraka.csir.co.za> <20110516165123.GA30171@icarus.home.lan>
Hi Guys,

On Mon, May 16, 2011 at 09:51:23AM -0700, Jeremy Chadwick wrote:
> On Mon, May 16, 2011 at 06:23:19PM +0200, John Hay wrote:
> > On Wed, May 11, 2011 at 05:26:50PM -0500, Alan Cox wrote:
> > > On Tue, May 10, 2011 at 7:52 AM, John Hay <jhay@meraka.org.za> wrote:
> > >
> > > > Hi,
> > > >
> > > > I have seen this panic a few times on a Gigabyte E350N-USB3 running
> > > > 8-STABLE. I have only seen it while in X, but then the machine is
> > > > always in X. At first, I just got these hangs, so bought a
> > > > PCI-express RS232 card and could see these at last. For some reason
> > > > it does not go past this, so I have not been able to get a dump yet.
> > > >
> > > > Have anybody an idea of why this is or how to debug it further? I searched
> > > > the archives and found something similar about a year ago, but it looks
> > > > like it was solved with a fix that got committed.
> > > >
> > > > http://www.freebsd.org/cgi/query-pr.cgi?pr=140338
> > > >
> > > > I have now disabled mca in loader.conf with 'hw.mca.enabled="0"' and I have
> > > > not seen that panic again. I do occasionally see a panic in devfs_open(),
> > > > but I guess that should be handled in another thread.
> > > >
> > > > The kernel is basically a GENERIC kernel with puc uncommented and the
> > > > following in loader.conf
> > > >
> > > > vm.kmem_size="12G"
> > > > hw.mca.enabled="0"
> > > > zfs_load="YES"
> > > > ahci_load="YES"
> > > > xhci_load="YES"
> > > > amdtemp_load="YES"
> > > > ng_ubt_load="YES"
> > > > uplcom_load="YES"
> > > >
> > > > Here is the panic message and after that dmesg.
> > > >
> > > > John
> > > > --
> > > > John Hay -- jhay@meraka.csir.co.za / jhay@FreeBSD.org
> > > >
> > > > ####################################################
> > > > MCA: Bank 0, Status 0xb600000000010015
> > > > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> > > > MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> > > > MCA: CPU 0 UNCOR PCC DTLB L1 error
> > > > MCA: Address 0x8016c4000
> > > >
> > > >
> > > > Fatal trap 28: machine check trap while in user mode
> > > > cpuid = 0; apic id = 00
> > > > instruction pointer = 0x43:0x80156af85
> > > > stack pointer = 0x3b:0x7fffffffcb18
> > > > frame pointer = 0x3b:0x80fe87800
> > > > code segment = base 0x0, limit 0xfffff, type 0x1b
> > > >              = DPL 3, pres 1, long 1, def32 0, gran 1
> > > > processor eflags = interrupt enabled, IOPL = 0
> > > > current process = 2484 (initial thread)
> > > > trap number = 28
> > > > panic: machine check trap
> > > > cpuid = 0
> > > > KDB: stack backtrace:
> > > > #0 0xffffffff80608d5e at kdb_backtrace+0x5e
> > > > #1 0xffffffff805d6707 at panic+0x187
> > > > #2 0xffffffff808bf4c0 at trap_fatal+0x290
> > > > #3 0xffffffff808bfaa9 at trap+0x109
> > > > #4 0xffffffff808a7d94 at calltrap+0x8
> > > > ####################################################
> > > >
> > > >
> > > Please try the following patch:
> > >
> > > Index: x86/x86/mca.c
> > > ===================================================================
> > > --- x86/x86/mca.c (revision 219060)
> > > +++ x86/x86/mca.c (working copy)
> > > @@ -665,7 +665,8 @@ mca_setup(uint64_t mcg_cap)
> > >       * for Erratum 383.
> > >       */
> > >      if (cpu_vendor_id == CPU_VENDOR_AMD &&
> > > -        CPUID_TO_FAMILY(cpu_id) == 0x10 && amd10h_L1TP)
> > > +        (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > > +        CPUID_TO_FAMILY(cpu_id) == 0x14) && amd10h_L1TP)
> > >          workaround_erratum383 = 1;
> > >
> > >      mtx_init(&mca_lock, "mca", NULL, MTX_SPIN);
> > > Index: i386/i386/pmap.c
> > > ===================================================================
> > > --- i386/i386/pmap.c (revision 219060)
> > > +++ i386/i386/pmap.c (working copy)
> > > @@ -758,7 +758,8 @@ pmap_init(void)
> > >       * machine monitor.
> > >       */
> > >      if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > > -        CPUID_TO_FAMILY(cpu_id) == 0x10)
> > > +        (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > > +        CPUID_TO_FAMILY(cpu_id) == 0x14))
> > >          workaround_erratum383 = 1;
> > >
> > >      /*
> > > Index: amd64/amd64/pmap.c
> > > ===================================================================
> > > --- amd64/amd64/pmap.c (revision 219060)
> > > +++ amd64/amd64/pmap.c (working copy)
> > > @@ -727,7 +727,8 @@ pmap_init(void)
> > >       * machine monitor.
> > >       */
> > >      if (vm_guest == VM_GUEST_VM && cpu_vendor_id == CPU_VENDOR_AMD &&
> > > -        CPUID_TO_FAMILY(cpu_id) == 0x10)
> > > +        (CPUID_TO_FAMILY(cpu_id) == 0x10 ||
> > > +        CPUID_TO_FAMILY(cpu_id) == 0x14))
> > >          workaround_erratum383 = 1;
> > >
> > >      /*
> >
> > I have applied the patch, but got another one today. I still do not get
> > a prompt or dump. :-( It just get stuck right after #4. If there is anything
> > more that I can try, just ask.
> >
> > #####################################################################
> > MCA: Bank 0, Status 0xb600000000010015
> > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> > MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> > MCA: CPU 0 UNCOR PCC DTLB L1 error
> > MCA: Address 0x808ace000
> >
> >
> > Fatal trap 28: machine check trap while in user mode
> > cpuid = 1; apic id = 01
> > instruction pointer = 0x43:0x80af206d5
> > stack pointer = 0x3b:0x7fffffffb8e8
> > frame pointer = 0x3b:0x809b92450
> > code segment = base 0x0, limit 0xfffff, type 0x1b
> >              = DPL 3, pres 1, long 1, def32 0, gran 1
> > processor eflags = interrupt enabled, IOPL = 0
> > current process = 22228 (initial thread)
> > trap number = 28
> > panic: machine check trap
> > cpuid = 1
> > KDB: stack backtrace:
> > #0 0xffffffff80608f6e at kdb_backtrace+0x5e
> > #1 0xffffffff805d6917 at panic+0x187
> > #2 0xffffffff808bf7c0 at trap_fatal+0x290
> > #3 0xffffffff808bfda9 at trap+0x109
> > #4 0xffffffff808a8084 at calltrap+0x8
> > #####################################################################
>
> The backtrace doesn't help in this situation. I'm not sure anyone has
> taken the time to explain to you what's going on here exactly. I don't
> know if you're like me, but when a machine panics I generally like to
> know what's going on. :-)
>
> Use of MCA (see Wikipedia for Machine Check Architecture) is generating
> an MCE (see Wikipedia for Machine Check Exception). MCEs are generated
> by hardware when "something happens" -- they usually indicate a
> failure (bad RAM, CPU cache failing, etc.).
>
> Certain MCEs are considered "normal"; for example, L2 cache (on-die in
> the CPU) being auto-corrected by ECC (that's ECC on-die, not ECC RAM
> like system RAM; this feature is only available on certain classes of
> CPUs) may be normal if seen, say, once every few months. A large sum of
> them, however, is not normal.
>
> MCE handling is done in the kernel.
> Certain MCEs have to be ignored,
> and therefore there are handlers for those in the kernel.
>
> MCEs vary greatly per every model (not class, but model) of CPU. For
> example, Intel's documentation on their MCEs is immense and very complex
> given all the different CPU models and series'.
>
> Any MCE without a handler will generate an exception (kernel panic) like
> what you see above. This is normal on FreeBSD, as well as Solaris and
> many other OSes. It's basically mandatory. The reason being, if the
> situation/condition isn't known to be something that can be ignored, the
> hardware may be in a state of disarray and cannot be trusted. Hence,
> panic. The backtrace will therefore always be very short and indicate
> an intentional panic.
>
> The MCE messages shown in FreeBSD are not very user-friendly, meaning
> you can't take what you see and go "omg!!! L1 cache failure!!" because
> that's not necessarily what that message means. MCA is complex, and
> again, like I said, varies per model of CPU.
>
> There is a utility on Linux called mcelog that can decode the messages
> to some degree. John Baldwin ported this to FreeBSD (it's not in ports)
> and I've been occasionally downloading it and ensuring the patches work
> correctly + utility compiles and works (I have patches for patches,
> basically; no I haven't put them up anywhere). "mcelog --ascii" will
> read data from stdin, specifically the messages you see from the kernel,
> and it outputs something a little more friendly.
>
> In your case, however, mcelog does not have support for your specific
> model of CPU. Possibly too new? Here's the output that is returned:
>
> $ ./mcelog --no-dmi --ascii
> MCA: Bank 0, Status 0xb600000000010015
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
> MCA: Vendor "AuthenticAMD", ID 0x500f10, APIC ID 0
> MCA: CPU 0 UNCOR PCC DTLB L1 error
> MCA: Address 0x808ace000
>
> mcelog: Unknown CPU type vendor 2 family 14 model 1
> HARDWARE ERROR.
> This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 BANK 0
> ADDR 808ace000
> STATUS b600000000010015 MCGSTATUS 4
> MCGCAP 106 APICID 0 SOCKETID 0
> CPUID Vendor AMD Family 20 Model 1
>
> I'm not familiar with AMD CPUs so I can't really look up what's going on
> here or what the MCE indicates, but this information may help others on
> this list.
>
> A workaround -- though risky -- may be to disable MCA entirely by
> setting hw.mca.enabled="0" in /boot/loader.conf and rebooting. This
> will ensure your system won't panic whenever *any* MCE is seen. Older
> FreeBSD defaulted to MCA being off. However, since I don't know what
> the MCE indicates, it could be fatal (e.g. panic'ing might be a better
> choice). Hard to say at this point.
>
> Hope this helps educate in one way or another. :-)
>

Just to say that I have been running this box with hw.mca.enabled="0" in
loader.conf and it has been stable since. I do see the occasional coredump
of npviewer.bin, but I see that on other boxes too. So I think that maybe
this particular error might be a case where FreeBSD does something in a way
that AMD did not expect on these processors.

John
--
John Hay -- jhay@meraka.csir.co.za / jhay@FreeBSD.org
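
As a reading aid for the status word and CPU ID quoted in this thread, here is a minimal sketch in C of how those two numbers can be picked apart by hand. It assumes the architectural MCi_STATUS bit layout (VAL, OVER, UC, EN, MISCV, ADDRV, PCC in bits 63 down to 57) and the TLB form of the 16-bit error code (0000 0000 0001 TTLL). All macro and function names are hypothetical; none of this is taken from FreeBSD's mca.c or from mcelog, and the model rule in cpuid_model() is the AMD-style one (extended model only counts when the base family is 0xf).

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Architectural flag bits in an MCi_STATUS register. */
#define MC_STATUS_VAL   (1ULL << 63)  /* register contents are valid */
#define MC_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MC_STATUS_UC    (1ULL << 61)  /* error was not corrected */
#define MC_STATUS_EN    (1ULL << 60)  /* reporting was enabled for this error */
#define MC_STATUS_MISCV (1ULL << 59)  /* MCi_MISC holds valid data */
#define MC_STATUS_ADDRV (1ULL << 58)  /* MCi_ADDR holds a valid address */
#define MC_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */

/* True if the low 16 bits have the TLB error form 0000 0000 0001 TTLL. */
static int mca_is_tlb_error(uint64_t status)
{
    return (status & 0xfff0) == 0x0010;
}

/* TT field: 0 = instruction, 1 = data, 2 = generic. */
static unsigned mca_tlb_type(uint64_t status)
{
    return (unsigned)((status >> 2) & 3);
}

/* LL field: 0 = L0, 1 = L1, 2 = L2, 3 = generic. */
static unsigned mca_tlb_level(uint64_t status)
{
    return (unsigned)(status & 3);
}

/* Display family: base family, plus extended family when base is 0xf. */
static unsigned cpuid_family(uint32_t id)
{
    unsigned base = (id >> 8) & 0xf;
    unsigned ext = (id >> 20) & 0xff;
    return base == 0xf ? base + ext : base;
}

/* Display model (AMD-style rule): extended model applies when base family is 0xf. */
static unsigned cpuid_model(uint32_t id)
{
    unsigned base_family = (id >> 8) & 0xf;
    unsigned model = (id >> 4) & 0xf;
    unsigned ext = (id >> 16) & 0xf;
    return base_family == 0xf ? (ext << 4) | model : model;
}

/* Write a short summary into buf; returns what snprintf() returns. */
static int mca_describe(uint64_t status, char *buf, size_t len)
{
    static const char *const tt[] = { "I", "D", "G", "?" };
    static const char *const ll[] = { "L0", "L1", "L2", "LG" };

    assert(buf != NULL && len > 0);
    if (!mca_is_tlb_error(status))
        return snprintf(buf, len, "non-TLB error code 0x%04x",
            (unsigned)(status & 0xffff));
    return snprintf(buf, len, "%s%s%sTLB %s error",
        (status & MC_STATUS_UC) ? "UNCOR " : "COR ",
        (status & MC_STATUS_PCC) ? "PCC " : "",
        tt[mca_tlb_type(status)], ll[mca_tlb_level(status)]);
}
```

Fed the values from the panic above, status 0xb600000000010015 has VAL, UC, EN, ADDRV and PCC set and decodes as a data TLB error at level L1, so mca_describe() produces the same "UNCOR PCC DTLB L1 error" wording the kernel printed. ID 0x500f10 comes out as family 0x14 (decimal 20), model 1, which is the family the quoted patch adds to the Erratum 383 checks.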