Date: Sat, 17 Jan 2015 13:46:53 -0800 From: Jeremy Chadwick <jdc@koitsu.org> To: Matthias Apitz <guru@unixarea.de>, Eric van Gyzen <eric@vangyzen.net>, freebsd-current@freebsd.org Subject: Re: kernel: MCA: CPU 0 COR (1) internal parity error Message-ID: <20150117214653.GA65337@icarus.home.lan> In-Reply-To: <20150117174326.GA40205@c720-r276659> References: <20150116194539.GA2230@c720-r276659> <54B96EE4.5090702@vangyzen.net> <20150117174326.GA40205@c720-r276659>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, Jan 17, 2015 at 06:43:26PM +0100, Matthias Apitz wrote: > El día Friday, January 16, 2015 a las 03:04:52PM -0500, Eric van Gyzen escribió: > > > On 01/16/2015 14:45, Matthias Apitz wrote: > > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005 > > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000 > > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0 > > > Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error > > > > Try ports/sysutils/mcelog. > > I have installed that port and launched it as > > # mcelog > mcelog.txt > ... > mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors > mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors > mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors > ... > > (the messages are STDERR); > > in 'mcelog.txt' it has for the last event from /var/log/messages: > > Jan 17 18:23:54 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005 > Jan 17 18:23:54 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000 > Jan 17 18:23:54 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0 > Jan 17 18:23:54 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error > > the following lines (the uptime matches): > > ... > HARDWARE ERROR. This is *NOT* a software problem! > Please contact your hardware vendor > MCE 32 > CPU 0 BANK 0 TSC 36eec80fd688 [at 1397 Mhz 0 days 12:0:41 uptime (unreliable)] > MCG status: > MCi status: > Error enabled > MCA: Unknown Error 5 > STATUS 90000040000f0005 MCGSTATUS 0 > MCGCAP c07 APICID 0 SOCKETID 0 > CPUID Vendor Intel Family 6 Model 69 > > Questions: > a) Is the output of mcelog valid (regardless of the msg on STDERR of > 'unsupported model')? It may or may not be reliable. For MCE decoding to work accurately, the software (read: kernel) needs to have full support for the processor model and revision in question. mcelog simply tries to decode the output that the kernel spits out and provide a more "user-friendly" explanation. That isn't as simple as just modifying some table of supported CPUs; it involves reading Intel documentation and implementing what can be figured out through that. VMware has a small KB about this, to give you some insight into the complexity: http://kb.vmware.com/kb/1005184 There are some capabilities of MCA that are "semi-universal" across series of CPUs, so sometimes those can be decoded (mostly) accurately, but other times such isn't the case. Sometimes there are certain MCEs that have be ignored by the kernel (i.e. the kernel MCE support has to be updated to reflect changes in MCEs for that newer model of processor). The version of mcelog available in ports is extremely old, and the amount of work to upgrade it to the latest Linux mcelog (1.08) I imagine would be quite large: http://git.kernel.org/cgit/utils/cpu/mce/mcelog.git The existing FreeBSD port involves a large number of patches written by John Baldwin, and whether or not those can be correctly backported to newer mcelog releases is unknown. I really need to renounce my maintainer flag of that port and let someone else take care of it. > b) Is it worth to contact the dealer or wait until it is broken > completely? To me, the above message indicates that one of the CPU cores is damaged/misbehaving. I cannot determine if it's referring to L1, L2, or L3 cache, but I don't see any clear indicator of that (possibly due to the aforementioned explanation I gave about accuracy). However, I will point you to this thread, which may indicate that the model of CPU in question (or series or models of Intel CPUs) have MCEs that happen which are considered "normal" and are thus not being decoded correctly: https://lists.freebsd.org/pipermail/freebsd-questions/2014-January/255873.html I would suggest providing relevant dmesg lines about your exact processor in this system and possibly ask for help from either John Baldwin or someone on freebsd-hackers@. I myself cannot help with this. The dmesg lines I'm referring to, by the way, look like this (all of them matter, particularly the first two): CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (2833.59-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x10677 Family = 0x6 Model = 0x17 Stepping = 7 Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE> Features2=0x8e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1> AMD Features=0x20100800<SYSCALL,NX,LM> AMD Features2=0x1<LAHF> TSC: P-state invariant, performance statistics The OP of that freebsd-questions thread should have provided this but didn't (instead just says "Intel i3-4310" -- this isn't precise enough), so whether or not you two are using the same CPU is unknown. There simply could be "new MCEs" or changes to the MCA that Intel implemented in some newer models of Core iX that aren't being handled correctly by the kernel (i.e. misreporting or mis-decoding). Good luck! -- | Jeremy Chadwick jdc@koitsu.org | | UNIX Systems Administrator http://jdc.koitsu.org/ | | Making life hard for others since 1977. PGP 4BD6C0CB |
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20150117214653.GA65337>