Date: Sun, 18 Jan 2015 07:08:43 +0100
From: Matthias Apitz <guru@unixarea.de>
To: freebsd-hackers@freebsd.org
Cc: Jeremy Chadwick <jdc@koitsu.org>
Subject: Fwd: kernel: MCA: CPU 0 COR (1) internal parity error
Message-ID: <20150118060843.GA1184@c720-r276659>

Hello,

I have been running a recent -HEAD (r276659) for some days on an Acer C720 Chromebook, and it works very nicely and fast (I really have never seen such a fast KDE4 desktop).

From time to time (let's say 2-3 times a day) I see messages like this in /var/log/messages:

Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error

The kernel is:

# uname -a
FreeBSD c720-r276659 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r276659M: Tue Jan 6 12:55:25 CET 2015 guru@vm-poudriere-r269739:/usr/local/acerC720/obj/usr/local/acerC720/src/sys/GENERIC i386

i.e. the i386 version (because I compile everything, kernel and ports, in a VMbox).

I'm attaching below the complete 'dmesg' lines with the detailed information about the CPU.

I raised questions about these MCA messages in freebsd-current@ and was pointed to a tool in ports/sysutils/mcelog. Jeremy Chadwick <jdc@koitsu.org>, the maintainer of mcelog, gave some hints about the issue (see below) and asked me to bring this up in freebsd-hackers@.

Do these messages really indicate a hardware problem, or is our kernel misreporting or mis-decoding some hardware information? Despite the messages, the system does not show any other faults or panics.

Thanks

	matthias

----- Forwarded message from Jeremy Chadwick <jdc@koitsu.org> -----

Date: Sat, 17 Jan 2015 13:46:53 -0800
From: Jeremy Chadwick <jdc@koitsu.org>
To: Matthias Apitz <guru@unixarea.de>, Eric van Gyzen <eric@vangyzen.net>, freebsd-current@freebsd.org
Subject: Re: kernel: MCA: CPU 0 COR (1) internal parity error

On Sat, Jan 17, 2015 at 06:43:26PM +0100, Matthias Apitz wrote:
> On Friday, January 16, 2015 at 03:04:52PM -0500, Eric van Gyzen wrote:
>
> > On 01/16/2015 14:45, Matthias Apitz wrote:
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> >
> > Try ports/sysutils/mcelog.
>
> I have installed that port and launched it as
>
> # mcelog > mcelog.txt
> ...
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> ...
>
> (the messages are STDERR);
>
> in 'mcelog.txt' it has for the last event from /var/log/messages:
>
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> Jan 17 18:23:54 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
>
> the following lines (the uptime matches):
>
> ...
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MCE 32
> CPU 0 BANK 0 TSC 36eec80fd688 [at 1397 Mhz 0 days 12:0:41 uptime (unreliable)]
> MCG status:
> MCi status:
> Error enabled
> MCA: Unknown Error 5
> STATUS 90000040000f0005 MCGSTATUS 0
> MCGCAP c07 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 69
>
> Questions:
> a) Is the output of mcelog valid (regardless of the msg on STDERR of
> 'unsupported model')?

It may or may not be reliable. For MCE decoding to work accurately, the software (read: kernel) needs to have full support for the processor model and revision in question. mcelog simply tries to decode the output that the kernel spits out and provide a more "user-friendly" explanation.

That isn't as simple as just modifying some table of supported CPUs; it involves reading Intel documentation and implementing what can be figured out from it. VMware has a small KB article about this, to give you some insight into the complexity:

http://kb.vmware.com/kb/1005184

Some capabilities of MCA are "semi-universal" across series of CPUs, so sometimes those can be decoded (mostly) accurately, but at other times that isn't the case. Sometimes certain MCEs have to be ignored by the kernel (i.e. the kernel MCE support has to be updated to reflect changes in MCEs for that newer model of processor).

The version of mcelog available in ports is extremely old, and I imagine the amount of work needed to upgrade it to the latest Linux mcelog (1.08) would be quite large:

http://git.kernel.org/cgit/utils/cpu/mce/mcelog.git

The existing FreeBSD port involves a large number of patches written by John Baldwin, and whether or not those can be correctly ported to newer mcelog releases is unknown. I really need to renounce my maintainer flag on that port and let someone else take care of it.

> b) Is it worth contacting the dealer, or should I wait until it is broken
> completely?

To me, the above message indicates that one of the CPU cores is damaged/misbehaving.
I cannot determine whether it refers to the L1, L2, or L3 cache; I don't see any clear indicator of that (possibly due to the aforementioned explanation I gave about accuracy).

However, I will point you to this thread, which may indicate that the CPU model in question (or a series of Intel CPU models) produces MCEs that are considered "normal" and are thus not being decoded correctly:

https://lists.freebsd.org/pipermail/freebsd-questions/2014-January/255873.html

I would suggest providing the relevant dmesg lines about the exact processor in this system and possibly asking for help from either John Baldwin or someone on freebsd-hackers@. I myself cannot help with this.

The dmesg lines I'm referring to, by the way, look like this (all of them matter, particularly the first two):

CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (2833.59-MHz K8-class CPU)
  Origin = "GenuineIntel" Id = 0x10677 Family = 0x6 Model = 0x17 Stepping = 7
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics

The OP of that freebsd-questions thread should have provided this but didn't (instead just says "Intel i3-4310" -- this isn't precise enough), so whether or not you two are using the same CPU is unknown.

There could simply be "new MCEs" or changes to the MCA that Intel implemented in some newer models of Core iX that aren't being handled correctly by the kernel (i.e. misreporting or mis-decoding).

Good luck!

-- 
| Jeremy Chadwick jdc@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

----- End forwarded message -----

Here comes the 'dmesg' output:

Copyright (c) 1992-2015 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
  The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-CURRENT #0 r276659M: Tue Jan 6 12:55:25 CET 2015
  guru@vm-poudriere-r269739:/usr/local/acerC720/obj/usr/local/acerC720/src/sys/GENERIC i386
FreeBSD clang version 3.5.0 (tags/RELEASE_350/final 216957) 20141124
VT: running with driver "vga".
CPU: Intel(R) Celeron(R) 2955U @ 1.40GHz (1396.80-MHz 686-class CPU)
  Origin="GenuineIntel" Id=0x40651 Family=0x6 Model=0x45 Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ddaebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,<b11>,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100000<NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x21<LAHF,ABM>
  Structured Extended Features=0x2603<FSGSBASE,TSCADJ,ERMS,INVPCID>
  XSAVE Features=0x1<XSAVEOPT>
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory = 2079825920 (1983 MB)
avail memory = 2014580736 (1921 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <CORE COREBOOT>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 2
ioapic0 <Version 2.0> irqs 0-39 on motherboard
Cuse4BSD v0.1.33 @ /dev/cuse
random: entropy device infrastructure driver
random: selecting highest priority adaptor <Dummy>
kbd1 at kbdmux0
module_register_init: MOD_LOAD (vesa, 0xc0fb0310, 0) error 19
random: live provider: "Intel Secure Key RNG"
random: SOFT: yarrow init()
random: selecting highest priority adaptor <Yarrow>
vtvga0: <vt_vga driver> on motherboard
acpi0: <CORE COREBOOT> on motherboard
acpi0: Power Button (fixed)
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
Event timer "HPET4" frequency 14318180 Hz quality 440
Event timer "HPET5" frequency 14318180 Hz quality 440
Event timer "HPET6" frequency 14318180 Hz quality 440
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
atrtc0: <AT realtime clock> port 0x70-0x77 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
acpi_ec0: <Embedded Controller: GPE 0x24> port 0x62,0x66 on acpi0
acpi_lid0: <Control Method Lid Switch> on acpi0
acpi_button0: <Power Button> on acpi0
acpi_button1: <Sleep Button> irq 37 on acpi0
acpi_button2: <Sleep Button> irq 38 on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
vgapci0: <VGA-compatible display> port 0x1800-0x183f mem 0xe0000000-0xe03fffff,0xd0000000-0xdfffffff at device 2.0 on pci0
vgapci0: Boot video device
hdac0: <Intel Haswell HDA Controller> mem 0xe0510000-0xe0513fff at device 3.0 on pci0
xhci0: <Intel Panther Point USB 3.0 controller> mem 0xe0500000-0xe050ffff at device 20.0 on pci0
xhci0: 32 byte context size.
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
pci0: <base peripheral, DMA controller> at device 21.0 (no driver attached)
ig4iic0: <Intel Lynx Point-LP I2C Controller-1> mem 0xe051a000-0xe051afff,0xe051b000-0xe051bfff at device 21.1 on pci0
ig4iic0: Using MSI
type 44570140 params 001f1fee general 55000000 (updated 55000004) swltr 00000800 autoltr 00000800 version 3131352a SS_SCL_HCNT=00000190 LCNT=000001d6 FS_SCL_HCNT=0000003c LCNT=00000082 HOLD 00000001
ig4iic1: <Intel Lynx Point-LP I2C Controller-2> mem 0xe051c000-0xe051cfff,0xe051d000-0xe051dfff at device 21.2 on pci0
ig4iic1: Using MSI
type 44570140 params 001f1fee general 55000000 (updated 55000004) swltr 00000800 autoltr 00000800 version 3131352a SS_SCL_HCNT=00000190 LCNT=000001d6 FS_SCL_HCNT=0000003c LCNT=00000082 HOLD 00000001
hdac1: <Intel Lynx Point-LP HDA Controller> mem 0xe0514000-0xe0517fff at device 27.0 on pci0
pcib1: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci1: <ACPI PCI bus> on pcib1
ath0: <Atheros AR946x/AR948x> mem 0xe0400000-0xe047ffff at device 0.0 on pci1
ar9300_attach: calling ar9300_hw_attach
ar9300_hw_attach: calling ar9300_eeprom_attach
ar9300_flash_map: unimplemented for now
Restoring Cal data from DRAM
Restoring Cal data from EEPROM
Restoring Cal data from Flash
Restoring Cal data from Flash
Restoring Cal data from OTP
ar9300_hw_attach: ar9300_eeprom_attach returned 0
ath0: [HT] enabling HT modes
ath0: [HT] enabling short-GI in 20MHz mode
ath0: [HT] 1 stream STBC receive enabled
ath0: [HT] 1 stream STBC transmit enabled
ath0: [HT] 2 RX streams; 2 TX streams
ath0: AR9460 mac 640.2 RF5110 phy 1924.13
ath0: 2GHz radio: 0x0000; 5GHz radio: 0x0000
ehci0: <Intel Lynx Point LP USB 2.0 controller USB> mem 0xe051f800-0xe051fbff at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
ahci0: <Intel Lynx Point-LP AHCI SATA controller> port 0x1860-0x1867,0x1870-0x1873,0x1868-0x186f,0x1874-0x1877,0x1840-0x185f mem 0xe051f000-0xe051f7ff irq 22 at device 31.2 on pci0
ahci0: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
ahcich0: <AHCI channel> at channel 0 on ahci0
acpi_tz0: <Thermal Zone> on acpi0
acpi_acad0: <AC Adapter> on acpi0
battery0: <ACPI Control Method Battery> on acpi0
atkbdc0: <Keyboard controller (i8042)> port 0x60,0x64 irq 1 on acpi0
atkbd0: <AT Keyboard> irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
pmtimer0 on isa0
ata0: <ATA channel> at port 0x1f0-0x1f7,0x3f6 irq 14 on isa0
ata1: <ATA channel> at port 0x170-0x177,0x376 irq 15 on isa0
ppc0: parallel port not found.
coretemp0: <CPU On-Die Thermal Sensors> on cpu0
est0: <Enhanced SpeedStep Frequency Control> on cpu0
coretemp1: <CPU On-Die Thermal Sensors> on cpu1
est1: <Enhanced SpeedStep Frequency Control> on cpu1
Timecounters tick every 1.000 msec
IP Filter: v5.1.2 initialized. Default = pass all, Logging = enabled
hdacc0: <Intel Haswell HDA CODEC> at cad 0 on hdac0
hdaa0: <Intel Haswell Audio Function Group> at nid 1 on hdacc0
pcm0: <Intel Haswell (HDMI/DP 8ch)> at nid 3 on hdaa0
smbus0: <System Management Bus> on ig4iic0
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
ugen0.1: <0x8086> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen1.1: <Intel> at usbus1
uhub1: <Intel EHCI root HUB, class 9/0, rev 2.00/1.00, addr 1> on usbus1
uhub0: 13 ports with 13 removable, self powered
uhub1: 2 ports with 2 removable, self powered
smbus0: Probed address 0x67
No address ptr set, parent smbus No address ptr set
isl_probe called on unknown I2C device: 103
ugen0.2: <SunplusIT Inc> at usbus0
ugen1.2: <vendor 0x8087> at usbus1
uhub2: <vendor 0x8087 product 0x8000, class 9/0, rev 2.00/0.04, addr 2> on usbus1
uhub2: 8 ports with 8 removable, self powered
cyapa0: <Cypress APA I2C Trackpad> on smbus0
cyapa0: cyapa init status 8f
cyapa0: CYTRA-103006-00 buttons=LM- res=870x470
smbus1: <System Management Bus> on ig4iic1
smbus1: Probed address 0x44
No address ptr set, parent smbus No address ptr set
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
cyapa_probe called on unknown I2C device: 68
random: unblocking device.
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
isl0: <ISL Digital Ambient Light Sensor> on smbus1
isl0: Sending command 32
isl0: Sending command 64
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <TS128GMTS400 N0815B> ATA-9 SATA 3.x device
ada0: Serial Number B862500493
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 1024bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
isl0: Sending command 96
hdacc1: <Realtek (0x0283) HDA CODEC> at cad 0 on hdac1
hdaa1: <Realtek (0x0283) Audio Function Group> at nid 1 on hdacc1
pcm1: <Realtek (0x0283) (Analog 2.0+HP/2.0)> at nid 20,33 and 26,25 on hdaa1
SMP: AP CPU #1 Launched!
Timecounter "TSC" frequency 1396798064 Hz quality 1000
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
ugen0.3: <Unknown> at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
Trying to mount root from ufs:/dev/ada0p2 [rw,noatime]...
wlan0: Ethernet address: 80:56:f2:83:c1:17
wlan0: link state changed to UP
info: [drm] Initialized drm 1.1.0 20060810
MCA: Bank 0, Status 0x90000040000f0005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 2
MCA: CPU 1 COR (1) internal parity error
MCA: Bank 0, Status 0x90000040000f0005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 2
MCA: CPU 1 COR (1) internal parity error

-- 
Matthias Apitz, guru@unixarea.de, http://www.unixarea.de/ +49-170-4527211
1989-2014: The Wall was torn down so that we go to war together again.
El Muro ha sido derribado para que nos unimos en ir a la guerra otra vez.
Diese Grenze wurde aufgehoben damit wir gemeinsam wieder in den Krieg ziehen.
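
For anyone who wants to check the "Status" and "Global Cap" values from the MCA lines by hand, here is a minimal sketch in Python. It assumes the architectural IA32_MCi_STATUS and IA32_MCG_CAP bit layouts as I understand them from the Intel SDM (machine-check architecture chapter). The function names, dictionary keys and the small table of simple error codes are illustrative only and are not taken from the FreeBSD kernel or from mcelog.

```python
# Sketch only: field positions are assumed from the Intel SDM's
# architectural MCA description; all names here are hypothetical.

SIMPLE_CODES = {
    0x0000: "no error",
    0x0001: "unclassified",
    0x0002: "microcode ROM parity error",
    0x0003: "external error",
    0x0004: "FRC error",
    0x0005: "internal parity error",
}


def decode_mci_status(status):
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    def bit(n):
        return bool((status >> n) & 1)

    code = status & 0xFFFF  # bits 15:0, MCA error code
    return {
        "valid": bit(63),
        "overflow": bit(62),
        "uncorrected": bit(61),
        "enabled": bit(60),
        "miscv": bit(59),
        "addrv": bit(58),
        "pcc": bit(57),
        # Bits 52:38 hold the corrected error count; this field is only
        # meaningful when MCG_CAP reports CMCI support (assumption).
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": code,
        "mca_code_text": SIMPLE_CODES.get(code, "compound or unknown code"),
    }


def decode_mcg_cap(cap):
    """Split IA32_MCG_CAP into the bank count and a few capability flags."""
    return {
        "banks": cap & 0xFF,             # bits 7:0
        "ctl_p": bool(cap & (1 << 8)),   # IA32_MCG_CTL present
        "ext_p": bool(cap & (1 << 9)),   # extended state registers present
        "cmci_p": bool(cap & (1 << 10)), # corrected-error interrupt support
        "tes_p": bool(cap & (1 << 11)),  # threshold-based error status
    }


if __name__ == "__main__":
    # Values copied from the MCA lines in the message above.
    for key, value in decode_mci_status(0x90000040000F0005).items():
        print(f"{key:20} {value}")
    print(decode_mcg_cap(0x0C07))
```

Under these assumptions the status word decodes as valid and enabled, not uncorrected, with a corrected error count of 1 and MCA error code 0x0005. That agrees with the kernel's "COR (1) internal parity error" line, so the kernel's reading of the architectural fields at least looks self-consistent. Whether the event points to a real hardware defect is a separate question that this sketch cannot answer.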
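
The mcelog warning on STDERR says "Family 6 Model 45", while the decoded record says "Family 6 Model 69". A short sketch of how the CPUID signature ("ID 0x40651" in the MCA lines, "Id=0x40651" in dmesg) splits into family, model and stepping may help here. It assumes the usual Intel convention that the extended model field is combined with the base model for families 0x6 and 0xF. The function name is made up for illustration.

```python
# Sketch only: decodes a CPUID leaf-1 EAX signature under the standard
# Intel family/model/stepping convention; the function name is hypothetical.

def decode_cpuid_signature(sig):
    """Return (family, model, stepping) for a CPUID signature value."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # The extended model only applies to families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


if __name__ == "__main__":
    # 0x40651 is the Celeron 2955U from this report; 0x10677 is the
    # Q9550 from the example dmesg lines in the forwarded message.
    for sig in (0x40651, 0x10677):
        family, model, stepping = decode_cpuid_signature(sig)
        print(f"Id=0x{sig:x}: family {family}, "
              f"model {model} (0x{model:x}), stepping {stepping}")
```

For 0x40651 this gives family 6, model 0x45 and stepping 1, matching the dmesg line. Since 0x45 is 69 in decimal, the two mcelog outputs would be consistent with each other if the STDERR warning prints the model in hexadecimal without a prefix. That is a guess from the numbers, not something checked against the mcelog source.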