From: Matthias Apitz
Date: Sun, 18 Jan 2015 07:08:43 +0100
To: freebsd-hackers@freebsd.org
Cc: Jeremy Chadwick
Subject: Fwd: kernel: MCA: CPU 0 COR (1) internal parity error
Message-ID: <20150118060843.GA1184@c720-r276659>

Hello,

For the past few days I have been running a recent -HEAD (r276659) on an
Acer C720 Chromebook. It works very nicely and is fast (I really have
never seen such a fast KDE4 desktop).

From time to time (say 2-3 times a day) I see messages like this in
/var/log/messages:

Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error

The kernel is:

# uname -a
FreeBSD c720-r276659 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r276659M: Tue Jan 6 12:55:25 CET 2015 guru@vm-poudriere-r269739:/usr/local/acerC720/obj/usr/local/acerC720/src/sys/GENERIC i386

i.e. the i386 version (because I compile everything, kernel and ports,
in a VMbox).

Below I attach the complete 'dmesg' output, including the details about
the CPU. I asked about these MCA messages on freebsd-current@ and was
pointed to the tool in ports/sysutils/mcelog. Jeremy Chadwick, the
maintainer of mcelog, gave some hints about the issue (see below) and
asked me to bring this up on freebsd-hackers@.

Do these messages really indicate a hardware problem, or is our kernel
misreporting or mis-decoding some hardware information? Despite the
messages, the system does not show any other faults or panics.
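
For reference, the status word in the log can also be decoded by hand.
The following is a minimal sketch, assuming the architectural
IA32_MCi_STATUS bit layout described in the Intel SDM; the function and
field names are made up for this example and it says nothing about the
model-specific bits (31:16). With this layout the value reads as a
valid, corrected (UC=0) error with a corrected-error count of 1 and MCA
error code 0x0005, which the SDM lists as "internal parity error". That
seems to match the kernel's "COR (1) internal parity error" line.

```python
# Sketch only: decode the architectural fields of IA32_MCi_STATUS.
# Bit positions are taken from the Intel SDM; the names are illustrative.

def decode_mci_status(status):
    """Return the architectural fields of a 64-bit MCi_STATUS value."""

    def bit(n):
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: register holds a logged error
        "overflow": bit(62),         # OVER: an earlier error was overwritten
        "uncorrected": bit(61),      # UC: False means the error was corrected
        "enabled": bit(60),          # EN: reporting enabled for this error
        "misc_valid": bit(59),       # MISCV: MCi_MISC holds extra data
        "addr_valid": bit(58),       # ADDRV: MCi_ADDR holds an address
        "context_corrupt": bit(57),  # PCC: processor context corrupt
        # Bits 52:38, meaningful when MCG_CAP bit 10 (CMCI_P) is set,
        # which is the case for the Global Cap value 0xc07 in the log.
        "corrected_count": (status >> 38) & 0x7FFF,
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_error_code": status & 0xFFFF,               # bits 15:0
    }


fields = decode_mci_status(0x90000040000F0005)
for name, value in fields.items():
    print(f"{name:20} {value}")
```

Nothing here tells whether the parity error is in a cache or elsewhere;
that would need the model-specific code (0x000f here) and Intel's
documentation for this CPU model.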

Thanks

	matthias

----- Forwarded message from Jeremy Chadwick -----

Date: Sat, 17 Jan 2015 13:46:53 -0800
From: Jeremy Chadwick
To: Matthias Apitz, Eric van Gyzen, freebsd-current@freebsd.org
Subject: Re: kernel: MCA: CPU 0 COR (1) internal parity error

On Sat, Jan 17, 2015 at 06:43:26PM +0100, Matthias Apitz wrote:
> On Friday, January 16, 2015 at 03:04:52PM -0500, Eric van Gyzen wrote:
>
> > On 01/16/2015 14:45, Matthias Apitz wrote:
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> > > Jan 16 12:04:24 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
> >
> > Try ports/sysutils/mcelog.
>
> I have installed that port and launched it as
>
> # mcelog > mcelog.txt
> ...
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> mcelog: Unsupported new Family 6 Model 45 CPU: only decoding architectural errors
> ...
>
> (these messages go to STDERR).
>
> For the last event in /var/log/messages:
>
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Bank 0, Status 0x90000040000f0005
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
> Jan 17 18:23:54 c720-r276659 kernel: MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 0
> Jan 17 18:23:54 c720-r276659 kernel: MCA: CPU 0 COR (1) internal parity error
>
> 'mcelog.txt' contains the following lines (the uptime matches):
>
> ...
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> MCE 32
> CPU 0 BANK 0 TSC 36eec80fd688 [at 1397 Mhz 0 days 12:0:41 uptime (unreliable)]
> MCG status:
> MCi status:
> Error enabled
> MCA: Unknown Error 5
> STATUS 90000040000f0005 MCGSTATUS 0
> MCGCAP c07 APICID 0 SOCKETID 0
> CPUID Vendor Intel Family 6 Model 69
>
> Questions:
> a) Is the output of mcelog valid (regardless of the msg on STDERR of
>    'unsupported model')?

It may or may not be reliable. For MCE decoding to work accurately, the
software (read: kernel) needs to have full support for the processor
model and revision in question. mcelog simply tries to decode the
output that the kernel spits out and provide a more "user-friendly"
explanation. Adding support isn't as simple as just modifying some
table of supported CPUs; it involves reading Intel documentation and
implementing what can be figured out from that. VMware has a small KB
about this, to give you some insight into the complexity:

http://kb.vmware.com/kb/1005184

There are some capabilities of MCA that are "semi-universal" across
series of CPUs, so sometimes those can be decoded (mostly) accurately,
but other times that isn't the case. Sometimes there are certain MCEs
that have to be ignored by the kernel (i.e. the kernel MCE support has
to be updated to reflect changes in MCEs for that newer model of
processor).

The version of mcelog available in ports is extremely old, and I
imagine the amount of work to upgrade it to the latest Linux mcelog
(1.08) would be quite large:

http://git.kernel.org/cgit/utils/cpu/mce/mcelog.git

The existing FreeBSD port involves a large number of patches written by
John Baldwin, and whether or not those can be correctly ported to newer
mcelog releases is unknown. I really need to give up maintainership of
that port and let someone else take care of it.

> b) Is it worth contacting the dealer, or should I wait until it is
>    broken completely?

To me, the above message indicates that one of the CPU cores is
damaged/misbehaving.
I cannot determine whether it's referring to L1, L2, or L3 cache; I
don't see any clear indicator of that (possibly due to the
aforementioned explanation I gave about accuracy).

However, I will point you to this thread, which may indicate that the
model of CPU in question (or a series of Intel CPU models) has MCEs
that are considered "normal" and are thus not being decoded correctly:

https://lists.freebsd.org/pipermail/freebsd-questions/2014-January/255873.html

I would suggest providing the relevant dmesg lines about the exact
processor in this system and possibly asking for help from either John
Baldwin or someone on freebsd-hackers@. I myself cannot help with this.

The dmesg lines I'm referring to, by the way, look like this (all of
them matter, particularly the first two):

CPU: Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz (2833.59-MHz K8-class CPU)
  Origin = "GenuineIntel" Id = 0x10677 Family = 0x6 Model = 0x17 Stepping = 7
  Features=0xbfebfbff
  Features2=0x8e3fd
  AMD Features=0x20100800
  AMD Features2=0x1
  TSC: P-state invariant, performance statistics

The OP of that freebsd-questions thread should have provided this but
didn't (instead just says "Intel i3-4310" -- this isn't precise
enough), so whether or not you two are using the same CPU is unknown.

There simply could be "new MCEs" or changes to the MCA that Intel
implemented in some newer models of Core iX that aren't being handled
correctly by the kernel (i.e. misreporting or mis-decoding).

Good luck!

-- 
| Jeremy Chadwick jdc@koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Making life hard for others since 1977. PGP 4BD6C0CB |

----- End forwarded message -----

Here comes the 'dmesg' output:

Copyright (c) 1992-2015 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-CURRENT #0 r276659M: Tue Jan 6 12:55:25 CET 2015
    guru@vm-poudriere-r269739:/usr/local/acerC720/obj/usr/local/acerC720/src/sys/GENERIC i386
FreeBSD clang version 3.5.0 (tags/RELEASE_350/final 216957) 20141124
VT: running with driver "vga".
CPU: Intel(R) Celeron(R) 2955U @ 1.40GHz (1396.80-MHz 686-class CPU)
  Origin="GenuineIntel" Id=0x40651 Family=0x6 Model=0x45 Stepping=1
  Features=0xbfebfbff
  Features2=0x4ddaebbf,CX16,xTPR,PDCM,PCID,SSE4.1,SSE4.2,MOVBE,POPCNT,TSCDLT,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100000
  AMD Features2=0x21
  Structured Extended Features=0x2603
  XSAVE Features=0x1
  VT-x: (disabled in BIOS) PAT,HLT,MTF,PAUSE,EPT,UG,VPID
  TSC: P-state invariant, performance statistics
real memory = 2079825920 (1983 MB)
avail memory = 2014580736 (1921 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 2
ioapic0 irqs 0-39 on motherboard
Cuse4BSD v0.1.33 @ /dev/cuse
random: entropy device infrastructure driver
random: selecting highest priority adaptor
kbd1 at kbdmux0
module_register_init: MOD_LOAD (vesa, 0xc0fb0310, 0) error 19
random: live provider: "Intel Secure Key RNG"
random: SOFT: yarrow init()
random: selecting highest priority adaptor
vtvga0: on motherboard
acpi0: on motherboard
acpi0: Power Button (fixed)
hpet0: iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 950
Event timer "HPET" frequency 14318180 Hz quality 550
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
Event timer "HPET4" frequency 14318180 Hz quality 440
Event timer "HPET5" frequency 14318180 Hz quality 440
Event timer "HPET6" frequency 14318180 Hz quality 440
cpu0: on acpi0
cpu1: on acpi0
atrtc0: port 0x70-0x77 on acpi0
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: port 0x40-0x43,0x50-0x53 irq 0 on acpi0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x1008-0x100b on acpi0
acpi_ec0: port 0x62,0x66 on acpi0
acpi_lid0: on acpi0
acpi_button0: on acpi0
acpi_button1: irq 37 on acpi0
acpi_button2: irq 38 on acpi0
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
vgapci0: port 0x1800-0x183f mem 0xe0000000-0xe03fffff,0xd0000000-0xdfffffff at device 2.0 on pci0
vgapci0: Boot video device
hdac0: mem 0xe0510000-0xe0513fff at device 3.0 on pci0
xhci0: mem 0xe0500000-0xe050ffff at device 20.0 on pci0
xhci0: 32 byte context size.
xhci0: Port routing mask set to 0xffffffff
usbus0 on xhci0
pci0: at device 21.0 (no driver attached)
ig4iic0: mem 0xe051a000-0xe051afff,0xe051b000-0xe051bfff at device 21.1 on pci0
ig4iic0: Using MSI
type 44570140 params 001f1fee general 55000000 (updated 55000004) swltr 00000800 autoltr 00000800 version 3131352a SS_SCL_HCNT=00000190 LCNT=000001d6 FS_SCL_HCNT=0000003c LCNT=00000082 HOLD 00000001
ig4iic1: mem 0xe051c000-0xe051cfff,0xe051d000-0xe051dfff at device 21.2 on pci0
ig4iic1: Using MSI
type 44570140 params 001f1fee general 55000000 (updated 55000004) swltr 00000800 autoltr 00000800 version 3131352a SS_SCL_HCNT=00000190 LCNT=000001d6 FS_SCL_HCNT=0000003c LCNT=00000082 HOLD 00000001
hdac1: mem 0xe0514000-0xe0517fff at device 27.0 on pci0
pcib1: at device 28.0 on pci0
pci1: on pcib1
ath0: mem 0xe0400000-0xe047ffff at device 0.0 on pci1
ar9300_attach: calling ar9300_hw_attach
ar9300_hw_attach: calling ar9300_eeprom_attach
ar9300_flash_map: unimplemented for now
Restoring Cal data from DRAM
Restoring Cal data from EEPROM
Restoring Cal data from Flash
Restoring Cal data from Flash
Restoring Cal data from OTP
ar9300_hw_attach: ar9300_eeprom_attach returned 0
ath0: [HT] enabling HT modes
ath0: [HT] enabling short-GI in 20MHz mode
ath0: [HT] 1 stream STBC receive enabled
ath0: [HT] 1 stream STBC transmit enabled
ath0: [HT] 2 RX streams; 2 TX streams
ath0: AR9460 mac 640.2 RF5110 phy 1924.13
ath0: 2GHz radio: 0x0000; 5GHz radio: 0x0000
ehci0: mem 0xe051f800-0xe051fbff at device 29.0 on pci0
usbus1: EHCI version 1.0
usbus1 on ehci0
isab0: at device 31.0 on pci0
isa0: on isab0
ahci0: port 0x1860-0x1867,0x1870-0x1873,0x1868-0x186f,0x1874-0x1877,0x1840-0x185f mem 0xe051f000-0xe051f7ff irq 22 at device 31.2 on pci0
ahci0: AHCI v1.30 with 2 6Gbps ports, Port Multiplier not supported
ahcich0: at channel 0 on ahci0
acpi_tz0: on acpi0
acpi_acad0: on acpi0
battery0: on acpi0
atkbdc0: port 0x60,0x64 irq 1 on acpi0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
pmtimer0 on isa0
ata0: at port 0x1f0-0x1f7,0x3f6 irq 14 on isa0
ata1: at port 0x170-0x177,0x376 irq 15 on isa0
ppc0: parallel port not found.
coretemp0: on cpu0
est0: on cpu0
coretemp1: on cpu1
est1: on cpu1
Timecounters tick every 1.000 msec
IP Filter: v5.1.2 initialized. Default = pass all, Logging = enabled
hdacc0: at cad 0 on hdac0
hdaa0: at nid 1 on hdacc0
pcm0: at nid 3 on hdaa0
smbus0: on ig4iic0
usbus0: 5.0Gbps Super Speed USB v3.0
usbus1: 480Mbps High Speed USB v2.0
ugen0.1: <0x8086> at usbus0
uhub0: <0x8086 XHCI root HUB, class 9/0, rev 3.00/1.00, addr 1> on usbus0
ugen1.1: at usbus1
uhub1: on usbus1
uhub0: 13 ports with 13 removable, self powered
uhub1: 2 ports with 2 removable, self powered
smbus0: Probed address 0x67 No address ptr set, parent smbus No address ptr set
isl_probe called on unknown I2C device: 103
ugen0.2: at usbus0
ugen1.2: at usbus1
uhub2: on usbus1
uhub2: 8 ports with 8 removable, self powered
cyapa0: on smbus0
cyapa0: cyapa init status 8f
cyapa0: CYTRA-103006-00 buttons=LM- res=870x470
smbus1: on ig4iic1
smbus1: Probed address 0x44 No address ptr set, parent smbus No address ptr set
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
cyapa_probe called on unknown I2C device: 68
random: unblocking device.
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
isl0: on smbus1
isl0: Sending command 32
isl0: Sending command 64
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: ATA-9 SATA 3.x device
ada0: Serial Number B862500493
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 1024bytes)
ada0: Command Queueing enabled
ada0: 122104MB (250069680 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad4
isl0: Sending command 96
hdacc1: at cad 0 on hdac1
hdaa1: at nid 1 on hdacc1
pcm1: at nid 20,33 and 26,25 on hdaa1
SMP: AP CPU #1 Launched!
Timecounter "TSC" frequency 1396798064 Hz quality 1000
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
Root mount waiting for: usbus0
Root mount waiting for: usbus0
Root mount waiting for: usbus0
usbd_setup_device_desc: getting device descriptor at addr 2 failed, USB_ERR_TIMEOUT
ugen0.3: at usbus0 (disconnected)
uhub_reattach_port: could not allocate new device
Trying to mount root from ufs:/dev/ada0p2 [rw,noatime]...
wlan0: Ethernet address: 80:56:f2:83:c1:17
wlan0: link state changed to UP
info: [drm] Initialized drm 1.1.0 20060810
MCA: Bank 0, Status 0x90000040000f0005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 2
MCA: CPU 1 COR (1) internal parity error
MCA: Bank 0, Status 0x90000040000f0005
MCA: Global Cap 0x0000000000000c07, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x40651, APIC ID 2
MCA: CPU 1 COR (1) internal parity error

-- 
Matthias Apitz, guru@unixarea.de, http://www.unixarea.de/ +49-170-4527211
1989-2014: The Wall was torn down so that we go to war together again.
El Muro ha sido derribado para que nos unimos en ir a la guerra otra vez.
Diese Grenze wurde aufgehoben damit wir gemeinsam wieder in den Krieg ziehen.
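
A note on the two model numbers printed by mcelog above ("Model 45" on
STDERR, "Model 69" in the report): both look consistent with the
Id=0x40651 / Model=0x45 reported in dmesg, since 0x45 is 69 in decimal.
The sketch below shows how family, model and stepping fall out of the
CPUID signature, assuming the standard CPUID leaf 1 EAX layout from the
Intel SDM; the function name is made up for this example.

```python
# Sketch only: split a CPUID leaf 1 EAX signature into family/model/stepping.
# Field layout per the Intel SDM; the function name is illustrative.

def decode_cpuid_signature(sig):
    """Return (family, model, stepping) for a CPUID signature value."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model is only used for base families 0x6 and 0xF.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


family, model, stepping = decode_cpuid_signature(0x40651)
print(f"family {family:#x}, model {model:#x} ({model} decimal), "
      f"stepping {stepping}")
```

The same function applied to the Id = 0x10677 from the Core 2 Quad
example in the forwarded message gives family 0x6, model 0x17 and
stepping 7, as shown in that dmesg excerpt.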