From: John Baldwin
To: freebsd-hardware@freebsd.org
Cc: "Pokala, Ravi"
Subject: Re: ECC support
Date: Wed, 11 Nov 2015 15:28:58 -0800
Message-ID: <1678090.72K5KqGPGp@ralph.baldwin.cx>
In-Reply-To: <1917A1AA-B9AB-4612-A4E3-18FF4C909FC3@panasas.com>
List-Id: General discussion of FreeBSD hardware

On Friday, October 23, 2015 03:22:54 PM Pokala, Ravi wrote:
> -----Original Message-----
>
> >Date: Thu, 22 Oct 2015 11:09:50 -0700
> >From: John Baldwin
> >To: freebsd-hardware@freebsd.org
> >Cc: Dieter BSD, freebsd-hackers@freebsd.org
> >Subject: Re: ECC support
> >Message-ID: <1492434.22kxSKhHEJ@ralph.baldwin.cx>
> >Content-Type: text/plain; charset="us-ascii"
> >
> >The problem is that there are other fields to decode and you can only fit so much in one line.
>
> At Panasas, we did in-kernel parsing and got it down to a one-liner like this:
>
>     Detected HW Err (CMC) - Correctable ECC error Channel:0; Dimm:0; Syndrome:2151686160
>
> But that was only for main-memory corrected ECCs; for all other MCAs, it was a multi-line format (which I think we got from backporting MCA support from (8-STABLE?)):
>
>     MCA: Bank 8, Status 0xb20000000004008f
>     MCA: Global Cap 0x0000000000001c09, Status 0x0000000000000004
>     MCA: Vendor "GenuineIntel", ID 0x106e4, APIC ID 0
>     MCA: CPU 0 UNCOR PCC GEN channel ?? memory error

Yeah, that's the generic MCA stuff in stock FreeBSD.

> >Also, there is not a CPU-independent way to know the address of an ECC error. On Intel Core i3/5/7 (anything with QPI) you can identify the individual DIMM at least, but the label that the motherboard manufacturer uses varies by manufacturer. (You can maybe scrape that text from the SMBIOS tables,
>
> That's exactly what we did when using off-the-shelf motherboards. We were able to extract the name of the DIMM slot, as defined in SMBIOS, as well as the part and serial numbers of the DIMM, and the physical address range of the DIMM.
> For example:
>
>     hw.mem.dimm.s: locator  serial#   part#               bank                       size     addr0          addrN
>     hw.mem.dimm.0: DIMM_A1  DC917AEF  36KDZS2G72PZ-1G4D1  [NODE 0 CHANNEL 0 DIMM 0]  16384MB  0x00000000000  0x003FFFFFFFF
>     hw.mem.dimm.1: DIMM_B1  DDA0C793  36KDZS2G72PZ-1G4D1  [NODE 0 CHANNEL 1 DIMM 0]  16384MB  0x00400000000  0x007FFFFFFFF
>     hw.mem.dimm.2: DIMM_C1  DDA0C7B6  36KDZS2G72PZ-1G4D1  [NODE 0 CHANNEL 2 DIMM 0]  16384MB  0x00800000000  0x00BFFFFFFFF
>     hw.mem.dimm.3: DIMM_D1  DDA0C7DE  36KDZS2G72PZ-1G4D1  [NODE 0 CHANNEL 3 DIMM 0]  16384MB  0x00C00000000  0x00FFFFFFFFF
>
> Re-whacking that code for -CURRENT and getting it upstream has been on my to-do list for a depressingly long time; it keeps getting pre-empted. :-S
>
> >but only if they aren't wrong which they sometimes are, and good luck knowing if they are wrong or right.)
>
> Making sure the SMBIOS identifier matches the label on the motherboard is part of the process of validating the motherboard as usable by us. :-)

That might be sufficient for DIMMs. My main hangup with SMBIOS was trying to use the table to decode PCI slot info. I have another git branch that tries to label PCI devices in a physical slot with the slot number from either $PIR or SMBIOS, as well as an alternate view that lists the physical slots in the chassis and what devices are in them. However, when I was playing with this on X8-X9 Supermicro boards, most of them had SMBIOS tables that were completely wrong. Most of them had mostly correct $PIR tables, but SMBIOS was all over the map.

https://github.com/freebsd/freebsd/compare/master...bsdjhb:pciconf_slot_smbios

> >Digital UNIX had the luxury of running on hardware built by the same company, not on a random assortment of boards built by various vendors. FreeBSD does not.
>
> Yeah. Like I said, we scraped SMBIOS *for off-the-shelf motherboards*. For our in-house designs, we hardcoded the Channel/DIMM mapping into an unambiguous form inside the driver itself.
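
A minimal sketch, not the Panasas driver code (which does not appear in this thread): the hw.mem.dimm.* table quoted above can be treated as a lookup from [NODE, CHANNEL, DIMM], or from a physical address, to the slot label. The names Dimm, DIMMS, dimm_by_position, and dimm_by_address are hypothetical. The address lookup assumes each DIMM covers one contiguous range, as that table shows; a system that interleaves memory across channels would need a different mapping.

```python
from typing import NamedTuple, Optional


class Dimm(NamedTuple):
    locator: str  # slot label as reported by SMBIOS
    node: int
    channel: int
    dimm: int
    addr0: int    # first physical address covered
    addrN: int    # last physical address covered (inclusive)


# Rows copied from the hw.mem.dimm.* output quoted in this message.
DIMMS = [
    Dimm("DIMM_A1", 0, 0, 0, 0x00000000000, 0x003FFFFFFFF),
    Dimm("DIMM_B1", 0, 1, 0, 0x00400000000, 0x007FFFFFFFF),
    Dimm("DIMM_C1", 0, 2, 0, 0x00800000000, 0x00BFFFFFFFF),
    Dimm("DIMM_D1", 0, 3, 0, 0x00C00000000, 0x00FFFFFFFFF),
]


def dimm_by_position(node: int, channel: int, dimm: int) -> Optional[Dimm]:
    """Find the entry for a given node/channel/DIMM triple, if any."""
    for d in DIMMS:
        if (d.node, d.channel, d.dimm) == (node, channel, dimm):
            return d
    return None


def dimm_by_address(paddr: int) -> Optional[Dimm]:
    """Find the entry whose [addr0, addrN] range contains paddr, if any."""
    for d in DIMMS:
        if d.addr0 <= paddr <= d.addrN:
            return d
    return None
```

With the table above, the "Channel:0; Dimm:0" pair from the one-line report would resolve to the DIMM_A1 label on node 0, which is only as trustworthy as the SMBIOS locator string it came from.
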
>
> >sysutils/mcelog does some more verbose decoding of MCA records, but I find it to be equally gibberish for anyone not intimately familiar with a specific CPU.
> >
> >I wrote a tool for a previous employer that was able to do some simple parsing of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give a short summary that was used in a nagios check. However, it only handles a narrow set of systems.
> >
> >https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc
>
> Oooo, that looks nice! Is this something that can be committed to the main tree? If nothing else, I'll need to make a note of the way you're getting the MCA records into userland.

I think it might be a starting place for something that could go into the tree, sure. Perhaps we could augment the DIMM lookup code to parse the SMBIOS table instead of the Supermicro-specific formatting it has now?

-- 
John Baldwin
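
For readers following the quoted "MCA:" lines, here is a rough sketch of the kind of short summary discussed above, written against the architectural IA32_MCi_STATUS bit layout described in Intel's SDM. It is not the code in the linked branch, and the function names decode_status and describe_memory_error are made up for illustration.

```python
from typing import Optional

# Architectural flag bits in IA32_MCi_STATUS.
MC_STATUS_VAL = 1 << 63    # record is valid
MC_STATUS_OVER = 1 << 62   # an earlier error was overwritten
MC_STATUS_UC = 1 << 61     # error was not corrected
MC_STATUS_EN = 1 << 60     # error reporting was enabled
MC_STATUS_MISCV = 1 << 59  # the MISC register holds valid data
MC_STATUS_ADDRV = 1 << 58  # the ADDR register holds valid data
MC_STATUS_PCC = 1 << 57    # processor context may be corrupt


def decode_status(status: int) -> dict:
    """Split a 64-bit MCA status value into its flags and error codes."""
    return {
        "valid": bool(status & MC_STATUS_VAL),
        "overflow": bool(status & MC_STATUS_OVER),
        "uncorrected": bool(status & MC_STATUS_UC),
        "enabled": bool(status & MC_STATUS_EN),
        "misc_valid": bool(status & MC_STATUS_MISCV),
        "addr_valid": bool(status & MC_STATUS_ADDRV),
        "pcc": bool(status & MC_STATUS_PCC),
        "mca_code": status & 0xFFFF,             # bits 15:0
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16
    }


def describe_memory_error(mca_code: int) -> Optional[str]:
    """Describe a memory-controller error code (000F 0000 1MMM CCCC).

    Returns None for any other class of error code. Bit 12 (F) is a
    filtering flag and is ignored here.
    """
    if (mca_code & 0xEF80) != 0x0080:
        return None
    kinds = {0: "GEN", 1: "RD", 2: "WR", 3: "AC", 4: "MS"}
    kind = kinds.get((mca_code >> 4) & 0x7, "???")
    channel = mca_code & 0xF
    # A channel value of 0xF means the channel was not specified.
    chan = "??" if channel == 0xF else str(channel)
    return f"{kind} channel {chan} memory error"
```

For the status value quoted earlier, 0xb20000000004008f, decode_status reports a valid, enabled, uncorrected error with PCC set and MCA code 0x008f, and describe_memory_error(0x008f) yields "GEN channel ?? memory error", consistent with the last line of the quoted MCA output. The status word alone does not name a DIMM; that still depends on CPU-specific decoding and on the slot labels.
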