From: John Baldwin
To: freebsd-hardware@freebsd.org
Cc: Dieter BSD, freebsd-hackers@freebsd.org
Subject: Re: ECC support
Date: Thu, 22 Oct 2015 11:09:50 -0700
Message-ID: <1492434.22kxSKhHEJ@ralph.baldwin.cx>

On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
> Chris:
> > MCA: Bank 1, Status 0x9400000000000151
> > MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> > MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
> >
> > MCA: Address 0x81cc0e9f0
> >
> > Kind of freaky. I've never had this error on this board before.
> > On others tho.
> >
> > Try a search for MCA instead.
>
> Is there a decoder ring for those messages? I don't recall seeing
> messages like that, although I wasn't looking for them, and they
> don't leap out at you screaming ERROR! ERROR! Digital Unix had its
> problems, but at least the error messages were fairly clear.
> Something like "single bit memory error at address 0x12345..."
> A simple edit to sys/x86/x86/mca.c
> s/printf("UNCOR ");/printf("Uncorrectable ");/
> s/printf("COR ");/printf("Correctable ");/
> would make the messages at least slightly more meaningful to a viewer
> who isn't intimently(sp) familiar with the mca. Which most people aren't.

The problem is that there are other fields to decode and you can only fit
so much in one line. Also, there is not a CPU-independent way to know the
address of an ECC error. On Intel Core i3/5/7 (anything with QPI) you can
identify the individual DIMM at least, but the label that the motherboard
manufacturer uses varies by manufacturer. (You can maybe scrape that text
from the SMBIOS tables, but only if they aren't wrong which they sometimes
are, and good luck knowing if they are wrong or right.)

Digital UNIX had the luxury of running on hardware built by the same
company, not on a random assortment of boards built by various vendors.
FreeBSD does not.

sysutils/mcelog does some more verbose decoding of MCA records, but I find
it to be equally gibberish for anyone not intimately familiar with a
specific CPU.

I wrote a tool for a previous employer that was able to do some simple
parsing of MCA errors for Supermicro X7-X10 boards (Intel CPUs) and give a
short summary that was used in a nagios check. However, it only handles a
narrow set of systems.

https://github.com/freebsd/freebsd/compare/master...bsdjhb:ecc

-- 
John Baldwin
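
As an illustration of the "other fields to decode" point, here is a minimal
sketch in Python (not the code in sys/x86/x86/mca.c, and not the tool linked
above) that pulls apart only the architectural flag bits of an MCi_STATUS
value such as the one quoted in this thread. The function names
decode_mc_status and summarize are hypothetical. It assumes the bit layout
shared by Intel and AMD (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58,
PCC 57, MCA error code in bits 15:0) and makes no attempt at the
CPU-specific decoding or DIMM identification discussed above.

```python
# Architectural flag bits of an x86 MCi_STATUS register.
# Assumption: only the bits common to Intel and AMD are handled here.
MC_STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error record
    "OVER": 62,   # an earlier error record was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds additional information
    "ADDRV": 58,  # MCi_ADDR holds an address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mc_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes."""
    flags = {
        name: bool((status >> bit) & 1)
        for name, bit in MC_STATUS_FLAGS.items()
    }
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def summarize(bank, status):
    """Build a one-line, human-readable summary (hypothetical format)."""
    decoded = decode_mc_status(status)
    flags = decoded["flags"]
    if not flags["VAL"]:
        return "bank %d: no valid error logged" % bank
    parts = ["Uncorrectable error" if flags["UC"] else "Correctable error"]
    parts.append("MCA code 0x%04x" % decoded["mca_error_code"])
    if flags["ADDRV"]:
        parts.append("address valid")
    if flags["OVER"]:
        parts.append("earlier error overwritten")
    if flags["PCC"]:
        parts.append("processor context corrupt")
    return "bank %d: %s" % (bank, ", ".join(parts))


if __name__ == "__main__":
    # Status value taken from the quoted report.
    print(summarize(1, 0x9400000000000151))
```

Even this much only says whether the error was corrected and whether an
address was latched. What the 16-bit MCA code and the model-specific bits
mean still depends on the CPU vendor and family.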