From: Bob Bishop
To: John Baldwin
Cc: freebsd-hackers@freebsd.org, Dieter BSD, freebsd-hardware@freebsd.org
Subject: Re: ECC support
Date: Fri, 23 Oct 2015 12:37:31 +0100
Message-Id: <97482413-D2AA-4C32-AEFF-EB65D5D8542B@gid.co.uk>
In-Reply-To: <1483396.WZc3qgD2yz@ralph.baldwin.cx>

Hi,

> On 22 Oct 2015, at 22:17, John Baldwin wrote:
>
> On Thursday, October 22, 2015 07:49:13 PM Bob Bishop wrote:
>> HI,
>>
>>> On 22 Oct 2015, at 19:09, John Baldwin wrote:
>>>
>>> On Wednesday, September 16, 2015 10:56:52 AM Dieter BSD wrote:
>>>> Chris:
>>>>> MCA: Bank 1, Status 0x9400000000000151
>>>>> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
>>>>> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
>>>>>
>>>>> MCA: Address 0x81cc0e9f0
>>>>>
>>>>> Kind of freaky. I've never had this error on this board before.
>>>>> On others tho.
>>>>>
>>>>> Try a search for MCA instead.
>>>>
>>>> Is there a decoder ring for those messages? I don't recall seeing
>>>> messages like that, although I wasn't looking for them, and they
>>>> don't leap out at you screaming ERROR! ERROR! Digital Unix had its
>>>> problems, but at least the error messages were fairly clear.
>>>> Something like "single bit memory error at address 0x12345..."
>>>> A simple edit to sys/x86/x86/mca.c
>>>> s/printf("UNCOR ");/printf("Uncorrectable ");/
>>>> s/printf("COR ");/printf("Correctable ");/
>>>> would make the messages at least slightly more meaningful to a viewer
>>>> who isn't intimently(sp) familiar with the mca. Which most people aren't.
>>>
>>> The problem is that there are other fields to decode and you can only fit so
>>> much in one line. Also, there is not a CPU-independent way to know the
>>> address of an ECC error. [etc]
>>
>> On server-class hardware, the platform management (BMC or whatever) is
>> probably decoding this stuff for event logs and can be interrogated via
>> IPMI (or whatever).
>
> Not always well and not always with side effects you want. On Core 2 and
> Nehalem i7 class hardware I measured that it took on the order of 400
> milliseconds (not micro) in SMM (system management mode, so your entire
> OS is halted) to write out each log entry to NVRAM. At least one place I
> worked at turned the BIOS ECC logging off because that delay was too costly.
>
> Also, even though your BMC may log it, the format for doing so isn't
> standard. The details such as the affected DIMM are in the OEM bits of
> the log record, so not something you can easily extract from, say,
> ipmitool sel elist. You'd have to log into the BIOS itself (or the BMC's
> web UI) to see which DIMM is affected. Neither of those are really great
> for automated reporting.

All agreed. I was just flagging up the existence of another possible
channel to get at ECC logging.

> --
> John Baldwin

--
Bob Bishop
rb@gid.co.uk
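
As a rough "decoder ring" for the status word quoted in the thread, the sketch below pulls the top flag bits out of an x86 MCi_STATUS value and reports whether the error was corrected. It is an illustration, not the code in sys/x86/x86/mca.c: the function and table names are hypothetical, only the common flag bits (63 down to 57) are decoded, and the low 16-bit error code is extracted but deliberately left uninterpreted, since its meaning depends on the CPU vendor and model.

```python
# Sketch only: decode the common flag bits of an x86 MCi_STATUS value.
# Bit positions follow the Intel/AMD machine-check architecture.
# All names below are hypothetical; they are not taken from mca.c.

MC_STATUS_FLAGS = [
    (63, "VAL", "register contents are valid"),
    (62, "OVER", "an earlier error was overwritten"),
    (61, "UC", "error was not corrected by hardware"),
    (60, "EN", "reporting was enabled for this error"),
    (59, "MISCV", "MISC register holds extra information"),
    (58, "ADDRV", "ADDR register holds a valid address"),
    (57, "PCC", "processor context may be corrupt"),
]


def decode_mc_status(status):
    """Return the set flag names, a corrected/uncorrected verdict and the
    low 16-bit error code (left raw, as it is vendor and model specific)."""
    flags = [name for bit, name, _ in MC_STATUS_FLAGS if (status >> bit) & 1]
    return {
        "flags": flags,
        # Only meaningful when VAL is set.
        "corrected": "UC" not in flags,
        "error_code": status & 0xFFFF,
    }


def describe(status):
    """Build a one-line, human-readable summary of a status value."""
    info = decode_mc_status(status)
    kind = "Correctable" if info["corrected"] else "Uncorrectable"
    return "%s error, flags %s, error code 0x%04x" % (
        kind, "+".join(info["flags"]), info["error_code"])


if __name__ == "__main__":
    # Status value from the "MCA: Bank 1" line quoted in this thread.
    print(describe(0x9400000000000151))
```

For the Bank 1 status above, this reports a correctable error with VAL, EN and ADDRV set, which matches the separate "MCA: Address" line in the log. Turning the error code or the address into a specific DIMM still needs CPU- and board-specific knowledge, as discussed above.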