From owner-freebsd-hackers@freebsd.org  Thu Sep 17 05:25:05 2015
Message-Id: <201509170525.t8H5OuPj031505@gw.catspoiler.org>
Date: Wed, 16 Sep 2015 22:24:56 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
Subject: Re: ECC support
To: dieterbsd@gmail.com
cc: freebsd-hardware@freebsd.org, freebsd-hackers@freebsd.org
In-Reply-To: <CAA3ZYrDjTNM7AShdpFOjT-3wZnEV2u-2X6MnLksON61bw7=XiQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii

On 16 Sep, Dieter BSD wrote:
> Andriy:
>>> Assuming that a board does have the necessary connections but
>>> the firmware does not have ECC support, is there some reason that
>>> ECC support could not be added to the OS instead of the firmware?
>>
>> Yes, there is.  The memory controller is programmed by the code that
>> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
>> Once the real RAM gets used it's too late to reprogram the DRAM controller.
> 
> Perhaps one of the several bootloader stages could get itself into
> CPU cache, program the memory controller, then load and execute the
> next stage or the OS?
> 
> Jim:
>> Replacing the data in memory would require processing overhead
>> that could accumulate and significantly diminish system performance.
> 
> If it only replaces data when there is a correctable error,
> and the errors are occasional soft errors, the effect on
> performance should be minimal.  If there is a hard error,
> you would want to replace the defective memory before you get
> an additional error and it becomes uncorrectable.
> 
>> If the error occurred because of random events and isn't a defect in
>> the memory, the memory address will be cleaned of the error when the
>> data is overwritten with other data.
> 
> If and when new data gets written to that location.  If that location
> contains info that never changes, such as kernel text, the bad bit will
> never get fixed.
> 
>> memory, without the extra complexity of the controller, is 12.5% more
>> expensive.   This isn't a huge impact at 8GB, (you'll need
>> another 1GB of RAM), but at 1024GB you'll need another 128GB,
>> and that much ram still costs enough that your wallet won't be happy.
> 
> It is 12.5% in both cases.  How much does it cost to have undetected
> errors in your data?  How much does it cost when an Interstate
> bridge collapses?  How much does it cost when one of NASA's missions
> fails?  How much does it cost when your pharmacy receives a
> prescription with an error in the dose?
> 
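
For reference, the 12.5% figure falls out of the usual 72-bit ECC DIMM
layout: 64 data bits plus 8 check bits.  The sketch below assumes a
plain SECDED code (Hamming plus one overall parity bit); the function
names are made up for illustration and are not from any library.

```python
def hamming_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """One extra overall parity bit adds double-error detection."""
    return hamming_check_bits(data_bits) + 1


def ecc_overhead(data_bits=64):
    """Check bits as a fraction of data bits."""
    return secded_check_bits(data_bits) / data_bits


def extra_capacity_gb(data_gb, data_bits=64):
    """Extra DRAM needed to protect data_gb of memory."""
    return data_gb * ecc_overhead(data_bits)
```

With 64-bit words this gives 8 check bits, so extra_capacity_gb(8) is
1.0 and extra_capacity_gb(1024) is 128.0.  The ratio is the same at
every size; only the absolute cost grows.
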
>> the MRC setup on Intel and AMD is both complex and proprietary
> 
> One wonders why the secrecy.  AMD has been much more open than many
> (most?) chipmakers.  They even forced the ATI people to document
> how to program their chips.  I don't see a lot of companies popping up
> making competing chips.  #include standard joke: "How do you make a small
> fortune in chipmaking?  Start with a very large fortune."  I can't
> see what secret would be revealed by saying "set bit 7 of register 4
> to 1 to enable ECC".

AMD documents a lot of this stuff in the BIOS and Kernel Developer's
Guide (BKDG) for each CPU family.

>> Intel Red Book
> 
> So the secret books are red this week, yawn.  I remember the nightmare
> of the merced orange books and the brain damaged "features" the chips had.
> Not recommended.  I'm interested in chips that work correctly, hence the
> interest in ECC and AMD.  Looked for ARM boards with ECC but didn't find
> any.  Is the Sparc stuff any more reliable than it used to be?  Other
> arch choices?

Supermicro has some Atom motherboards with ECC support.

>> The MRC setup code is a binary blob for otherwise open source boot
>> firmware such as Coreboot.
> 
> So the libreboot people are forced to work on reverse engineering
> these blobs?  :-(
> 
> Don:
>> I don't think the current APU parts support ECC.
> 
> According to wikipedia, socket FM2+ does not support ECC. :-(
> Kabini has support for ECC.  And Berlin, (and I assume Toronto) but
> word is that Berlin and Toronto are basically dead. :-(
> I think Carrizo and Turion are supposed to support ECC?  There really
> ought to be a list of which CPUs/APUs/sockets/boards do or do not
> support ECC.

Socket AM1 (Kabini) is supposed to support ECC, but finding
motherboards with this socket that actually support ECC is another
story.

>> My experience is that many ASUS motherboards support ECC RAM and
>> usually document that fact.  Many Gigabyte motherboards also
>> support ECC RAM, but don't document it.
> 
> From what I've been reading, both Asus and Gigabyte make good boards.
> I've seen reviews that complained about Gigabyte's firmware.
> http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
> I've also seen claims that the firmware bricked boards.
> Reviewers like Asus' firmware.  I've seen complaints about Asus's support,
> and their website has significant problems.

I've got one of the Gigabyte GA-990FXA-UD5 boards.  I actually like the
BIOS.  I'm not trying to overclock, but it does have lots of ECC-related
knobs.  I think you can even tell it to gang the two memory controller
channels so that you can enable Chipkill.  The latter isn't as good as
it sounds because it really only works properly with DIMMs that use x4
DRAM chips, and there don't seem to be any unbuffered versions of those.
The only unbuffered DDR3 DIMMs I've found use x8 DRAM chips.  In that
case, if multiple bits coming out of one chip are incorrect, the ECC
checker has a just under 100% chance of detecting the error, but it is
still uncorrectable.  With x4 DRAM chips, Chipkill can correct the error
even if all four bits from the DRAM are incorrect.  Unfortunately, the
only DDR3 DIMMs that use x4 chips are registered.  Also, ganging the
memory controllers does hurt performance.
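
A rough way to picture the x4 versus x8 difference, as a sketch only:
assume the ganged-mode code works on 4-bit symbols and can correct one
bad symbol per code word, and that each DRAM chip's data lines line up
with symbol boundaries.  Those are assumptions made for illustration,
not something taken from a datasheet, and the names are invented.

```python
SYMBOL_BITS = 4          # assumed symbol size of the ganged-mode code
CORRECTABLE_SYMBOLS = 1  # assumed: one bad symbol per code word is fixable


def symbols_hit(chip_width_bits):
    """Number of symbols a single failing chip can corrupt (rounded up)."""
    return -(-chip_width_bits // SYMBOL_BITS)


def chip_failure_correctable(chip_width_bits):
    """True if a whole-chip failure stays within what the code can correct."""
    return symbols_hit(chip_width_bits) <= CORRECTABLE_SYMBOLS
```

Under those assumptions an x4 chip touches one symbol, so a failure of
the whole chip is correctable, while an x8 chip touches two symbols and
a multi-bit failure can only be detected.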

The things that I don't like about this board are the SATA connector
placement (though it wasn't too bad in my specific application), and the
combined keyboard/mouse PS/2 connector.  I'm still using a PS/2 KVM
switch here and I need motherboards with separate keyboard and mouse
connectors, and the Y-adaptors don't seem to work.  I'd love to upgrade
to a newer KVM, but I'd want to also switch from VGA to DVI and KVMs
that handle more than two dual-link DVI inputs are serious $$$.

My newest motherboard is an Asus M5A97 R2.0.  I bought it because it was
inexpensive, had sufficient expansion potential, and had separate
keyboard and mouse PS/2 connectors.  I don't like the BIOS nearly as
much.  It's got lots of whizzy graphics, but it's hard to find where the
various knobs are hidden.  As I recall, ECC control is basically on/off.
I also wasn't able to get WOL to work.  If I power off the machine with
shutdown -p, the LAN link light stays on, but sending a WOL packet
doesn't start the machine.  It might wake from sleep mode, but I didn't
try that.
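
For anyone who wants to rule out the sending side when debugging WOL,
a magic packet is simple enough to build by hand: six 0xFF bytes
followed by the target MAC address repeated 16 times.  The sketch below
sends it as a UDP broadcast to port 9, which is a common convention
rather than a requirement.  The function names and the example MAC
address are hypothetical.

```python
import socket


def build_magic_packet(mac):
    """Return the 102-byte WOL payload for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("MAC address must have 12 hex digits")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet on the local network over UDP."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling send_magic_packet("00:11:22:33:44:55") from another machine on
the same LAN segment would send one packet.  If the board still doesn't
start, the problem is more likely in the BIOS or NIC settings than in
the sender.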

> The firmware on my Tyan board is crap, and they refused to tell me
> how much power it needs.  Which means I don't know how much other stuff
> I can run from the same P/S.  It should have *way* more power than needed,
> but experience says "not enough", so I added a 2nd p/s for the disk farm
> and suddenly had fewer problems.  The 2 p/s setup does allow powercycling
> the mainboard (because of the crappy firmware) without powercycling the disks.
> 
> Given my experience with the Tyan board, and the apparent lack of
> FLOSS firmware for recent boards, I'm not real excited about the
> Gigabyte boards.  Asus has a couple of AM3+ boards that I could
> probably live with, if their website actually had things like
> lists of exactly which CPUs and memory are approved, and firmware
> updates, ... But there are also applications that could use a lower
> wattage solution.
> 
> Anyone have opinions on other mainboard companies?  ECS?  Asrock?
> MSI?  Zotac?  Others?

If you are interested in something with low power consumption, take a
look at the Supermicro C2000 series Atom boards:
<http://www.supermicro.com/products/motherboard/ATOM/>

I'm seriously considering picking up an A1SRM-LN5F-2358.  At first
glance it seems pricey, especially considering the amount of CPU grunt,
but I don't need much and I can use the extra LAN ports and possibly
IPMI, so I don't have to add the cost of a CPU, a decent aftermarket
cooler, extra NICs, or a video card.

> Don:
>> +MCA: Bank 4, Status 0x944a400096080a13
>> +MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
>> +MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
>> +MCA: CPU 0 COR BUSLG Responder RD Memory
>> +MCA: Address 0x213e98b10
>> +MCA: Bank 4, Status 0xd44a400096080a13
>> +MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
>> +MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
>> +MCA: CPU 0 COR OVER BUSLG Responder RD Memory
>> +MCA: Address 0x213e98b10
> 
> Chris:
>> MCA: Bank 1, Status 0x9400000000000151
>> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
>> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
>>
>> MCA: Address 0x81cc0e9f0
>>
>> Kind of freaky. I've never had this error on this board before.
>> On others tho.
>>
>> Try a search for MCA instead.
> 
> Is there a decoder ring for those messages?  I don't recall seeing
> messages like that, although I wasn't looking for them, and they
> don't leap out at you screaming ERROR! ERROR!  Digital Unix had its
> problems, but at least the error messages were fairly clear.
> Something like "single bit memory error at address 0x12345..."
> A simple edit to sys/x86/x86/mca.c
>    s/printf("UNCOR ");/printf("Uncorrectable ");/
>    s/printf("COR ");/printf("Correctable ");/
> would make the messages at least slightly more meaningful to a viewer
> who isn't intimately familiar with the mca.  Which most people aren't.
> I used to maintain code that dealt with a memory controller, and
> used a hardware circuit to inject errors into a memory board.
> But looking at those messages doesn't tell me anything beyond
> "Something happened, maybe I should grep through the source
> code for clues about those messages."  Looking at the source
> doesn't add much, you'd need documentation for the mca.
> Which most people aren't going to have.  And you'd need a lot
> of time to figure it out.

I think jhb@ has some software that decodes this stuff.  I'm not sure if
it is in ports.
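
In the meantime, here is a minimal decoder sketch for the status word,
based on the architectural MCi_STATUS layout (flag bits 63 to 57, MCA
error code in bits 15:0) and the compound bus-error format
0000 1PPT RRRR IILL.  It is not the FreeBSD code and not jhb@'s tool;
the names are invented, and it ignores the model-specific bits that the
BKDG describes.

```python
# Architectural flag bits in an x86 MCi_STATUS register.
FLAG_BITS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MCi_MISC holds extra information
    "ADDRV": 58,  # MCi_ADDR holds the failing address
    "PCC": 57,    # processor context may be corrupt
}

PARTICIPATION = ["Source", "Responder", "Observer", "Generic"]
REQUEST = ["ERR", "RD", "WR", "DRD", "DWR", "IRD",
           "PREFETCH", "EVICT", "SNOOP"]
MEM_OR_IO = ["Memory", "Reserved", "I/O", "Generic"]
LEVEL = ["L0", "L1", "L2", "LG"]


def status_flags(status):
    """Return the set of flag names that are set in the status word."""
    return {name for name, bit in FLAG_BITS.items() if (status >> bit) & 1}


def decode_bus_error(code):
    """Decode a 0000 1PPT RRRR IILL bus error code, or return None."""
    if (code >> 11) != 1:  # bits 15:11 must be 00001
        return None
    request = (code >> 4) & 0xF
    return " ".join([
        "BUS" + LEVEL[code & 0x3],
        PARTICIPATION[(code >> 9) & 0x3],
        REQUEST[request] if request < len(REQUEST) else "REQ?",
        MEM_OR_IO[(code >> 2) & 0x3],
    ])


def describe(status):
    """Build a one-line summary in roughly the style of the log lines."""
    flags = status_flags(status)
    if "VAL" not in flags:
        return "no valid error logged"
    parts = ["UNCOR" if "UC" in flags else "COR"]
    if "OVER" in flags:
        parts.append("OVER")
    if "PCC" in flags:
        parts.append("PCC")
    code = status & 0xFFFF
    parts.append(decode_bus_error(code) or "error code 0x%04x" % code)
    return " ".join(parts)
```

Fed the two Bank 4 values quoted above, describe(0x944a400096080a13)
gives "COR BUSLG Responder RD Memory" and describe(0xd44a400096080a13)
gives "COR OVER BUSLG Responder RD Memory", which agrees with what the
kernel printed.  The Bank 1 value is not a bus error, so this sketch
only reports it as "COR error code 0x0151".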