From owner-freebsd-hackers@freebsd.org Wed Sep 16 17:56:53 2015
Date: Wed, 16 Sep 2015 10:56:52 -0700
Subject: Re: ECC support
From: Dieter BSD
To: freebsd-hardware@freebsd.org, freebsd-hackers@freebsd.org
Content-Type: text/plain; charset=UTF-8
List-Id: Technical Discussions relating to FreeBSD
X-List-Received-Date: 
 Wed, 16 Sep 2015 17:56:53 -0000

Andriy:
>> Assuming that a board does have the necessary connections but
>> the firmware does not have ECC support, is there some reason that
>> ECC support could not be added to the OS instead of the firmware?
>
> Yes, there is. The memory controller is programmed by the code that
> runs from ROM and uses no RAM (or the CPU cache is used as the RAM).
> Once the real RAM gets used it's too late to reprogram the DRAM controller.

Perhaps one of the several bootloader stages could get itself into
CPU cache, program the memory controller, then load and execute the
next stage or the OS?

Jim:
> Replacing the data in memory would require processing overhead
> that could accumulate and significantly diminish system performance.

If it only replaces data when there is a correctable error, and the
errors are occasional soft errors, the effect on performance should
be minimal. If there is a hard error, you would want to replace the
defective memory before you get an additional error and it becomes
uncorrectable.

> If the error occurred because of random events and isn't a defect in
> the memory, the memory address will be cleaned of the error when the
> data is overwritten with other data.

If and when new data gets written to that location. If that location
contains info that never changes, such as kernel text, the bad bit
will never get fixed.

> memory, without the extra complexity of the controller, is 12.5% more
> expensive. This isn't a huge impact at 8GB, (you'll need
> another 1GB of RAM), but at 1024GB you'll need another 128GB,
> and that much ram still costs enough that your wallet won't be happy.

It is 12.5% in both cases.

How much does it cost to have undetected errors in your data?
How much does it cost when an Interstate bridge collapses?
How much does it cost when one of NASA's missions fails?
How much does it cost when your pharmacy receives a prescription
with an error in the dose?
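
To spell out the arithmetic behind "12.5% in both cases": the sketch
below assumes the common ECC DIMM layout of 8 check bits stored
alongside every 64 data bits (a 72-bit-wide module). The function name
ecc_extra_gb is made up for illustration. The ratio is fixed, so the
extra capacity scales with size but the percentage does not.

```python
# Assumption: side-band ECC with 8 check bits per 64 data bits (72-bit DIMM).
DATA_BITS = 64
CHECK_BITS = 8


def ecc_extra_gb(usable_gb):
    """Extra DRAM (in GB) needed to protect usable_gb of data (hypothetical helper)."""
    return usable_gb * CHECK_BITS / DATA_BITS


# The two sizes quoted above: 8GB and 1024GB of usable memory.
for usable in (8, 1024):
    extra = ecc_extra_gb(usable)
    print(f"{usable} GB usable -> {extra:g} GB extra ({extra / usable:.1%})")
```

Both sizes come out at the same relative overhead, which is the point
being made: only the absolute number of extra gigabytes grows.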
> the MRC setup on Intel and AMD is both complex and proprietary

One wonders why the secrecy. AMD has been much more open than many
(most?) chipmakers. They even forced the ATI people to document how
to program their chips. I don't see a lot of companies popping up
making competing chips. #include standard joke: "How do you make a
small fortune in chipmaking? Start with a very large fortune."
I can't see what secret would be revealed by saying "set bit 7 of
register 4 to 1 to enable ECC".

> Intel Red Book

So the secret books are red this week, yawn. I remember the
nightmare of the Merced orange books and the brain-damaged
"features" the chips had. Not recommended.

I'm interested in chips that work correctly, hence the interest in
ECC and AMD. Looked for ARM boards with ECC but didn't find any.
Is the Sparc stuff any more reliable than it used to be?
Other arch choices?

> The MRC setup code is a binary blob for otherwise open source boot
> firmware such as Coreboot.

So the libreboot people are forced to work on reverse engineering
these blobs? :-(

Don:
> I don't think the current APU parts support ECC.

According to Wikipedia, socket FM2+ does not support ECC. :-(
Kabini has support for ECC. So does Berlin (and I assume Toronto),
but word is that Berlin and Toronto are basically dead. :-(
I think Carrizo and Turion are supposed to support ECC?

There really ought to be a list of which CPUs/APUs/sockets/boards
do or do not support ECC.

> My experience is that many ASUS motherboard support ECC RAM and
> usually document that fact. Also many Gigabyte mother boards also
> support ECC RAM, but don't document it.

From what I've been reading, both Asus and Gigabyte make good boards.
I've seen reviews that complained about Gigabyte's firmware.
http://www.xbitlabs.com/articles/mainboards/display/gigabyte-ga-990fxa-ud5_8.html
I've also seen claims that the firmware bricked boards.
Reviewers like Asus' firmware.
I've seen complaints about Asus's support, and their website has
significant problems.

The firmware on my Tyan board is crap, and they refused to tell me
how much power it needs. Which means I don't know how much other
stuff I can run from the same P/S. It should have *way* more power
than needed, but experience says "not enough", so I added a 2nd p/s
for the disk farm and suddenly had fewer problems. The 2 p/s setup
does allow powercycling the mainboard (because of the crappy
firmware) without powercycling the disks.

Given my experience with the Tyan board, and the apparent lack of
FLOSS firmware for recent boards, I'm not real excited about the
Gigabyte boards. Asus has a couple of AM3+ boards that I could
probably live with, if their website actually had things like lists
of exactly which CPUs and memory are approved, and firmware
updates, ... But there are also applications that could use a lower
wattage solution.

Anyone have opinions on other mainboard companies? ECS? Asrock?
MSI? Zotac? Others?

Don:
> +MCA: Bank 4, Status 0x944a400096080a13
> +MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> +MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
> +MCA: CPU 0 COR BUSLG Responder RD Memory
> +MCA: Address 0x213e98b10
> +MCA: Bank 4, Status 0xd44a400096080a13
> +MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> +MCA: Vendor "AuthenticAMD", ID 0x100f53, APIC ID 0
> +MCA: CPU 0 COR OVER BUSLG Responder RD Memory
> +MCA: Address 0x213e98b10

Chris:
> MCA: Bank 1, Status 0x9400000000000151
> MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
> MCA: Vendor "AuthenticAMD", ID 0x100f52, APIC ID 2
>
> MCA: Address 0x81cc0e9f0
>
> Kind of freaky. I've never had this error on this board before.
> On others tho.
>
> Try a search for MCA instead.

Is there a decoder ring for those messages? I don't recall seeing
messages like that, although I wasn't looking for them, and they
don't leap out at you screaming ERROR! ERROR!
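
A partial decoder ring, as a rough sketch only: the code below pulls
the top-level flag bits out of a Status value and expands the low 16
bits when they look like a bus/interconnect error code. The bit
positions are the architectural x86 machine-check layout as commonly
documented; the function names decode_status and decode_bus_code are
invented here, and model-specific fields are not handled, so anything
beyond the flags should be checked against the processor manual.

```python
# Top-level flag bits of an x86 machine-check bank status value.
FLAGS = [
    (63, "VAL"),    # register holds a valid error
    (62, "OVER"),   # an earlier error was overwritten
    (61, "UC"),     # error was NOT corrected
    (60, "EN"),     # reporting was enabled for this error
    (59, "MISCV"),  # MISC register contents are valid
    (58, "ADDRV"),  # ADDR register contents are valid
    (57, "PCC"),    # processor context may be corrupt
]

PARTICIPATION = ("Source", "Responder", "Observer", "Generic")
REQUEST = ("ERR", "RD", "WR", "DRD", "DWR", "IRD", "PREFETCH", "EVICT", "SNOOP")
TARGET = ("Memory", "reserved", "I/O", "other")
LEVEL = ("L0", "L1", "L2", "LG")


def decode_status(status):
    """Return the set flags, a plain-English severity, and the 16-bit error code."""
    flags = [name for bit, name in FLAGS if (status >> bit) & 1]
    severity = "uncorrectable" if "UC" in flags else "correctable"
    return {"flags": flags, "severity": severity, "mca_code": status & 0xFFFF}


def decode_bus_code(code):
    """Expand a bus/interconnect code (bit pattern 0000 1PPT RRRR IILL), else None."""
    if (code & 0xF800) != 0x0800:
        return None  # some other error class (cache, TLB, ...), not decoded here
    req = (code >> 4) & 0xF
    request = REQUEST[req] if req < len(REQUEST) else "?"
    text = f"BUS{LEVEL[code & 0x3]} {PARTICIPATION[(code >> 9) & 0x3]} {request}"
    if code & 0x0100:
        text += " Timeout"
    return f"{text} {TARGET[(code >> 2) & 0x3]}"


# The three Status values quoted above.
for status in (0x944A400096080A13, 0xD44A400096080A13, 0x9400000000000151):
    info = decode_status(status)
    print(hex(status), info["severity"], info["flags"], decode_bus_code(info["mca_code"]))
```

On the two Bank 4 values this reproduces the "BUSLG Responder RD
Memory" text the kernel printed, and the second one differs only in
the OVER flag. The Bank 1 value is not a bus code, so the sketch leaves
it undecoded.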
Digital Unix had its problems, but at least the error messages were
fairly clear. Something like "single bit memory error at address
0x12345..."

A simple edit to sys/x86/x86/mca.c

    s/printf("UNCOR ");/printf("Uncorrectable ");/
    s/printf("COR ");/printf("Correctable ");/

would make the messages at least slightly more meaningful to a
viewer who isn't intimately familiar with the MCA. Which most
people aren't. I used to maintain code that dealt with a memory
controller, and used a hardware circuit to inject errors into a
memory board. But looking at those messages doesn't tell me
anything beyond "Something happened, maybe I should grep through the
source code for clues about those messages." Looking at the source
doesn't add much, you'd need documentation for the MCA. Which most
people aren't going to have. And you'd need a lot of time to figure
it out.

    # find /var/log | xargs bzgrep -i mca

found no error messages.

I seem to be buried under a mountain of boards that would be useful,
if only they supported ECC. (and had firmware that actually
works...) And I'm hardly the only one. So how do we fix this?

Lobby AMD (and other chipmakers) to include ECC support in *all*
memory controllers and sockets? It isn't like they have to redesign
the logic for every chip, they only need one design per memory width.

Lobby AMD to publish documentation on how to program the memory
controller?

Lobby the companies that make boards?