Date:      Tue, 15 Sep 2015 16:52:30 -0500
From:      Jim Thompson <jim@netgate.com>
To:        Dieter BSD <dieterbsd@gmail.com>
Cc:        freebsd-hardware@freebsd.org, freebsd-hackers@freebsd.org
Subject:   Re: ECC support
Message-ID:  <41EFCF21-D3B0-4EC4-8EAB-417CA33821FC@netgate.com>
In-Reply-To: <CAA3ZYrBXZn1WpHWYGJYWJDPsk7iDahCas8RhnHC4w%2Babf4w4hA@mail.gmail.com>
References:  <CAA3ZYrBXZn1WpHWYGJYWJDPsk7iDahCas8RhnHC4w%2Babf4w4hA@mail.gmail.com>


ECC is implemented by a ‘hashing’ algorithm that works on eight (8)
bytes (64 bits) at a time, and places the result into an 8-bit ECC
‘word’.
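
To make the 64-bit-in, 8-bit-out idea concrete, here is a minimal sketch of a textbook Hamming code with an extra overall parity bit (SECDED: single error correct, double error detect). This is an assumption for illustration only: real memory controllers typically use their own, often proprietary, code layouts, and the names `encode` and `decode` are made up here.

```python
# Textbook SECDED sketch: 64 data bits + 7 Hamming check bits + 1 overall
# parity bit = 72 bits. Bit layout is illustrative, not any vendor's.
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
# Positions 1..71 that are not powers of two carry the 64 data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1)]


def _syndrome(word):
    """XOR together the positions (1..71) of all set bits."""
    s = 0
    for pos in range(1, 72):
        if (word >> pos) & 1:
            s ^= pos
    return s


def encode(data):
    """Return a 72-bit codeword (as an int) for a 64-bit data word."""
    assert 0 <= data < 1 << 64
    word = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            word |= 1 << pos
    # Set the check bits so that the syndrome of the codeword is zero.
    s = _syndrome(word)
    for p in PARITY_POSITIONS:
        if s & p:
            word |= 1 << p
    # Bit 0 makes the total number of set bits even.
    if bin(word).count("1") & 1:
        word |= 1
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    s = _syndrome(word)
    odd = bin(word).count("1") & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s < 72:
        word ^= 1 << s          # single-bit error: the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"   # e.g. two flipped bits
    data = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> pos) & 1:
            data |= 1 << i
    return data, status
```

Flipping any one of the 72 bits of `encode(d)` still decodes to `d`; flipping two bits is reported as uncorrectable rather than silently miscorrected.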

Errors are corrected "on-the-fly"; corrected data is almost never
placed back in memory. If the same corrupt data is read again, the
correction process is repeated. Replacing the data in memory would
require processing overhead that could accumulate and significantly
diminish system performance. If the error occurred because of a random
event and isn't a defect in the memory, the memory address will be
cleaned of the error when the data is overwritten with other data.
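
The read path described above can be sketched with a toy model. To keep it short it uses a three-copy majority vote as a stand-in for real ECC (an assumption, not how DRAM ECC works), and the class `ToyEccMemory` and its methods are hypothetical names. The point is only the behaviour: reads return corrected data, the stored cell stays corrupt, and the next write cleans it.

```python
class ToyEccMemory:
    """Toy memory: corrects on read, never writes the correction back."""

    def __init__(self):
        self.cells = {}        # address -> three stored copies of the value
        self.corrections = 0   # number of reads that needed a correction

    def write(self, addr, value):
        # A fresh write replaces all copies, wiping out any soft error.
        self.cells[addr] = [value, value, value]

    def read(self, addr):
        a, b, c = self.cells[addr]
        voted = (a & b) | (a & c) | (b & c)   # bitwise majority vote
        if not (a == b == c):
            self.corrections += 1   # corrected in flight, cell left as-is
        return voted

    def flip_bit(self, addr, copy, bit):
        # Simulate a random single-bit upset in one stored copy.
        self.cells[addr][copy] ^= 1 << bit


mem = ToyEccMemory()
mem.write(0x10, 0xAB)
mem.flip_bit(0x10, copy=1, bit=3)
first = mem.read(0x10)    # corrected on the fly
second = mem.read(0x10)   # corrected again, since the cell is still corrupt
mem.write(0x10, 0xCD)     # overwriting the address cleans the error
```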

In terms of expense, at a minimum, where you had 8 bytes to make up a
memory system, you will now have 9 (to hold the extra 8 bits).  This
means your memory, without the extra complexity of the controller, is
12.5% more expensive.  This isn’t a huge impact at 8GB (you’ll need
another 1GB of RAM), but at 1024GB you’ll need another 128GB, and that
much RAM still costs enough that your wallet won’t be happy.
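
The arithmetic, spelled out as a small sketch (the function name `ecc_extra_gb` is made up for illustration; the 64 data bits and 8 check bits are the figures from above):

```python
def ecc_extra_gb(data_gb, data_bits=64, check_bits=8):
    """Extra RAM needed to hold the ECC bits for data_gb of usable memory."""
    return data_gb * check_bits / data_bits


overhead_pct = 100 * 8 / 64      # 12.5 percent more memory
small = ecc_extra_gb(8)          # 1.0 GB extra for an 8GB system
large = ecc_extra_gb(1024)       # 128.0 GB extra for a 1024GB system
```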

The memory controller has to be able to run the ECC algorithm on every
read, *and* supply the corrected data as needed, within the cycle time
of the read.  If you involve software in this path, the performance of
your machine will be glacial.

Yes, the firmware has to program the memory controller.  “Program a
few registers” is all you need, except that the MRC (memory reference
code) setup on Intel and AMD is both complex and proprietary.  Good
luck getting the details for this.  This is “Intel Red Book” territory,
and you’ll need to be an employee with a need to know.  The MRC setup
code is a binary blob for otherwise open source boot firmware such as
Coreboot.

Others have answered (in the positive) about the OS reporting ECC errors
on FreeBSD.

Jim

> On Sep 15, 2015, at 3:53 PM, Dieter BSD <dieterbsd@gmail.com> wrote:
>
> Many of AMD's CPU/APU parts support ECC memory.  Not just the top of the
> line parts, but also many of the less expensive, less power hungry parts.
> However, many (most?) of the boards for these chips do not support ECC,
> or at least do not admit to it.  They specify "non-ECC memory".
>
> Obviously there have to be connections between the memory controller and
> the memory for the extra bits.  Aside from a little extra time for the
> board designer to add a few traces to the wire list, this would not
> raise the cost of the board.  Despite this I have read that some boards
> lack the necessary traces.
>
> Does the firmware have to do anything to support ECC?  Program a few
> registers in the memory controller perhaps?  A few boards have FLOSS
> firmware available, so this code could be added, but most boards do not
> have firmware sources available.
>
> Assuming that a board does have the necessary connections but
> the firmware does not have ECC support, is there some reason that
> ECC support could not be added to the OS instead of the firmware?
> I grepped through FreeBSD 8.2 and 10.1 sources but couldn't find
> anything that looked relevant.  Also did not find any code that
> reported ECC errors, other than one device.  Perhaps I missed it?
>
> I've been running machines with ECC for 15-20 years and have never seen
> a report of an ECC error from either NetBSD or FreeBSD.  I have seen
> reports of ECC errors from Digital Unix.  And remember getting panics
> due to parity errors on machines before ECC.  So I'm thinking that
> the BSDs must ignore hardware reports of single bit ECC errors.  :-(
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"



