Date:      Sun, 2 Oct 2011 09:37:43 +0200
From:      Thomas Zander <thomas.e.zander@googlemail.com>
To:        Jeremy Chadwick <freebsd@jdc.parodius.com>
Cc:        freebsd-stable <freebsd-stable@freebsd.org>
Subject:   Re: Interpreting MCA error output
Message-ID:  <CAFU734xHMugfW%2BZcO93OPqUEhJshYn-k%2B%2B3aGmcDVvGZVQ=s%2BQ@mail.gmail.com>
In-Reply-To: <20111001102327.GA37434@icarus.home.lan>
References:  <CAFU734y3WsVFTpnGoGfbPH4vVBnoz8f=qGvYS4c%2BLya8PFQP_A@mail.gmail.com> <20111001102327.GA37434@icarus.home.lan>

Hello Jeremy,

First, thank you for the extensive explanation. It cleared some things
up for me. I do have some rambling to add, though :-)

On Sat, Oct 1, 2011 at 12:23, Jeremy Chadwick <freebsd@jdc.parodius.com> wrote:

> So what should you do? Replace the RAM. Which DIMM? Sadly I don't
> know how to determine that. Some system BIOSes (particularly on AMD
> systems I've used) let you do memory tests (similar to memtest86) within
> the BIOS which can then tell you which DIMM slot experienced a problem.
> If yours doesn't have that, I would have to say purchase all new RAM
> (yes, all of it) and test the individual DIMMs later so you can
> determine which is bad.

Well, I wasn't too surprised by the panic. I have read somewhere that
in these situations the kernel might simply panic since the system
might be in a compromised state. So far so ... well ... acceptable.

My question here is how I can be certain right now whether any of the
DIMMs has gone bad.
You mentioned problems you have all the time with DIMMs due to bad
cooling in data centers. My machine in question is not located in a
data center; it is my home server, which tends to have very little
load. But since it is located in my apartment, there are lots of
_potential_ problems, including stability of power. In fact this was
the first MCA event with these DIMMs ever, in more than a year.
But of course you could be right. A DIMM could be rotten. Absolutely.
Regarding your suggestion to do memory tests: My BIOS does not support
testing, so I booted up memtest86+ after reading your e-mail and let
it run for almost a whole day now. It did not encounter a single
problem.
So, even if I bought new DIMMs at once, it might take weeks to figure
out which DIMM is rotten, if at all. Assuming that MCA events stay
this infrequent, that is.
Of course I'll observe the machine closely, but if the rate stays at
one MCA event per year, it'll take some time to figure out the broken
DIMM :-)

> I should really work with John to make mcelog a FreeBSD port and just
> regularly update it with patches, etc. to work on FreeBSD. DMI support
> and so on I don't think can be added (at least not by me), but simple
> ASCII decoding? Very possible.

That would be absolutely helpful! After all, FreeBSD is primarily a
server OS, and where would one have ECC if not on servers? Being able
to determine what's wrong with memory would certainly be very valuable
for many admins.

> An alternative would be for me to make a CGI version where you could
> just go my site and paste in the FreeBSD MCE and it would siphon it
> through mcelog and give you the output.

I could live with that, too :-)

Thanks again for your extensive explanation, I appreciate it very much!
Now I am going to watch that machine closely...

Best regards,
Riggs
