Date:      Mon, 13 Mar 2000 16:30:29 -0600
From:      Jeremy McMillan <aphor@ripco.com>
To:        freebsd-hackers@FreeBSD.ORG
Subject:   Re: Detecting ECC errors
Message-ID:  <38CD6C05.69C086FE@ripco.com>
References:  <200003121934.LAA08972@mass.cdrom.com> <38CC07F7.B55F708B@gorean.org>

Just how valuable *IS* it? I can come up with a test plan. 

1. Get a pile of expendable ECC DIMMs.
2. Get your hacked kernel loaded.
3. Get an LKM to do malloc() and free() in kernel space.
4. Open up your case and swap your DIMMs for the expendable modules. Make
   sure ECC mode is on.
5. Boot, load your LKM, and start having some non-corrected memory
   operations.
6. (this is where it gets fun)
   Fire up a propane (pen) torch with a low flame, and *gently* heat the
   DIMM.
7. Keep the tip of the flame 2-3 inches away from the DIMM, and keep it
   moving!
8. Keep the flame well clear of anything else, and be mindful of the
   practically invisible stream of hot gases that will flow over and past
   your DIMM, warming your motherboard, etc., if you aren't careful. You
   only want to fry a few of the weakest components on your DIMM. The
   emphasis is on heating the DIMM SLOWLY.
9. Shampoo, rinse, repeat. YMMV: continue until you get ECC errors or your
   machine crashes.
10. Replace the DIMM with a known-good DIMM and hack your kernel again if it
    didn't work.
11. If you manage to burn a DIMM that works except for a predictable ECC
    correction, KEEP IT!
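
For anyone who wants to try the memory-exercising part of step 3 before
writing the LKM, here is a rough userland sketch in C. It is only a stand-in:
it uses the libc malloc()/free(), not the kernel allocator, and the function
names (fill_pattern, verify_pattern, exercise_memory) are made up for
illustration. It writes a known pattern, reads it back, and counts
mismatches. Note that a single-bit error the chipset corrects is invisible
to this code; only uncorrected errors would show up as mismatches, which is
exactly why the kernel-side logging is still needed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill buf with a pattern derived from the word index and a seed. */
void fill_pattern(uint64_t *buf, size_t nwords, uint64_t seed)
{
    for (size_t i = 0; i < nwords; i++)
        buf[i] = seed ^ ((uint64_t)i * 0x9E3779B97F4A7C15ULL);
}

/* Re-read buf and count the words that no longer match the pattern.
 * volatile keeps the compiler from optimizing the reads away. */
size_t verify_pattern(const volatile uint64_t *buf, size_t nwords,
                      uint64_t seed)
{
    size_t bad = 0;

    for (size_t i = 0; i < nwords; i++)
        if (buf[i] != (seed ^ ((uint64_t)i * 0x9E3779B97F4A7C15ULL)))
            bad++;
    return bad;
}

/* Allocate, fill, verify, and free a buffer `passes` times.
 * Returns the total number of mismatched words, or -1 if an
 * allocation fails. */
long exercise_memory(size_t nwords, unsigned passes)
{
    long total_bad = 0;

    for (unsigned p = 0; p < passes; p++) {
        uint64_t *buf = malloc(nwords * sizeof(*buf));

        if (buf == NULL)
            return -1;
        /* Use a different seed each pass so stale data cannot pass. */
        fill_pattern(buf, nwords, 0xA5A5A5A5A5A5A5A5ULL + p);
        total_bad += (long)verify_pattern(buf, nwords,
                                          0xA5A5A5A5A5A5A5A5ULL + p);
        free(buf);
    }
    return total_bad;
}
```

Call exercise_memory() in a loop from main() with a buffer much larger than
the CPU caches (otherwise the reads mostly never reach the DIMM), and keep
it running while the DIMM warms up.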

Doug Barton wrote:
> 
>         CC'ing -hackers in case we can scare up some interest . . .
> 
> Mike Smith wrote:
> >
> > >       Hi.  I took a look over the archives and noticed this ancient
> > > thread.  (1998)  However, I checked the handbook and LINT for options on
> > > how FreeBSD logs ECC errors, but I could not find anything.  Has this
> > > finally been implemented?  Or is there currently no way for the OS to
> > > detect the # of corrections / detections of errors by DIMM slot?
> >
> > You're correct; there isn't.  It's a relatively simple task that's been
> > waiting for a junior hacker to come along and take it up.  It's also
> > devilishly difficult to _test_ such code...
> 
>         This would be a very valuable thing to have though (just to restate the
> obvious). We had a Sun machine go down at work with no symptoms at
> all... other than the log which showed that ECC errors were being caught
> and corrected (mostly) at a furious pace. If not for that log we would
> have spent hours testing possible reasons for the crash.
> 

-- 
--- -_______,,_ , , _ ,_  _ ,_ ,,_  _  _   _  _ . .______-
     -_____/||_>|_|/ \|_>/_\|_>||_>/ `/ \ / `/ \|\|\____-
       -__/-||  | |\_/| \\_ | \||  \_ \_/#\_ \_/| \ \__-


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message



