From owner-freebsd-hackers Mon Mar 13 14:28:10 2000
Delivered-To: freebsd-hackers@freebsd.org
Received: from notrecords.com (228-121.ppp.ripco.net [209.100.228.121])
	by hub.freebsd.org (Postfix) with ESMTP id 1B56837B5A8
	for ; Mon, 13 Mar 2000 14:28:00 -0800 (PST)
	(envelope-from aphor@ripco.com)
Received: from ripco.com (nell.notrecords.com [192.168.1.123])
	by notrecords.com (8.9.3/8.9.3) with ESMTP id QAA03464
	for ; Mon, 13 Mar 2000 16:31:02 -0600 (CST)
	(envelope-from aphor@ripco.com)
Message-ID: <38CD6C05.69C086FE@ripco.com>
Date: Mon, 13 Mar 2000 16:30:29 -0600
From: Jeremy McMillan
Reply-To: aphor@ripco.com
Organization: Loose..
X-Mailer: Mozilla 4.7 [en] (X11; U; FreeBSD 3.4-STABLE i386)
X-Accept-Language: en
MIME-Version: 1.0
To: freebsd-hackers@FreeBSD.ORG
Subject: Re: Detecting ECC errors
References: <200003121934.LAA08972@mass.cdrom.com> <38CC07F7.B55F708B@gorean.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

Just how valuable *IS* it? I can come up with a test plan.

1. Get a pile of expendable ECC DIMMs.
2. Get your hacked kernel loaded.
3. Get an LKM to do malloc(); and free(); in kernelspace.
4. Open up your case, swap your DIMMs for expendable modules. Make sure
   ECC mode is on.
5. Boot, load your LKM and start having some non-corrected memory
   operations.
6. (this is where it gets fun) Fire up a propane (pen) torch with a low
   flame, and *gently* heat the DIMM.
7. Keep the tip of the flame 2-3 inches away from the DIMM, and keep it
   moving!
8. Keep the flame well clear of anything else, and be mindful of the
   practically invisible stream of hot gases that will flow over and
   past your DIMM, warming your motherboard, etc. if you aren't careful.
   You only want to fry a few of the weakest components on your DIMM.
   The emphasis is on heating the DIMM SLOWLY.
9. Shampoo, rinse, repeat. YMMV: continue until you get ECC errors or
   your machine crashes.
10. Replace the DIMM with a known-good DIMM and hack your kernel again
    if it didn't work.
11. If you manage to burn a DIMM that works except for a predictable ECC
    correction, KEEP IT!

Doug Barton wrote:
>
> CC'ing -hackers in case we can scare up some interest . . .
>
> Mike Smith wrote:
> >
> > > Hi. I took a look over the archives and noticed this ancient
> > > thread. (1998) However, I checked the handbook and LINT for options on
> > > how FreeBSD logs ECC errors, but I could not find anything. Has this
> > > finally been implemented? Or is there currently no way for the OS to
> > > detect the # of corrections / detections of errors by DIMM slot?
> >
> > You're correct; there isn't. It's a relatively simple task that's been
> > waiting for a junior hacker to come along and take it up. It's also
> > devilishly difficult to _test_ such code...
>
> This would be a very valuable thing to have though (just to restate the
> obvious). We had a Sun machine go down at work with no symptoms at
> all... other than the log which showed that ECC errors were being caught
> and corrected (mostly) at a furious pace. If not for that log we would
> have spent hours testing possible reasons for the crash.
>

--
---
-_______,,_ , , _ ,_ _ ,_ ,,_ _ _ _ _ . .______-
-_____/||_>|_|/ \|_>/_\|_>||_>/ `/ \ / `/ \|\|\____-
-__/-|| | |\_/| \\_ | \|| \_ \_/#\_ \_/| \ \__-

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-hackers" in the body of the message
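
Steps 3 and 5 of the test plan above ask for something that keeps allocating and touching memory while the DIMM is being stressed. The post gives no code for this, so what follows is only a rough userspace sketch of that allocate, write, read back, free loop, not the kernel module the post has in mind. The function name exercise_memory() and its parameters are hypothetical, and the sketch assumes that a mismatch on read-back is the only symptom visible from userspace; errors that the ECC hardware corrects silently will not show up here and would still need the kernel-side logging discussed in the thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace stand-in for the malloc()/free() exerciser in
 * step 3. Each pass allocates a buffer, fills it with a known pattern,
 * reads it back, and frees it.
 *
 * Returns the number of 64-bit words that read back wrong, or
 * (size_t)-1 if malloc() fails.
 */
size_t
exercise_memory(size_t nwords, unsigned passes)
{
	size_t bad = 0;

	for (unsigned p = 0; p < passes; p++) {
		/* Alternate patterns so every bit is driven both ways. */
		uint64_t pattern = (p & 1) ? 0xAAAAAAAAAAAAAAAAULL
		                           : 0x5555555555555555ULL;

		/* volatile keeps the compiler from eliding the read-back. */
		volatile uint64_t *buf = malloc(nwords * sizeof(*buf));
		if (buf == NULL)
			return (size_t)-1;

		/* Write phase: mix the index in so each word differs. */
		for (size_t i = 0; i < nwords; i++)
			buf[i] = pattern ^ (uint64_t)i;

		/* Read phase: count words that no longer match. */
		for (size_t i = 0; i < nwords; i++)
			if (buf[i] != (pattern ^ (uint64_t)i))
				bad++;

		free((void *)buf);
	}
	return bad;
}
```

On healthy hardware a call such as exercise_memory(1 << 20, 8) should return 0. For the loop to reach DRAM rather than being served from the CPU cache, the buffer would need to be considerably larger than the cache.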