From owner-freebsd-hackers Thu Jan 22 00:02:15 1998
Date: Thu, 22 Jan 1998 02:01:48 -0600 (CST)
Message-Id: <199801220801.CAA00609@detlev.UUCP>
To: tlambert@primenet.com
CC: mrcpu@cdsnet.net, hackers@FreeBSD.ORG
In-reply-to: <199801220637.XAA07251@usr09.primenet.com> (message from Terry Lambert on Thu, 22 Jan 1998 06:37:03 +0000 (GMT))
Subject: Re: Had the shotgun out and pointed at my -current/SMP box...
From: Joel Ray Holveck
Reply-to: joelh@gnu.org
References: <199801220637.XAA07251@usr09.primenet.com>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk

>> [ I am messing around with a 3 processor P6/233 system to potentially
>>   do some heavy-duty database work, and it hasn't been able to
>>   complete a make buildworld yet.  Crashes with a wide variety of
>>   errors.  Pop in the NT drive, works fine.  FreeBSD crash.
>>   Just about to shoot the damn thing, and...]
> [ ... memory problems ... ]
> One wonders what NT wasn't telling you... if it's bad, it's bad.
> I think maybe the difference was that under NT is was undetectably
> bad.  Which is bad.

Could be happenstance.

For instance, suppose that one bit (bit A), when undergoing a
transition from 0 to 1, causes a column-adjacent bit (bit B) to become
stuck at 1.
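
To make that fault model concrete, here is a minimal sketch in Python.
It is a toy simulation, not a real RAM tester: the FaultyRAM class, the
bit positions 5 and 6, and the two test functions are all made up for
illustration.  It compares a fill-only check (all 0s, then all 1s) with
a simple march-style pass that reads each bit just before flipping it.

```python
class FaultyRAM:
    """Toy bit-addressable memory with one coupling fault (hypothetical)."""

    def __init__(self, size, bit_a, bit_b):
        self.bits = [0] * size
        self.bit_a = bit_a          # aggressor: its 0 -> 1 transition is fatal
        self.bit_b = bit_b          # victim: becomes stuck at 1 afterwards
        self.b_stuck = False

    def write(self, addr, value):
        if addr == self.bit_a and self.bits[addr] == 0 and value == 1:
            # Bit A goes from 0 to 1: bit B is now stuck at 1.
            self.b_stuck = True
            self.bits[self.bit_b] = 1
        if addr == self.bit_b and self.b_stuck:
            value = 1               # writes to a stuck bit have no effect
        self.bits[addr] = value

    def read(self, addr):
        return self.bits[addr]


def fill_test(ram, size):
    """Write all 0s and verify, then all 1s and verify.
    Returns the first bad address, or None if nothing was found."""
    for value in (0, 1):
        for addr in range(size):
            ram.write(addr, value)
        for addr in range(size):
            if ram.read(addr) != value:
                return addr
    return None


def march_test(ram, size):
    """March-style pass: read the expected value, then flip the bit,
    walking up and then down.  Returns the first bad address or None."""
    up = range(size)
    down = range(size - 1, -1, -1)
    for addr in up:
        ram.write(addr, 0)
    for order, expect in ((up, 0), (up, 1), (down, 0), (down, 1)):
        for addr in order:
            if ram.read(addr) != expect:
                return addr
            ram.write(addr, 1 - expect)
    for addr in up:
        if ram.read(addr) != 0:
            return addr
    return None


if __name__ == "__main__":
    SIZE = 16
    print("fill test: ", fill_test(FaultyRAM(SIZE, 5, 6), SIZE))
    print("march test:", march_test(FaultyRAM(SIZE, 5, 6), SIZE))
```

With bit A at 5 and bit B at 6, the fill-only check reports nothing,
because bit B is expected to read 1 by the time it gets stuck at 1.
The march-style pass flags bit 6, since it still expects a 0 there
when bit A makes its transition.
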

Now, suppose that his NT kernel loads at an address such that part of
the code happens to keep bit A at 0.  Since bit A never undergoes its
fatal transition, bit B continues to work perfectly.  Recall that the
same thing could have happened under FreeBSD, depending on where the
nails land.

Just a consideration.  I've also had machines which would apparently
work perfectly well until I loaded more than n devices, at which point
they would fail fatally.  (In each of my test cases, loading the nth
device happened to put I/O buffers on top of the bad RAM, which until
then had been sitting under unused memory.)  I've had other machines
that had worked fine for years under Win3.1, but died horribly the
minute we installed Win95 (frequently taking the system registry with
them, meaning I had to do a complete OS reinstall... what's that
saying about eggs and baskets?).

My point is that just because something works doesn't mean it's a good
component, just as the fact that swapping a component makes the system
work doesn't mean that component was at fault.

I still use RAM tester programs, simply because, in my experience,
they tend to find faulty RAM more reliably than *any* other method
I've used.  (It also helps to use a good RAM tester, rather than some
dink that just sets all bits to 0 and then all bits to 1.  I like
Qualitas RAMExam.)

Would other people here like to see a boot-sector RAM tester?

-- 
Joel Ray Holveck - joelh@gnu.org - http://www.wp.com/piquan
   Fourth law of programming:
   Anything that can go wrong wi
sendmail: segmentation violation - core dumped