From owner-freebsd-hackers  Thu Jan 22 00:02:15 1998
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Received: (from majordom@localhost)
          by hub.freebsd.org (8.8.8/8.8.8) id AAA23488
          for hackers-outgoing; Thu, 22 Jan 1998 00:02:15 -0800 (PST)
          (envelope-from owner-freebsd-hackers@FreeBSD.ORG)
Received: from wcc.wcc.net (wcc.wcc.net [208.6.232.10])
          by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id AAA23478
          for <hackers@FreeBSD.ORG>; Thu, 22 Jan 1998 00:02:11 -0800 (PST)
          (envelope-from piquan@wcc.wcc.net)
Received: from detlev.UUCP (newip44.wcc.net [206.104.247.44])
	by wcc.wcc.net (8.8.7/8.8.7) with ESMTP id BAA05285;
	Thu, 22 Jan 1998 01:58:33 -0600 (CST)
Received: (from joelh@localhost)
	by detlev.UUCP (8.8.8/8.8.7) id CAA00609;
	Thu, 22 Jan 1998 02:01:48 -0600 (CST)
	(envelope-from joelh)
Date: Thu, 22 Jan 1998 02:01:48 -0600 (CST)
Message-Id: <199801220801.CAA00609@detlev.UUCP>
To: tlambert@primenet.com
CC: mrcpu@cdsnet.net, hackers@FreeBSD.ORG
In-reply-to: <199801220637.XAA07251@usr09.primenet.com> (message from Terry
	Lambert on Thu, 22 Jan 1998 06:37:03 +0000 (GMT))
Subject: Re: Had the shotgun out and pointed at my -current/SMP box...
From: Joel Ray Holveck <joelh@gnu.org>
Reply-to: joelh@gnu.org
References:  <199801220637.XAA07251@usr09.primenet.com>
Sender: owner-freebsd-hackers@FreeBSD.ORG
Precedence: bulk


>>    [ I am messing around with a 3 processor P6/233 system to potentially
>> 	do some heavy-duty database work, and it hasn't been able to 
>> 	complete a make buildworld yet.  Crashes with a wide variety of
>> 	errors.  Pop in the NT drive, works fine.  FreeBSD crash.
>> 	Just about to shoot the damn thing, and...]
> [ ... memory problems ... ]
> One wonders what NT wasn't telling you... if it's bad, it's bad.
> I think maybe the difference was that under NT it was undetectably
> bad.  Which is bad.

Could be happenstance.  For instance, suppose that one bit (bit A),
when it goes from 0 to 1, causes a column-adjacent bit (bit B) to
become stuck at 1.  Now suppose that his NT kernel loads at an address
where the code happens to keep bit A at 0.  Since bit A never makes
its fatal transition, bit B continues to work perfectly.

Note that the same thing could just as easily have happened under
FreeBSD, depending on where the nails land.
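
To make the bit A / bit B scenario concrete, here is a toy sketch in
Python.  It is only an illustration under made-up assumptions: the
CoupledRAM class, the 8-bit size, and the choice of bits 3 and 4 for
A and B are all hypothetical, and real coupling faults are messier.

```python
# Toy model of a coupling fault: a 0->1 write to bit A sticks bit B at 1.
# Class name, sizes, and bit positions are hypothetical.

class CoupledRAM:
    """Bit-addressable memory with one coupling fault between two bits."""

    def __init__(self, size, bit_a, bit_b):
        self.bits = [0] * size
        self.bit_a = bit_a
        self.bit_b = bit_b
        self.b_stuck = False   # becomes True after A's fatal transition

    def write(self, addr, value):
        if addr == self.bit_a and self.bits[addr] == 0 and value == 1:
            # The fatal transition: B is stuck at 1 from now on.
            self.b_stuck = True
            self.bits[self.bit_b] = 1
        if addr == self.bit_b and self.b_stuck:
            value = 1          # writes of 0 to a stuck bit are lost
        self.bits[addr] = value

    def read(self, addr):
        return self.bits[addr]


def run_workload(ram, touched):
    """Write 1 then 0 to each touched bit; return the bits that misread."""
    bad = []
    for addr in touched:
        for value in (1, 0):
            ram.write(addr, value)
            if ram.read(addr) != value:
                bad.append(addr)
                break
    return bad


if __name__ == "__main__":
    # A "kernel" that happens never to raise bit A (bit 3).
    lucky = run_workload(CoupledRAM(8, 3, 4), [0, 1, 2, 4, 5, 6, 7])
    # A "kernel" that uses every bit, including A.
    unlucky = run_workload(CoupledRAM(8, 3, 4), list(range(8)))
    print("never touches A:", lucky)    # []
    print("touches A:      ", unlucky)  # [4]
```

Same RAM in both runs; only the access pattern differs.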

Just a consideration.  I've also had machines that would apparently
work perfectly well until I loaded more than n devices, at which point
they would fail fatally.  (In each of my test cases, the nth device
happened to push I/O buffers onto the bad RAM, which until then had
been sitting under unused memory.)  I've had others that worked fine
for years under Win3.1 but died horribly the minute we installed
Win95, frequently taking the system registry with them, meaning I had
to do a complete OS reinstall... what's that saying about eggs and
baskets?

My point is that just because something works doesn't mean it's a
good component, just as a system working after you swap a component
doesn't mean that component was at fault.  I still use RAM tester
programs, simply because, in my experience, they find faulty RAM more
reliably than *any* other method I've used.  (It also helps to use a
good RAM tester, such as Qualitas RAMExam, rather than some dink that
just sets all bits to 0 and then all bits to 1.)
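
As a rough sketch of why the all-0-then-all-1 approach falls short,
the Python below runs it against a simulated fault where a 0->1 write
to one bit sticks a neighbour at 1, and compares it with a pass in
the style of the March C- algorithm.  The FaultyRAM model and its
parameters are hypothetical, and this makes no claim about what
RAMExam or any other real tester does internally.

```python
# Hypothetical fault model: a 0->1 write to bit_a sticks bit_b at 1.
# Passing None for both gives a healthy memory.
class FaultyRAM:
    def __init__(self, size, bit_a, bit_b):
        self.bits = [0] * size
        self.bit_a, self.bit_b = bit_a, bit_b
        self.b_stuck = False

    def write(self, addr, value):
        if addr == self.bit_a and self.bits[addr] == 0 and value == 1:
            self.b_stuck = True
            self.bits[self.bit_b] = 1
        if addr == self.bit_b and self.b_stuck:
            value = 1              # stuck bit ignores writes of 0
        self.bits[addr] = value

    def read(self, addr):
        return self.bits[addr]


def naive_test(ram, size):
    """Fill with 0, verify, fill with 1, verify.  True means 'passed'."""
    for value in (0, 1):
        for addr in range(size):
            ram.write(addr, value)
        for addr in range(size):
            if ram.read(addr) != value:
                return False
    return True


def march_test(ram, size):
    """March C- style: read the old value, then write the new, per cell."""
    up, down = range(size), range(size - 1, -1, -1)
    for addr in up:
        ram.write(addr, 0)
    # Each element: (address order, expected old value, new value).
    elements = [(up, 0, 1), (up, 1, 0), (down, 0, 1), (down, 1, 0)]
    for order, expect, new in elements:
        for addr in order:
            if ram.read(addr) != expect:
                return False
            ram.write(addr, new)
    return all(ram.read(addr) == 0 for addr in up)


if __name__ == "__main__":
    print(naive_test(FaultyRAM(8, 3, 4), 8))        # True: fault missed
    print(march_test(FaultyRAM(8, 3, 4), 8))        # False: fault caught
    print(march_test(FaultyRAM(8, None, None), 8))  # True: healthy RAM
```

The naive test passes here because the stuck bit is only ever checked
while it is supposed to hold 1 anyway.  The march pass reads each cell
right before rewriting it, in both address orders, so the stuck bit
eventually gets read while it is supposed to hold 0.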

Would other people here like to see a boot-sector RAM tester?

-- 
Joel Ray Holveck - joelh@gnu.org - http://www.wp.com/piquan
   Fourth law of programming:
   Anything that can go wrong wi
sendmail: segmentation violation - core dumped