From owner-freebsd-stable  Thu Sep 21  9:40:47 2000
Delivered-To: freebsd-stable@freebsd.org
Received: from mail.wolves.k12.mo.us (mail.wolves.k12.mo.us [207.160.214.1])
	by hub.freebsd.org (Postfix) with ESMTP id 01E4137B43E
	for <stable@FreeBSD.ORG>; Thu, 21 Sep 2000 09:40:43 -0700 (PDT)
Received: from mail.wolves.k12.mo.us (cdillon@mail.wolves.k12.mo.us [207.160.214.1])
	by mail.wolves.k12.mo.us (8.9.3/8.9.3) with ESMTP id LAA30831;
	Thu, 21 Sep 2000 11:40:32 -0500 (CDT)
	(envelope-from cdillon@wolves.k12.mo.us)
Date: Thu, 21 Sep 2000 11:40:32 -0500 (CDT)
From: Chris Dillon <cdillon@wolves.k12.mo.us>
To: Michael Allman <msa@dinosaur.umbc.edu>
Cc: BSD <bsd@shell-server.com>, stable@FreeBSD.ORG
Subject: Re: Constant panics on 4.1-STABLE!
In-Reply-To: <Pine.BSF.4.21.0009211110090.16759-100000@dinosaur.umbc.edu>
Message-ID: <Pine.BSF.4.21.0009211125170.27801-100000@mail.wolves.k12.mo.us>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-stable@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.ORG

On Thu, 21 Sep 2000, Michael Allman wrote:

> On Thu, 21 Sep 2000, Chris Dillon wrote:
> 
> > On Wed, 20 Sep 2000, BSD wrote:
> > 
> > > 	Are you saying all 3 sticks are bad at 133MHz (KA7) and one
> > > or more is bad at 66MHz (BP6)?  The likelihood of that is
> > > extremely small.
> > 
> > You said you were running a PIII on that BP6.  Therefore, you would
> > have to be running it at either 100MHz or 133MHz (which the BX doesn't
> > officially support, but it works pretty well anyway).
> > 
> > Also, it might not be that the memory is bad, but just out of spec for
> > your systems.  For example, if your system is expecting RAM that it
> > can use CAS2 timings with but you have CAS3 RAM, that is going to
> > cause problems.  If this is the case, the EEPROMs on the sticks might
> > be programmed with incorrect timing information.  Tell your systems to
> > ignore the EEPROM (SPD), and try manually setting the most
> > conservative memory timings you can in each of your systems.
> 
> I am having problems with random panics/reboots as well.  I am using two
> sticks of Corsair 128MB ECC memory.  My motherboard uses the GX chipset.  
> Crashes occur whether I am using both sticks or just one or the other.
> Considering that I have been using this memory reliably for about a year I
> find it hard to believe that both sticks would go bad simultaneously.  I
> have been using CAS3, ECC settings in my bios.

It probably isn't the memory, then (Corsair is pretty good).

> > > Also, a 512MB stick of RAM would cost me $1,600CAD.  Sigh.  
> > > That's not going to happen anytime soon.  Furthermore, I stress
> > > tested each stick of RAM, with make -j64 buildworld.  Nothing
> > > failed there.  The panics happened when the system was just doing
> > > its normal tasks.  I'll try to post more detailed reports
> > > (including crash dumps).
> > 
> > A 'make world' is a pretty good way to stress-test things, but it's far
> > from perfect.  I've had flaky systems survive multiple 'make world'
> > sessions but still fail unexpectedly at other times.
> 
> My experience so far has been that the panics occur independently of
> system load.  Also, I often do not get a crash dump, even though I have my
> system configured for that.
> 
> > BTW, crash dumps will be meaningless if this really is a hardware
> > problem.
> 
> Equivalent to this statement is the following.  If the crash dumps are not
> meaningless (meaningful?), then this is not a hardware problem.  I would
> say it is still worthwhile to look at crash dumps.

Wrong.  You have no way of knowing just by looking at a crash dump
whether the problem was caused by random memory corruption, CPU
flakiness, or whatever, or whether it was a real software problem.
Crash dumps are only useful if you _know_ flaky hardware wasn't the
culprit.  If you hand a developer a crash dump caused by hardware
flakiness, you are going to send them on a wild-goose chase, and they
will never find a real problem with the code where the failure
supposedly occurred.  If they're really lucky, they'll look at the
crash dump and say "It is not at all possible for this to have
happened because of software.  It must have been caused by hardware".
I wouldn't put that burden on any of these developers, however.  This
has already happened at least a few times, and usually the developer
wastes days or weeks looking for a non-existent problem until the
person who originally reported it comes back and says "Duh, I'm REALLY
sorry guys, but I found the culprit, it was my hardware".  You can
find at least a few of these archived in our mailing lists.

> > These kinds of problems are exactly why I spend the few extra bucks to
> > buy ECC RAM for my important systems, even my workstation at home.  
> > It's worth it.  If I have problems and I have ECC enabled, I can be
> > fairly sure it isn't the RAM.  Usually I just enable EC
> > (Error-checking only) on my system at home, so if I start getting a
> > lot of NMI panics I know that my memory is starting to flake out on
> > me, at which point I can turn on ECC and start shopping for new
> > memory.  So far, I've never gotten one.  This is probably due to me
> > running PC133 memory on only a 66MHz bus. :-)
> 
> I have ECC RAM with ECC enabled.  I get crashes anyway.  Would you say
> then that it's not the RAM?

Then it most likely isn't the RAM.  That does not, however, rule out
the CPU, support chipsets, or even a weird expansion card that is
spewing enough RF noise to cause data corruption on nearby devices.
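
To illustrate why crashes with ECC enabled point away from the RAM,
here is a rough sketch of the single-error-correct, double-error-detect
idea behind ECC.  It is a toy 8-bit code protecting 4 data bits
(Hamming(7,4) plus an overall parity bit), not the wider code a real
chipset uses, and the names encode() and decode() are made up for this
example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit word: Hamming(7,4) plus overall parity."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    for bit in code[1:]:
        code[0] ^= bit                  # make the parity of all 8 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    overall = 0
    for bit in code:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity means exactly one bit flipped; the syndrome names it
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity but non-zero syndrome: two bits flipped.  This can be
        # detected but not repaired.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

A single flipped bit comes back as "corrected" with the original data,
which is why a stick that is only slightly flaky should not crash a
system running with ECC enabled.  Two flipped bits come back as
"uncorrectable", which is roughly the case that would show up as an
NMI panic.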


-- Chris Dillon - cdillon@wolves.k12.mo.us - cdillon@inter-linc.net
   FreeBSD: The fastest and most stable server OS on the planet.
   For Intel x86 and Alpha architectures. ( http://www.freebsd.org )

