Date: Thu, 21 Sep 2000 11:40:32 -0500 (CDT)
From: Chris Dillon
To: Michael Allman
Cc: BSD, stable@FreeBSD.ORG
Subject: Re: Constant panics on 4.1-STABLE!

On Thu, 21 Sep 2000, Michael Allman wrote:

> On Thu, 21 Sep 2000, Chris Dillon wrote:
>
> > On Wed, 20 Sep 2000, BSD wrote:
> >
> > > Are you saying all 3 sticks are bad at 133MHz (KA7) and one
> > > or more is bad at 66MHz (BP6)? The likelihood of that is
> > > extremely small.
> >
> > You said you were running a PIII on that BP6. Therefore, you would
> > have to be running it at either 100MHz or 133MHz (which the BX doesn't
> > officially support, but it works pretty well anyway).
> >
> > Also, it might not be that the memory is bad, but just out of spec for
> > your systems. For example, if your system is expecting RAM that it
> > can use CAS2 timings with but you have CAS3 RAM, that is going to
> > cause problems. If this is the case, the EEPROMs on the sticks might
> > be programmed with incorrect timing information. Tell your systems to
> > ignore the EEPROM (SPD), and try manually setting the most
> > conservative memory timings you can in each of your systems.
>
> I am having problems with random panics/reboots as well. I am using two
> sticks of Corsair 128MB ECC memory. My motherboard uses the GX chipset.
> Crashes occur when I am using both sticks and one or the other stick.
> Considering that I have been using this memory reliably for about a year I
> find it hard to believe that both sticks would go bad simultaneously. I
> have been using CAS3, ECC settings in my bios.

It probably isn't the memory, then (Corsair is pretty good).

> > > Also, a 512MB stick of RAM would cost me $1,600CAD. Sigh.
> > > That's not going to happen anytime soon. Furthermore, I stress
> > > tested each stick of RAM, with make -j64 buildworld. Nothing
> > > failed there. The panics happened when the system was just doing
> > > its normal tasks. I'll try to post more detailed reports
> > > (including crash dumps).
> >
> > A 'make world' is a pretty good way to stress-test things, but it's far
> > from perfect. I've had flaky systems survive multiple 'make world'
> > sessions but still fail unexpectedly at other times.
>
> My experience so far has been that the panics occur independently of
> system load. Also, I often do not get a crash dump, even though I have my
> system configured for that.
>
> > BTW, crash dumps will be meaningless if this really is a hardware
> > problem.
>
> Equivalent to this statement is the following. If the crash dumps are not
> meaningless (meaningful?), then this is not a hardware problem. I would
> say it is still worthwhile to look at crash dumps.

Wrong. You have no way of knowing just by looking at a crashdump if the
problem was caused by random memory corruption, CPU flakiness, or
whatever, or if it was a real software problem. Crashdumps are only
useful if you _know_ flaky hardware wasn't the culprit. If you hand a
developer a crashdump caused by hardware flakiness, you are going to
send them on a wild goose-chase and they will never find a real problem
with the code where the failure supposedly occurred. If they're really
lucky, they'll look at a crashdump and say "It is not at all possible
for this to have happened because of software.
It must have been caused by hardware". I wouldn't put that burden on
any of these developers, however. This has already happened at least a
few times, and usually the developer wastes days or weeks looking for a
non-existent problem until the original finder of the problem comes back
and says "Duh, I'm REALLY sorry guys, but I found the culprit, it was my
hardware". You can find at least a few of these archived in our mailing
lists.

> > These kinds of problems are exactly why I spend the few extra bucks to
> > buy ECC RAM for my important systems, even my workstation at home.
> > It's worth it. If I have problems and I have ECC enabled, I can be
> > fairly sure it isn't the RAM. Usually I just enable EC
> > (Error-checking only) on my system at home, so if I start getting a
> > lot of NMI panics I know that my memory is starting to flake out on
> > me, at which point I can turn on ECC and start shopping for new
> > memory. So far, I've never gotten one. This is probably due to me
> > running PC133 memory on only a 66MHz bus. :-)
>
> I have ECC RAM with ECC enabled. I get crashes anyway. Would you say
> then that it's not the RAM?

Then it most likely isn't the RAM. That does not, however, rule out the
CPU, support chipsets, or even a weird expansion card that is spewing
enough RF noise to cause data corruption on nearby devices.

-- 
Chris Dillon - cdillon@wolves.k12.mo.us - cdillon@inter-linc.net
FreeBSD: The fastest and most stable server OS on the planet.
For Intel x86 and Alpha architectures. ( http://www.freebsd.org )