From owner-freebsd-doc  Fri Feb 18 11:10:58 2000
Delivered-To: freebsd-doc@freebsd.org
Received: from zippy.cdrom.com (zippy.cdrom.com [204.216.27.228])
	by hub.freebsd.org (Postfix) with ESMTP
	id 3E7E637B9D9; Fri, 18 Feb 2000 11:10:55 -0800 (PST)
	(envelope-from jkh@zippy.cdrom.com)
Received: from zippy.cdrom.com (jkh@localhost [127.0.0.1])
	by zippy.cdrom.com (8.9.3/8.9.3) with ESMTP id LAA79459;
	Fri, 18 Feb 2000 11:10:38 -0800 (PST)
	(envelope-from jkh@zippy.cdrom.com)
To: bgingery@gtcs.com
Cc: freebsd-doc@FreeBSD.ORG, freebsd-hackers@FreeBSD.ORG
Subject: Re: Recommended addition to FAQ (Troubleshooting) 
In-reply-to: Your message of "Fri, 18 Feb 2000 09:59:20 MST."
             <200002181659.JAA28578@home.gtcs.com>
Date: Fri, 18 Feb 2000 11:10:38 -0800
Message-ID: <79456.950901038@zippy.cdrom.com>
From: "Jordan K. Hubbard" <jkh@zippy.cdrom.com>
Sender: owner-freebsd-doc@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

The situation here, I hate to say, is that you were simply very lucky
in having a software memory tester show you anything at all.

If your experience had been more typical, you would have run memtest86
and it would have declared your memory to be free of errors.  Then
you'd have gone right on having problems and losing more hair until
you finally just came back to the memory and swapped it out on
suspicion.  Bingo, the problems are suddenly fixed and you're dragging
memtest86 to KDE's trashcan with a resolve to never trust it again.

The reason software memory testers are so generally ineffectual is
that a whole bunch of things get in their way.  Leaving aside for the
moment the nasty problem of having your memory checker loaded into the
very memory that's bad, the cache also seriously gets in the way (and
I'll bet you never even thought to turn both the L1 and L2 caches off,
did you? :-) by masking errors in a way that's transparent to the
program.  How's the tester supposed to know whether its accesses are
being cached, or how much cache it has to "defeat" to get meaningful
access to main memory?  It can't, really, at least not in a way that's
truly foolproof or workable across the entire range of Intel/AMD CPUs
it might be run on, and that's why serious bench techs use hardware
memory testers exclusively.
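
To make the cache-masking point concrete, here is a toy sketch in
Python.  It is a simulation only, not a real memory tester, and every
name in it (FaultyRam, WriteBackCache, pattern_test) is made up for
illustration.  It assumes a single stuck-at-0 bit in "RAM" and a
write-back cache at least as large as the region under test, so every
read-back is a cache hit.

```python
# Toy model only: a "RAM" with one stuck-at-0 bit behind a write-back cache.
# All class and function names here are hypothetical.

class FaultyRam:
    def __init__(self, size, bad_addr, stuck_mask):
        self.cells = [0] * size
        self.bad_addr = bad_addr
        self.stuck_mask = stuck_mask  # bits that always read back as 0

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        value = self.cells[addr]
        if addr == self.bad_addr:
            value &= ~self.stuck_mask & 0xFF  # the fault shows up on read
        return value


class WriteBackCache:
    """Holds every written byte and never evicts (cache >= tested region)."""

    def __init__(self, ram):
        self.ram = ram
        self.lines = {}

    def write(self, addr, value):
        self.lines[addr] = value & 0xFF  # RAM is not touched at all

    def read(self, addr):
        if addr in self.lines:           # hit: answered from the cache
            return self.lines[addr]
        return self.ram.read(addr)       # miss: goes to RAM


def pattern_test(mem, size, patterns=(0x00, 0xFF, 0x55, 0xAA)):
    """Write each pattern everywhere, read it back, return failing addresses."""
    bad = set()
    for pattern in patterns:
        for addr in range(size):
            mem.write(addr, pattern)
        for addr in range(size):
            if mem.read(addr) != pattern:
                bad.add(addr)
    return sorted(bad)


SIZE = 64
ram = FaultyRam(SIZE, bad_addr=17, stuck_mask=0x04)
direct_result = pattern_test(ram, SIZE)                  # talks to RAM
cached_result = pattern_test(WriteBackCache(ram), SIZE)  # talks to cache
print("uncached:", direct_result)
print("cached:  ", cached_result)
```

The same test loop flags address 17 when it talks to the RAM directly
and reports nothing when the cache sits in between, because the test
only ever reads back what the cache remembered.  Real caches do evict
and write back, so real testers aren't always this blind, but the
program has no portable way of knowing which case it is in.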

I've used all kinds of software memory checkers, from "CheckIt" to
highly proprietary packages that cost even more money, and the only
thing they managed to convince me of is that swapping in known-good
memory is the best and fastest way out of these situations.  Unless
you have a hardware memory tester available, trying to test the memory
yourself is just too likely to give you a false sense of security and
send you down more blind alleys.  I've even put known-BAD memory into
boxes and had CheckIt tell me "looks good to me, boss!", just to
confirm my suspicion that it had lied to me before.  Running a
software memory tester with the caches disabled is also very slow, and
swapping the memory is generally a whole lot faster than that.  I'm
impatient. :)

So, to summarize, I am actually somewhat against the idea of including
tools like this, on the grounds that they can just as easily convince
people of the wrong things while they're debugging a problem.  I also
don't look forward to having to argue with users who've just run such
tests and are still getting signal 11s but now refuse to believe that
the memory could be bad because "they checked it."  If I then turn
around and tell them not to trust the tool I also stuck on the CD for
them, they're going to ask why I put it there in the first place, and a
nice long argument will then ensue instead of us just replacing that
damn memory and moving on. :-)

- Jordan

