From owner-freebsd-hackers@FreeBSD.ORG  Thu May  6 13:11:42 2010
From: Andrew Duane <aduane@juniper.net>
To: Atom Smasher <atom@smasher.org>, "freebsd-hackers@freebsd.org"
	<freebsd-hackers@freebsd.org>
Date: Thu, 6 May 2010 09:10:16 -0400
Subject: RE: bad RAM? prove it with a crash dump?

Atom Smasher wrote:
> On Thu, 6 May 2010, Boris Kochergin wrote:
>
>> My experience with bad memory is that if it causes the machine to
>> crash, it won't always happen while the machine is running the same
>> process (or kernel thread)--so look for it crashing in a wide
>> variety of places--and upon inspection of the core dump, a pointer
>> somewhere will be pointing to garbage.
> ============
>
> so really i'd need to collect two or more crash dumps, and if they
> point to different addresses then i can reasonably say the RAM is bad?
>
> thanks...

It's not just that the pointers point to different addresses; it's that
garbage shows up in many completely independent places. For example, you
pull bad registers or return addresses off the stack, or you find garbage
in random, unrelated buffers, structures, and pointers. On the other
hand, if you often have garbage in some structure's "foo" pointer, that
indicates a problem (maybe locking) in how your code manages setting
that foo pointer. It's a subtle difference.
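
As a rough illustration of that distinction, here is a minimal Python
sketch that tallies where garbage was found across several dumps. The
site labels, the classify_sites name, and the 50% cutoff are all made up
for illustration; the labels would come from your own reading of each
dump, not from any tool.

```python
from collections import Counter


def classify_sites(sites, dominant_share=0.5):
    """Guess whether corruption sites look scattered or concentrated.

    sites: one hand-assigned label per corrupted location found in a
    dump, e.g. "struct bar.foo" or "stack return address".
    dominant_share: hypothetical cutoff, not a calibrated value.
    """
    if len(sites) < 2:
        return "not enough data"
    counts = Counter(sites)
    _, top_count = counts.most_common(1)[0]
    # One location that keeps recurring points at the code that manages
    # it (locking, lifetime), not at the memory itself.
    if top_count >= 2 and top_count / len(sites) >= dominant_share:
        return "concentrated"
    # Garbage in unrelated places is what bad RAM tends to look like.
    return "scattered"
```

A "concentrated" result is a hint to go read the code around that one
field; a "scattered" result is only consistent with bad memory, not
proof of it.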

It is also useful to make sure that the garbage itself is different each
time. As mentioned before, a single-bit error in an otherwise valid
value, or maybe a missing or scrambled byte, is a good indication of
memory problems. If random places are often overwritten with something
else, that could just be another piece of misbehaving code writing
someplace it shouldn't. I've often found code that writes a buffer into,
e.g., a piece of memory it no longer owns. That looks like memory
corruption until you realize the garbage is always something specific,
like a vnode structure.
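
To make the "single bit error" check concrete, here is a small Python
sketch that compares a corrupted word against the value it should have
held. It assumes you can infer the expected value (say, from a
neighbouring list entry or a known symbol address), which is not always
possible. The describe_difference name and the 64-bit default width are
my own choices for illustration.

```python
def describe_difference(expected, observed, width=64):
    """Classify how a corrupted word differs from its expected value."""
    mask = (1 << width) - 1
    diff = (expected ^ observed) & mask  # set bits mark the damage
    flipped = bin(diff).count("1")
    if flipped == 0:
        return "identical"
    if flipped == 1:
        return "single-bit flip at bit %d" % (diff.bit_length() - 1)
    # Find the byte holding the lowest differing bit, then check that
    # no differing bit lies above that byte.
    low_bit = (diff & -diff).bit_length() - 1
    byte_index = low_bit // 8
    if diff >> (byte_index * 8) <= 0xFF:
        return "one damaged byte (byte %d)" % byte_index
    return "multi-byte difference"
```

For example, describe_difference(0xffffff8001234560, 0xffffff8001234564)
reports a single-bit flip at bit 2. A multi-byte difference is where I
would start looking for a stray write from other code rather than
blaming the RAM.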

/Andrew