From: Andrew Duane
To: Atom Smasher, "freebsd-hackers@freebsd.org"
Date: Thu, 6 May 2010 09:10:16 -0400
Subject: RE: bad RAM? prove it with a crash dump?

owner-freebsd-hackers@freebsd.org wrote:
> On Thu, 6 May 2010, Boris Kochergin wrote:
>
>> My experience with bad memory is that if it causes the machine to
>> crash, it won't always happen while the machine is running the same
>> process (or kernel thread)--so look for it crashing in a wide
>> variety of places--and upon inspection of the core dump, a pointer
>> somewhere will be pointing to garbage.
> ============
>
> so really i'd need to collect two or more crash dumps, and if they
> point to different addresses then i can reasonably say the RAM is bad?
>
> thanks...

It's not just that they point to different addresses; it's that there is garbage in many completely independent places. For example, you pull bad registers or return addresses off the stack, or you find garbage in random, unrelated buffers, structures, and pointers. On the other hand, if you often have garbage in some structure's "foo" pointer, that indicates a problem (maybe locking) in how your code manages setting that foo pointer. It's a subtle difference.

It is also useful to make sure that the garbage itself is different from one crash to the next. As mentioned before, a single-bit error in an otherwise valid value, or maybe a missing or scrambled byte, is a good indication of a memory problem. If random places are often overwritten with something else, that could just be another piece of misbehaving code writing someplace it shouldn't. I've often found code that writes a buffer into, e.g., a piece of memory it no longer owns. That looks like memory corruption until you realize the garbage is always something specific, like a vnode structure.

/Andrew
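
P.S. The "single bit error in an otherwise valid value" check above can be sketched roughly as follows. This is only an illustration, not a tool that exists anywhere: the function name classify_corruption and the idea that you already know the expected value (say, from a neighbouring pointer in the same structure) are assumptions on my part.

```python
def classify_corruption(expected: int, observed: int, width: int = 64) -> str:
    """Roughly classify how an observed word differs from the expected one.

    'expected' is hypothetical: in a real dump you would have to infer it,
    for example from a sibling pointer that should be nearly identical.
    """
    mask = (1 << width) - 1
    diff = (expected ^ observed) & mask   # bits that differ
    if diff == 0:
        return "identical"

    # Exactly one differing bit: the classic signature of a bad RAM cell.
    if bin(diff).count("1") == 1:
        return "single-bit flip"

    # All differing bits confined to one byte: a scrambled byte.
    bad_bytes = [i for i in range(width // 8) if (diff >> (8 * i)) & 0xFF]
    if len(bad_bytes) == 1:
        return "single-byte damage"

    # Anything else looks like a different value written over the top,
    # which points at misbehaving code rather than hardware.
    return "unrelated value"


if __name__ == "__main__":
    good = 0xFFFFFF8000123456
    print(classify_corruption(good, good ^ (1 << 17)))
    print(classify_corruption(good, 0xDEADC0DEDEADC0DE))
```

If the first two results show up in unrelated places across several dumps, suspect the RAM. If "unrelated value" keeps turning up and the value is always the same kind of thing, suspect code.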