Date: Fri, 18 Jan 2013 13:23:00 -0700 (MST)
From: Warren Block
To: kpneal@pobox.com
Cc: freebsd-stable@FreeBSD.org, Ian Lepore, Ronald Klop
Subject: Re: Spontaneous reboots on Intel i5 and FreeBSD 9.0
In-Reply-To: <20130118173602.GA76438@neutralgood.org>

On Fri, 18 Jan 2013, kpneal@pobox.com wrote:

> On Fri, Jan 18, 2013 at 09:48:05AM -0700, Ian Lepore wrote:
>> I tend to agree, a machine that starts rebooting spontaneously when
>> nothing significant changed and it used to be stable is usually a sign
>> of a failing power supply or
>> memory.
>
> Agreed.
>
>> But I disagree about memtest86. It's probably not completely without
>> value, but to me its value is only negative: if it tells you memory is
>> bad, it is. If it tells you it's good, you know nothing. Over the
>> years I've had 5 dimms fail. memtest86 found the error in one of them,
>> but said all the others were fine in continuous 48-hour tests. I even
>> tried running the tests on multiple systems.
>>
>> The thing that always reliably finds bad memory for me
>> is /usr/ports/math/mprime run in test/benchmark mode. It often takes 24
>> or more hours of runtime, but it will find your bad memory.
>
> I've had "good" luck with gcc showing bad memory. If compiling a new kernel
> produces seg faults then I know I have a hardware problem. I've seen
> compilers at work failing due to bad memory as well.
>
> Some problems only happen with particular access patterns. So if a compiler
> works fine then, like memtest86, it doesn't say anything about the health
> of the hardware.

Most test tools are like that. They might diagnose something as bad, but
they often can't prove it is good. SMART has a reputation for not
finding any problems on disks that are failing, and capacitors that
aren't swollen or leaking still may not be working.

But diagnostic tools can at least give a hint. In my case, memtest
indicated a problem--a big problem. I removed one DIMM at random (there
were only two) and the problems and memtest errors both went away. I put
the DIMM back, and both came back.