From owner-freebsd-stable@FreeBSD.ORG  Fri Jan 18 20:23:11 2013
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@FreeBSD.org
Date: Fri, 18 Jan 2013 13:23:00 -0700 (MST)
From: Warren Block <wblock@wonkity.com>
To: kpneal@pobox.com
Subject: Re:  Spontaneous reboots on Intel i5 and FreeBSD 9.0
In-Reply-To: <20130118173602.GA76438@neutralgood.org>
Message-ID: <alpine.BSF.2.00.1301181313560.1604@wonkity.com>
References: <CAJ-UWtSANRMsOqwW9rJ6Eebta6=AiHeNO6fhPO0mhYhZiMmn4A@mail.gmail.com>
 <op.wq3zxn038527sy@ronaldradial.versatec.local>
 <alpine.BSF.2.00.1301180758460.96418@wonkity.com>
 <1358527685.32417.237.camel@revolution.hippie.lan>
 <20130118173602.GA76438@neutralgood.org>
User-Agent: Alpine 2.00 (BSF 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-stable@FreeBSD.org, Ian Lepore <ian@FreeBSD.org>,
 Ronald Klop <ronald-freebsd8@klop.yi.org>

On Fri, 18 Jan 2013, kpneal@pobox.com wrote:

> On Fri, Jan 18, 2013 at 09:48:05AM -0700, Ian Lepore wrote:
>> I tend to agree, a machine that starts rebooting spontaneously when
>> nothing significant changed and it used to be stable is usually a sign
>> of a failing power supply or memory.
>
> Agreed.
>
>> But I disagree about memtest86.  It's probably not completely without
>> value, but to me its value is only negative:  if it tells you memory is
>> bad, it is.  If it tells you it's good, you know nothing.  Over the
>> years I've had 5 dimms fail.  memtest86 found the error in one of them,
>> but said all the others were fine in continuous 48-hour tests.  I even
>> tried running the tests on multiple systems.
>>
>> The thing that always reliably finds bad memory for me
>> is /usr/ports/math/mprime run in test/benchmark mode.  It often takes 24
>> or more hours of runtime, but it will find your bad memory.
>
> I've had "good" luck with gcc showing bad memory. If compiling a new kernel
> produces seg faults then I know I have a hardware problem. I've seen
> compilers at work failing due to bad memory as well.
>
> Some problems only happen with particular access patterns.  So if a compiler
> works fine then, like memtest86, it doesn't say anything about the health
> of the hardware.

Most test tools are like that.  They might diagnose something as bad, 
but they often can't prove it is good.  SMART has a reputation for not 
finding any problems on disks that are failing, and capacitors that 
aren't swollen or leaking still may not be working.

But diagnostic tools can at least give a hint.  In my case, memtest 
indicated a problem--a big problem.  I removed one DIMM at random (there 
were only two) and the problems and memtest errors both went away.
When I put that DIMM back in, both came back.