From owner-freebsd-questions@FreeBSD.ORG  Sun Feb 22 15:48:37 2015
Return-Path: <owner-freebsd-questions@FreeBSD.ORG>
Delivered-To: freebsd-questions@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115])
 (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))
 (No client certificate requested)
 by hub.freebsd.org (Postfix) with ESMTPS id EFBDB4CC
 for <freebsd-questions@freebsd.org>; Sun, 22 Feb 2015 15:48:36 +0000 (UTC)
Received: from cosmo.uchicago.edu (cosmo.uchicago.edu [128.135.52.97])
 by mx1.freebsd.org (Postfix) with ESMTP id AD090C4
 for <freebsd-questions@freebsd.org>; Sun, 22 Feb 2015 15:48:36 +0000 (UTC)
Received: by cosmo.uchicago.edu (Postfix, from userid 48)
 id 8B9F9CB8C9F; Sun, 22 Feb 2015 09:48:30 -0600 (CST)
Received: from 76.193.19.10 (SquirrelMail authenticated user valeri)
 by cosmo.uchicago.edu with HTTP;
 Sun, 22 Feb 2015 09:48:30 -0600 (CST)
Message-ID: <9134.76.193.19.10.1424620110.squirrel@cosmo.uchicago.edu>
In-Reply-To: <20150222104425.GA44573@home.parts-unknown.org>
References: <20150221224006.GA5501@home.parts-unknown.org>
 <09da5ec0816e098badc49432c802dc18@sdf.org>
 <390c4c0547fc27e91d28872d29aa2e04@sdf.org>
 <20150222091956.fd1ec914.freebsd@edvax.de>
 <20150222104425.GA44573@home.parts-unknown.org>
Date: Sun, 22 Feb 2015 09:48:30 -0600 (CST)
Subject: Re: why would I get a segmentation fault on one system but not the 
 other?
From: "Valeri Galtsev" <galtsev@kicp.uchicago.edu>
To: "David Benfell" <benfell@parts-unknown.org>
Reply-To: galtsev@kicp.uchicago.edu
User-Agent: SquirrelMail/1.4.8-5.el5.centos.7
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Priority: 3 (Normal)
Importance: Normal
Cc: cpet <cpet@sdf.org>, Polytropon <freebsd@edvax.de>,
 freebsd-questions@freebsd.org
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.18-1
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/options/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions/>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sun, 22 Feb 2015 15:48:37 -0000


On Sun, February 22, 2015 4:44 am, David Benfell wrote:
> On Sun, Feb 22, 2015 at 09:19:56AM +0100, Polytropon wrote:
>> On Sat, 21 Feb 2015 17:03:50 -0600, cpet wrote:
>> > As well as don't use stable on a production box as STABLE doesn't mean
>> > what it means.
>>
>> STABLE means that the API/ABI is stable. Unlike HEAD (CURRENT),
>> STABLE still is actually _stable_ in most cases, so it's a valid
>> solution for production systems (given that you're prepared well,
>> and you know what you're doing). I'm running STABLE on few
>> production machines myself (where this is needed), but I usually
>> prefer (and often recommend) using RELEASE and add the security
>> patches when they are available.
>>
> Thinking about this more, I'm inclined to think my problem is not with
> the base system. I haven't seen *any* crashes with stuff that can be
> clearly identified as being in the base system, let alone the kernel.
>
> My memory test has just completed a 4th pass with zero errors. It's
> now been running for 7.5 hours.
>

How long does the box run before it segfaults? Some memory errors occur
only rarely, so a short memtest run can come back clean simply because
it never hit them.
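
To put rough numbers on that, here is a minimal sketch. It assumes errors
arrive independently at a constant average rate (a Poisson model, chosen
only for illustration). The function name and the 24-hour error interval
are hypothetical; only the 7.5 hours comes from the memtest run quoted
above.

```python
import math


def prob_clean_run(test_hours, mean_hours_between_errors):
    """Chance of seeing zero errors in a test of the given length.

    Assumes errors arrive independently at a constant average rate
    (Poisson model); real hardware faults may not behave this way.
    """
    return math.exp(-test_hours / mean_hours_between_errors)


# 7.5 h is the test length quoted above; 24 h between errors is a
# made-up figure for illustration only.
print(round(prob_clean_run(7.5, 24.0), 2))  # prints 0.73
```

Under those assumptions a fault that shows up about once a day would
still pass a 7.5-hour test roughly three times out of four.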

What is the load on the machine when the segfault happens? During
memtest86 the load is "zero". Under real server load you may be heating
the interior of the box, and the memory controller in particular, to
higher temperatures, which increases the chance of malfunction.

Do you have ECC memory or non-ECC? If non-ECC, can you replace it with
ECC? (Some memory controllers accept both.) Is it possible that you have
a mixture of different types of RAM attached to the same memory
controller? (I've seen even different brands claiming the same specs
cause occasional malfunctions.) Also, which slots do you use for RAM? If
not all slots have RAM, start by filling the slots that are farthest
from the memory controller (which is on the CPU substrate these days,
hence farthest from the CPU). If you leave the farthest slots empty, you
have an open (unterminated) portion of the transmission line, which
causes reflections that interfere with the signal and lead to trouble.
Some fancy system boards do have memory bus terminators, so what I said
about slots doesn't matter for them, but the majority of boards do not.
If the hardware is a suspect, I would begin with a minimal amount of
known good RAM.

Swapping RAM between the good and bad machines is another thing to try.
I, however, would instead swap the hard drives and see which machine
starts failing after that. This way you will know for sure whether
software (+ hard drive) is to blame (if the other machine starts
failing) or hardware (if the same machine keeps failing with the system
from the good machine).
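
The drive-swap logic can be written down as a small decision function.
This is only a sketch of the reasoning above. The function and argument
names are hypothetical, and the two "inconclusive" outcomes are an added
assumption for the cases the swap does not settle.

```python
def blame_after_drive_swap(bad_box_still_fails, good_box_now_fails):
    """Interpret the result of swapping drives between two machines.

    bad_box_still_fails: the originally failing machine, now booting
        the good machine's drive, still segfaults.
    good_box_now_fails: the originally good machine, now booting the
        bad machine's drive, has started to segfault.
    """
    if good_box_now_fails and not bad_box_still_fails:
        # The fault followed the drive to the other machine.
        return "software (+ hard drive)"
    if bad_box_still_fails and not good_box_now_fails:
        # The fault stayed with the chassis despite a known good system.
        return "hardware"
    # Both fail or neither fails: the swap alone does not decide it.
    return "inconclusive"


print(blame_after_drive_swap(True, False))   # prints hardware
print(blame_after_drive_swap(False, True))   # prints software (+ hard drive)
```

Give each machine at least as long as the usual time to failure before
reading off the result.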

Good luck!

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++