Date: Sun, 22 Feb 2015 09:48:30 -0600 (CST)
Subject: Re: why would I get a segmentation fault on one system but not the other?
From: "Valeri Galtsev"
To: "David Benfell"
Reply-To: galtsev@kicp.uchicago.edu
Cc: cpet, Polytropon, freebsd-questions@freebsd.org

On Sun, February 22, 2015 4:44 am, David Benfell wrote:
> On Sun, Feb 22, 2015 at 09:19:56AM +0100, Polytropon wrote:
>> On Sat, 21 Feb 2015 17:03:50 -0600, cpet wrote:
>> > As well as don't use stable on a production box as STABLE doesn't mean
>> > what it means.
>>
>> STABLE means that the API/ABI is stable. Unlike HEAD (CURRENT),
>> STABLE still is actually _stable_ in most cases, so it's a valid
>> solution for production systems (given that you're prepared well,
>> and you know what you're doing). I'm running STABLE on few
>> production machines myself (where this is needed), but I usually
>> prefer (and often recommend) using RELEASE and add the security
>> patches when they are available.
>>
> Thinking about this more, I'm inclined to think my problem is not with
> the base system. I haven't seen *any* crashes with stuff that can be
> clearly identified as being in the base system, let alone the kernel.
>
> My memory test has just completed a 4th pass with zero errors. It's
> now been running for 7.5 hours.
>

How long does the box run before a segfault? Some memory errors occur
only with low probability, so a short memtest run may come back clean
simply because it did not run long enough to catch the rarer errors.
What is the load on the machine when the segfault happens? During
memtest86 the load is "zero".
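
To put rough numbers on the point about rare errors, here is a minimal
sketch. It assumes each memtest pass independently triggers the fault
with the same probability; that probability, and both function names,
are hypothetical illustrations and not measurements from the machine in
question.

```python
import math


def prob_all_passes_clean(p_detect_per_pass: float, passes: int) -> float:
    """Chance that every pass misses an intermittent fault.

    Assumes (hypothetically) that passes are independent and that each
    one triggers the fault with the same probability p_detect_per_pass.
    """
    if not 0.0 <= p_detect_per_pass <= 1.0:
        raise ValueError("p_detect_per_pass must be between 0 and 1")
    if passes < 0:
        raise ValueError("passes must be non-negative")
    return (1.0 - p_detect_per_pass) ** passes


def passes_needed(p_detect_per_pass: float, confidence: float) -> int:
    """Smallest number of passes that catches the fault at least once
    with the requested confidence, under the same assumptions."""
    if not 0.0 < p_detect_per_pass <= 1.0:
        raise ValueError("p_detect_per_pass must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if p_detect_per_pass == 1.0:
        return 1  # a fault that always shows up is caught on the first pass
    # Solve 1 - (1 - p)^n >= confidence for the smallest integer n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_detect_per_pass))
```

Under these assumed numbers, prob_all_passes_clean(0.1, 4) is about
0.66, so four clean passes would still leave roughly a two-in-three
chance that the fault was simply never triggered, and
passes_needed(0.1, 0.95) comes out to 29. This says nothing about load
or temperature, which memtest does not reproduce.
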
During an actual server run, you may be heating the interior of the
box, and in particular the memory controller, to higher temperatures,
which increases the chance of malfunction.

Do you have ECC or non-ECC memory? If non-ECC, can you replace it with
ECC? (Some memory controllers accept both.) Is it possible that you have
a mixture of different types of RAM attached to the same memory
controller? (I've seen even different brands claiming the same specs
cause occasional malfunctions.)

Also, which slots do you use for RAM? If not all slots are populated,
start filling the slots that are farthest from the memory controller
(which is on the CPU substrate these days, hence farthest from the CPU).
If you leave the farthest slots open, you will have an open
(unterminated) portion of transmission line causing reflections that
interfere with the signal, leading to trouble. Some fancy system boards
do have memory bus terminators, so what I said about slots doesn't
matter for them, but the majority of boards do not.

If the hardware is a suspect, I would begin with a minimal amount of
known good RAM. Swapping RAM between the good and bad machines is
another thing to try. I, however, would instead try swapping hard
drives and see which of the machines starts failing after that. This
way you will know for sure whether software (+ hard drive) is to blame
(if the other machine starts failing) or hardware (if the same machine,
now running the system from the good machine, keeps failing).

Good luck!

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++