From owner-freebsd-questions@freebsd.org  Fri Aug 12 14:51:07 2016
Return-Path: <owner-freebsd-questions@freebsd.org>
Delivered-To: freebsd-questions@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id B5C36BB7ED5
 for <freebsd-questions@mailman.ysv.freebsd.org>;
 Fri, 12 Aug 2016 14:51:07 +0000 (UTC)
 (envelope-from galtsev@kicp.uchicago.edu)
Received: from cosmo.uchicago.edu (cosmo.uchicago.edu [128.135.70.90])
 by mx1.freebsd.org (Postfix) with ESMTP id 9739011D4
 for <freebsd-questions@freebsd.org>; Fri, 12 Aug 2016 14:51:07 +0000 (UTC)
 (envelope-from galtsev@kicp.uchicago.edu)
Received: by cosmo.uchicago.edu (Postfix, from userid 48)
 id 17E96CB8C8D; Fri, 12 Aug 2016 09:51:05 -0500 (CDT)
Received: from 128.135.52.6 (SquirrelMail authenticated user valeri)
 by cosmo.uchicago.edu with HTTP;
 Fri, 12 Aug 2016 09:51:05 -0500 (CDT)
Message-ID: <61294.128.135.52.6.1471013465.squirrel@cosmo.uchicago.edu>
In-Reply-To: <57ADDA5F.4000405@webtent.org>
References: <57ADDA5F.4000405@webtent.org>
Date: Fri, 12 Aug 2016 09:51:05 -0500 (CDT)
Subject: Re: Monitoring server for crashes
From: "Valeri Galtsev" <galtsev@kicp.uchicago.edu>
To: "Robert Fitzpatrick" <robert@webtent.org>
Cc: "FreeBSD" <freebsd-questions@freebsd.org>
Reply-To: galtsev@kicp.uchicago.edu
User-Agent: SquirrelMail/1.4.8-5.el5.centos.7
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: 8bit
X-Priority: 3 (Normal)
Importance: Normal
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions/>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 12 Aug 2016 14:51:07 -0000


On Fri, August 12, 2016 9:17 am, Robert Fitzpatrick wrote:
> We have a FreeBSD 10 server that keeps crashing every night. I have
> dumpdev set to AUTO in rc.conf, but I get nothing in the /var/crash
> folder, I don't have dumpdir defined. The messages log just cuts off
> with no evidence of kernel panic. Perhaps it is a memory or power issue,
> how can I monitor for the cause?
>

Before setting up any monitoring I would run a thorough hardware test.
Incidentally, who is the hardware manufacturer (just out of curiosity)?

The usual suspect is memory: poor or flaky memory, or a combination of
modules with slightly different specs. Mixed modules may work together and
still cause a failure very rarely, like once every 6 months, which is
really hard to troubleshoot, so just avoid mixing them.

Another possibility is tripping a temperature threshold set in the BIOS.
(A shutdown like that, BTW, leaves no tracks in crash dumps, logs, etc.)
Check the threshold and raise it by some 15-20 F (8 - 11 C). Incidentally,
which CPU(s) do you have? (I'm used to thinking that AMD will withstand any
abuse without failing, you can almost boil water on these, while Intels
are not as robust.)

What I would do is: open the box, leave only minimal hardware (run with a
minimal amount of RAM, remove all extra cards, etc.) and see whether the
problem shows up in this minimal configuration. If not, start adding
hardware back. Install all the RAM first and check that the box doesn't
crash. Run memtest86 at this point for at least 48 hours (or at the very
minimum 2-3 full loops of the test). In this configuration, run the system
under significant CPU load (several multi-threaded "buildworld" runs can
help with that) and simultaneously try to use all the RAM. Things behave
slightly differently under heavy load. And so on: add the rest of the
hardware piece by piece and test after each step.

One more thing: check that your power supply provides at least 30% more
power than all the hardware may need. Marginally insufficient power may
lead to unpredictable behavior on the PCI bus. Incidentally, how old is
the power supply (and the rest of the hardware)? Electrolytic capacitors
may lose capacitance with age and then no longer filter ripple well
enough, both on the power supply leads (capacitors inside the power
supply) and on the CPU power leads and PCI bus power lines (capacitors on
the system board; check that they are not showing traces of leakage).
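
The 30% power supply margin mentioned above is simple arithmetic; here is
a minimal sketch of it in Python. The function names and all wattage
figures are made up for illustration (they are not measurements from any
real box), so substitute the peak draw numbers from your own components'
spec sheets.

```python
# Rough power-budget check. All component wattages here are hypothetical
# examples; use the peak figures from your own hardware's documentation.
HEADROOM = 0.30  # the "at least 30% more" margin suggested above


def required_psu_watts(component_watts, headroom=HEADROOM):
    """Return the minimum PSU rating for the given per-component peak draws."""
    total = sum(component_watts.values())
    return total * (1.0 + headroom)


def psu_is_adequate(psu_rating_watts, component_watts, headroom=HEADROOM):
    """True if the PSU rating covers the total peak draw plus the headroom."""
    return psu_rating_watts >= required_psu_watts(component_watts, headroom)


if __name__ == "__main__":
    # Hypothetical server: names and numbers are placeholders.
    example = {
        "cpu": 95,
        "ram": 20,
        "disks": 30,
        "board_and_fans": 40,
        "pci_cards": 25,
    }
    needed = required_psu_watts(example)
    print("total peak draw: %d W" % sum(example.values()))
    print("PSU should be rated for at least %.0f W" % needed)
    print("400 W PSU adequate: %s" % psu_is_adequate(400, example))
```

Keep in mind that an aged power supply may no longer deliver its label
rating, so treat the result as a lower bound rather than a guarantee.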

Good luck.

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++