From owner-freebsd-questions@freebsd.org  Sat Aug 13 18:26:45 2016
Return-Path: <owner-freebsd-questions@freebsd.org>
Delivered-To: freebsd-questions@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6C8F0BB809A
 for <freebsd-questions@mailman.ysv.freebsd.org>;
 Sat, 13 Aug 2016 18:26:45 +0000 (UTC) (envelope-from wam@hiwaay.net)
Received: from fly.hiwaay.net (fly.hiwaay.net [216.180.54.1])
 (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id DFA401009
 for <freebsd-questions@freebsd.org>; Sat, 13 Aug 2016 18:26:44 +0000 (UTC)
 (envelope-from wam@hiwaay.net)
Received: from kabini1.local (dynamic-216-186-209-65.knology.net
 [216.186.209.65] (may be forged)) (authenticated bits=0)
 by fly.hiwaay.net (8.13.8/8.13.8/fly) with ESMTP id u7DIQa5d021955
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES128-SHA bits=128 verify=NO)
 for <freebsd-questions@freebsd.org>; Sat, 13 Aug 2016 13:26:37 -0500
Subject: Re: Monitoring server for crashes
References: <mailman.115.1471089602.58418.freebsd-questions@freebsd.org>
 <20160813234226.N79687@sola.nimnet.asn.au>
Cc: freebsd-questions@freebsd.org
From: "William A. Mahaffey III" <wam@hiwaay.net>
Message-ID: <398070bd-057f-55bb-2b17-4858f9450c5c@hiwaay.net>
Date: Sat, 13 Aug 2016 13:32:05 -0453.75
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <20160813234226.N79687@sola.nimnet.asn.au>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-questions@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: User questions <freebsd-questions.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-questions/>
List-Post: <mailto:freebsd-questions@freebsd.org>
List-Help: <mailto:freebsd-questions-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-questions>, 
 <mailto:freebsd-questions-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 13 Aug 2016 18:26:45 -0000

On 08/13/16 09:33, Ian Smith wrote:
> In freebsd-questions Digest, Vol 636, Issue 7, Message: 10
> On Fri, 12 Aug 2016 11:51:50 -0400 Robert Fitzpatrick <robert@webtent.org> wrote:
>   > Valeri Galtsev wrote:
>   > > Before doing such monitoring I would really do a good hardware test.
>   > > Incidentally, who is hardware manufacturer (just for my curiosity). The
>   > > usual suspects are: memory (poor/flaky memory, or combination of memory
>   > > with slightly different specs; these even though they may work together
>   > > can lead to failure sometimes very rarely, like once every 6 Months which
>   > > is really hard to troubleshoot: just avoid this). Another possibility:
>   > > tripping temperature threshold set in BIOS. (These, BTW will leave no
>   > > tracks in crash, logs etc.) Check this and bring threshold some 15-20 F (7
>   > > - 10 C) up.  Incidentally: which CPU(s) do you have? (I'm used to thinking
>   > > AMD will withstand any abuse without failing: you almost can boil water on
>   > > these, Intels are not as robust). What I would do is : open the box, leave
>   > > minimal hardware (run with minimal amount of RAM, remove all extra cards
>   > > etc) and see if you have problem with this minimal hardware configuration.
>   > > If not, start adding hardware, install all RAM first, test if it doesn't
>   > > crash. Run memtest86 at this point for at least 48 hours (or at the very
>   > > minimum 2-3 full loops of test). In this configuration try to run system
>   > > and create significant CPU load (several multi-thread "build world" can
>   > > help do that), and simultaneously try to use all the RAM. Things are
>   > > slightly different under heavy load. And so on - add the rest of hardware
>   > > and test... One more thing: check if your PS provides at least 30% more
>   > > power than all hardware may need. Marginally insufficient power may lead
>   > > to unpredictable things on the PCI bus. Incidentally, how old is the power
>   > > supply (and the rest of the hardware)? Electrolytic capacitors may lose
>   > > capacitance with age, thus not filtering ripple well enough on PS leads
>   > > (capacitors inside PS), on CPU power leads and on PCI bus power lines
>   > > (capacitors on system board - check that they show no traces of leakage).
>
> All good advice Valeri; not sure about messing with temps in BIOS though
> .. FreeBSD should be handling that ok via ACPI thermal Zones (versus
> _HOT and _CRT temperatures) which should cleanly shutdown at _CRT temp.
> That said, if it gets anywhere near that hot there's a serious issue ..
>
>   > Thanks for all the suggestions, will check temp and other info in BIOS
>   > tonight, I really can't have the server down for long memory test, will
>   > make sure all memory is the same. The server is IBM x3650 with 2 Quad
>   > Core Xeon L5420 a mixture of drives using hardware ServeRAID 8k and 12GB
>   > of RAM. I purchased second hand in 2011. I have a screenshot of the
>   > product data screen in the BIOS, it has a diagnostics date of Aug 2009
>   > in the BIOS, all hardware should be original except drives and memory.
>   > The load comes from a PostgreSQL database primarily, also provides DNS
>   > and LDAP services. Not sure heat is the issue, mainly happens at the
>   > same general time at night, heaviest load is definitely during the day.
>
> I guess you've checked with IBM re a BIOS update .. 2009 is a while ago.
>
> RAM rarely just 'goes bad', esp. on server grade gear, but "rarely
> happens" happens too.
>
> First thing I'd suspect at that age would be the power supply - can you
> swap it with another?  Quickest fix if it works - and it was needed.
>
> Second would be temperature, possibly fan/s - which is also the primary
> cause of blown P/S in my experience.  Below is a script I run from cron
> from 02:59 through 3:09 to record load averages and temperatures through
> daily maintenance from 3:01, every 10 seconds - for debugging a load
> average issue, not relevant here.  Or you can run it over SSH at home,
> and read the last entries over breakfast, whether it crashes or not ..
>
> The lack of any messages - and you should see one if ACPI thermal zone
> detection and forced shutdown is working properly - suggests more of a
> hardware seizure, but at 10 second testing you could see if temps (and
> load) were a problem prior to crash, at least if it happens in a window.
>
>   > I see now, most of the time it happens during dumping of the db each
>   > night, but it has happened once during the day and once a couple of
>   > hours before backup. I'm leaning toward a memory issue and will
>   > definitely visit the data center tonight and see the types. The db size
>   > has not changed much over time and this just started recently. It is a
>   > SpamAssassin/ClamAV db and purges, vacuums every night after dumping. I
>   > will disable and do dump manually tonight, 90% of the time it seems to
>   > be going down during backup of the largest db. Perhaps the crashes have
>   > caused a table to corrupt, I 'fsck -y' all mounts in single user mode
>   > after every crash.
>
> Do the fscks log success or any problems then?  If not, might be worth
> doing manual fsck to check?
>
> /etc/crontab:
> 59      2       *       *       *       root    /root/bin/loadavg_daily
>
> /root/bin/loadavg_daily:
> =======
> #!/bin/sh
> # 19Feb16 loadavg_daily .. every 10 seconds from 02:59 to 03:09 (run by cron)
> log='/root/loadavg_daily.log'
> secs=10
> i=0
> /root/bin/x200stat >> $log	# or something else, or nothing ..
> while [ $i -lt 60 ]; do
>          echo -n "`uptime`  " >> $log
>          echo "`sysctl -n hw.acpi.thermal.tz0.temperature`" \
>          "`sysctl -n hw.acpi.thermal.tz1.temperature`" >> $log
>          sleep $secs
>          i=$((i + 1))
> done
> /root/bin/x200stat >> $log
> echo >> $log
> =======
>
> Check sysctl hw.acpi.thermal for your thermal zones of interest.
>
> HTH, Ian

Out of curiosity, I tried one of the sysctl commands from the script above under 9.3R:

[wam@kabini1, ~, 1:30:25pm] 581 % sysctl -n hw.acpi.thermal.tz1.temperature
sysctl: unknown oid 'hw.acpi.thermal.tz1.temperature'
[wam@kabini1, ~, 1:30:46pm] 582 % uname -a
FreeBSD kabini1.local 9.3-RELEASE-p33 FreeBSD 9.3-RELEASE-p33 #0: Wed Jan 13 17:55:39 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
[wam@kabini1, ~, 1:31:58pm] 583 %
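
Since the script hard-codes tz0 and tz1, here is a minimal sketch that logs whichever zones the box actually reports instead. The function names thermal_zones and log_temps are my own (hypothetical), and it assumes zones appear as hw.acpi.thermal.tzN.temperature, as in Ian's script; I have not run it on other hardware.

```sh
#!/bin/sh
# Print the names of the ACPI thermal zones found in `sysctl hw.acpi.thermal`
# style output read from stdin, i.e. lines such as
# "hw.acpi.thermal.tz0.temperature: 45.0C". Prints nothing if there are none.
thermal_zones() {
    sed -n 's/^hw\.acpi\.thermal\.\(tz[0-9][0-9]*\)\.temperature:.*/\1/p' | sort -u
}

# Print one log line: uptime followed by the temperature of every zone that
# exists. On a machine with no thermal zones this is just the uptime.
log_temps() {
    zones=$(sysctl hw.acpi.thermal 2>/dev/null | thermal_zones)
    line=$(uptime)
    for z in $zones; do
        line="$line  $(sysctl -n "hw.acpi.thermal.$z.temperature")"
    done
    echo "$line"
}
```

In the loop of the script above, the two echo lines could then become a single `log_temps >> $log`, which avoids the "unknown oid" error on machines that lack one of the zones.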

When did it become available?

-- 

	William A. Mahaffey III

  ----------------------------------------------------------------------

	"The M1 Garand is without doubt the finest implement of war
	 ever devised by man."
                            -- Gen. George S. Patton Jr.