Subject: Re: Monitoring server for crashes
References: <20160813234226.N79687@sola.nimnet.asn.au>
Cc: freebsd-questions@freebsd.org
From: "William A. Mahaffey III"
Message-ID: <398070bd-057f-55bb-2b17-4858f9450c5c@hiwaay.net>
Date: Sat, 13 Aug 2016 13:32:05 -0453.75
In-Reply-To: <20160813234226.N79687@sola.nimnet.asn.au>

On 08/13/16 09:33, Ian Smith wrote:
> In freebsd-questions Digest, Vol 636, Issue 7, Message: 10
> On Fri, 12 Aug 2016 11:51:50 -0400 Robert Fitzpatrick wrote:
> > Valeri Galtsev wrote:
> > > Before doing such monitoring I would really do a good hardware test.
> > > Incidentally, who is hardware manufacturer (just for my curiosity). The
> > > usual suspects are: memory (poor/flaky memory, or combination of memory
> > > with slightly different specs; these even though they may work together
> > > can lead to failure sometimes very rarely, like once every 6 Months which
> > > is really hard to troubleshoot: just avoid this). Another possibility:
> > > tripping temperature threshold set in BIOS. (These, BTW will leave no
> > > tracks in crash, logs etc.) Check this and bring threshold some 15-20 F (7
> > > - 10 C ) up. Incidentally: which CPU(s) do you have? (I'm used to think,
> > > AMD will withstand any abuse without failing: you almost can boil water on
> > > these, Intels are not as robust). What I would do is: open the box, leave
> > > minimal hardware (run with minimal amount of RAM, remove all extra cards
> > > etc) and see if you have problem with this minimal hardware configuration.
> > > If not, start adding hardware, install all RAM first, test if it doesn't
> > > crash. Run memtest86 at this point for at least 48 hours (or at the very
> > > minimum 2-3 full loops of test). In this configuration try to run system
> > > and create significant CPU load (several multi-thread "build world" can
> > > help do that), and simultaneously try to use all the RAM. Things are
> > > slightly different under heavy load. And so on - add the rest of hardware
> > > and test... One more thing: check if your PS provides at least 30% more
> > > power than all hardware may need. Marginally insufficient power may lead
> > > to unpredictable things on PCI bus. Incidentally, how old is power supply
> > > (and the rest of hardware). Electrolytic capacitors may lose capacitance
> > > with age, thus not filtering well enough ripple on PS leads (capacitors
> > > inside PS), on CPU power leads and on PCI bus power lines (capacitors on
> > > system board - check that they are not showing traces of leakage).
>
> All good advice Valeri; not sure about messing with temps in BIOS though
> .. FreeBSD should be handling that ok via ACPI thermal Zones (versus
> _HOT and _CRT temperatures) which should cleanly shutdown at _CRT temp.
> That said, if it gets anywhere near that hot there's a serious issue ..
>
> > Thanks for all the suggestions, will check temp and other info in BIOS
> > tonight, I really can't have the server down for long memory test, will
> > make sure all memory is the same. The server is IBM x3650 with 2 Quad
> > Core Xeon L5420 a mixture of drives using hardware ServeRAID 8k and 12GB
> > of RAM. I purchased second hand in 2011. I have a screenshot of the
> > product data screen in the BIOS, it has a diagnostics date of Aug 2009
> > in the BIOS, all hardware should be original except drives and memory.
> > The load comes from a PostgreSQL database primarily, also provides DNS
> > and LDAP services. Not sure heat is the issue, mainly happens at the
> > same general time at night, heaviest load is definitely during the day.
>
> I guess you've checked with ibm re a BIOS update .. 2009 is a while ago.
>
> Apart from RAM, which rarely just 'goes bad' esp. on server grade gear,
> but "rarely happens" happens too.
>
> First thing I'd suspect at that age would be the power supply - can you
> swap it with another? Quickest fix if it works - and it was needed.
>
> Second would be temperature, possibly fan/s - which is also the primary
> cause of blown P/S in my experience. Below is a script I run from cron
> from 02:59 through 3:09 to record load averages and temperatures through
> daily maintenance from 3:01, every 10 seconds - for debugging a load
> average issue, not relevant here. Or you can run it over SSH at home,
> and read the last entries over breakfast, whether it crashes or not ..
>
> The lack of any messages - and you should see one if ACPI thermal zone
> detection and forced shutdown is working properly - suggests more of a
> hardware seizure, but at 10 second testing you could see if temps (and
> load) were a problem prior to crash, at least if it happens in a window.
>
> > I see now, most of the time it happens during dumping of the db each
> > night, but it has happened once during the day and once a couple of
> > hours before backup. I'm leaning toward a memory issue and will
> > definitely visit the data center tonight and see the types. The db size
> > has not changed much over time and this just started recently. It is a
> > SpamAssassin/ClamAV db and purges, vacuums every night after dumping. I
> > will disable and do dump manually tonight, 90% of the time it seems to
> > be going down during backup of the largest db. Perhaps the crashes have
> > caused a table to corrupt, I 'fsck -y' all mounts in single user mode
> > after every crash.
>
> Do the fscks log success or any problems then? If not, might be worth
> doing manual fsck to check?
>
> /etc/crontab:
> 59 2 * * * root /root/bin/loadavg_daily
>
> /root/bin/loadavg_daily:
> =======
> #!/bin/sh
> # 19Feb16 loadavg_daily .. every 10 seconds from 02:59 to 03:09 (run by cron)
> log='/root/loadavg_daily.log'
> secs=10
> i=0
> /root/bin/x200stat >> $log # or something else, or nothing ..
> while [ $i -lt 60 ]; do
>     echo -n "`uptime` " >> $log
>     echo "`sysctl -n hw.acpi.thermal.tz0.temperature`" \
>          "`sysctl -n hw.acpi.thermal.tz1.temperature`" >> $log
>     sleep $secs
>     i=$((i + 1))
> done
> /root/bin/x200stat >> $log
> echo >> $log
> =======
>
> Check sysctl hw.acpi.thermal for your thermal zones of interest.
>
> HTH, Ian
>

Out of curiosity, I tried the above command under 9.3R:

[wam@kabini1, ~, 1:30:25pm] 581 % sysctl -n hw.acpi.thermal.tz1.temperature
sysctl: unknown oid 'hw.acpi.thermal.tz1.temperature'
[wam@kabini1, ~, 1:30:46pm] 582 % uname -a
FreeBSD kabini1.local 9.3-RELEASE-p33 FreeBSD 9.3-RELEASE-p33 #0: Wed Jan 13 17:55:39 UTC 2016 root@amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC amd64
[wam@kabini1, ~, 1:31:58pm] 583 %

When did it become available?

--

William A. Mahaffey III

----------------------------------------------------------------------

"The M1 Garand is without doubt the finest implement of war
 ever devised by man."
                        -- Gen. George S. Patton Jr.
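
On a machine like kabini1, where hw.acpi.thermal.tz1.temperature is reported as an unknown oid, the two fixed sysctl lines in Ian's loop would write errors rather than temperatures. One possible workaround is to probe for zones instead of hard-coding tz0 and tz1. What follows is only a sketch, assuming sysctl exits non-zero for an unknown oid (which the error output above suggests); thermal_temps, SYSCTL and MAX_ZONES are names made up for this example and are not part of any FreeBSD tool.

```shell
#!/bin/sh
# Sketch only: print the temperature of every ACPI thermal zone that exists,
# silently skipping zones whose oid is unknown.
# SYSCTL and MAX_ZONES are hypothetical knobs introduced for this example.
SYSCTL=${SYSCTL:-sysctl}
MAX_ZONES=${MAX_ZONES:-8}

thermal_temps() {
    out=''
    z=0
    while [ "$z" -lt "$MAX_ZONES" ]; do
        # Assumption: a failed lookup returns non-zero; its error text is discarded.
        if t=$($SYSCTL -n "hw.acpi.thermal.tz$z.temperature" 2>/dev/null); then
            out="$out${out:+ }tz$z=$t"
        fi
        z=$((z + 1))
    done
    # Make it explicit in the log when no zone is exported at all.
    echo "${out:-no-thermal-zones}"
}
```

In the loop quoted above, the two sysctl lines could then be replaced by a single echo "`thermal_temps`" >> $log, so the same script runs unchanged on boxes with zero, one or several thermal zones.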