From: Robert Fitzpatrick
Date: Fri, 12 Aug 2016 11:51:50 -0400
To: galtsev@kicp.uchicago.edu
CC: FreeBSD
Subject: Re: Monitoring server for crashes
Message-ID: <57ADF096.8010608@webtent.org>
In-Reply-To: <61294.128.135.52.6.1471013465.squirrel@cosmo.uchicago.edu>

Valeri Galtsev wrote:
> Before doing such monitoring I would really do a good hardware test.
> Incidentally, who is the hardware manufacturer (just for my curiosity)?
> The usual suspects are: memory (poor/flaky memory, or a combination of
> memory modules with slightly different specs; even though these may
> work together, they can fail very rarely, like once every 6 months,
> which is really hard to troubleshoot: just avoid this). Another
> possibility: tripping the temperature threshold set in the BIOS.
> (These, BTW, will leave no traces in crash dumps, logs, etc.) Check
> this and bring the threshold some 15-20 F (7 - 10 C) up. Incidentally:
> which CPU(s) do you have? (I tend to think AMD will withstand any abuse
> without failing: you can almost boil water on these; Intels are not as
> robust.)
>
> What I would do is: open the box, leave minimal hardware (run with a
> minimal amount of RAM, remove all extra cards, etc.) and see if you
> have the problem with this minimal hardware configuration. If not,
> start adding hardware: install all the RAM first and test that it
> doesn't crash. Run memtest86 at this point for at least 48 hours (or at
> the very minimum 2-3 full loops of the test). In this configuration try
> to run the system and create significant CPU load (several
> multi-threaded "build world" runs can help do that), and simultaneously
> try to use all the RAM. Things are slightly different under heavy load.
> And so on - add the rest of the hardware and test...
>
> One more thing: check that your PS provides at least 30% more power
> than all the hardware may need. Marginally insufficient power may lead
> to unpredictable things on the PCI bus. Incidentally, how old is the
> power supply (and the rest of the hardware)?
> Electrolytic capacitors may lose capacitance with age, thus not
> filtering ripple well enough on the PS leads (capacitors inside the
> PS), on the CPU power leads and on the PCI bus power lines (capacitors
> on the system board - check that they show no traces of leakage).
>

Thanks for all the suggestions. I will check the temperature and other
info in the BIOS tonight. I really can't have the server down for a long
memory test, but I will make sure all the memory is the same.

The server is an IBM x3650 with 2 quad-core Xeon L5420 CPUs, a mixture
of drives on a hardware ServeRAID 8k controller, and 12GB of RAM. I
purchased it second hand in 2011. I have a screenshot of the product
data screen in the BIOS; it shows a diagnostics date of Aug 2009, and
all hardware should be original except the drives and memory. The load
comes primarily from a PostgreSQL database; the server also provides DNS
and LDAP services.

I'm not sure heat is the issue: the crashes mainly happen at the same
general time at night, while the heaviest load is definitely during the
day. I see now that most of the time it happens during the nightly dump
of the db, but it has happened once during the day and once a couple of
hours before the backup. I'm leaning toward a memory issue and will
definitely visit the data center tonight and check the module types.

The db size has not changed much over time, and this just started
recently. It is a SpamAssassin/ClamAV db that purges and vacuums every
night after dumping. I will disable the scheduled dump and run it
manually tonight; 90% of the time the server seems to go down during the
backup of the largest db. Perhaps the crashes have corrupted a table; I
'fsck -y' all mounts in single-user mode after every crash.

--
Robert
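
The "90% of the time it goes down during backup" estimate could be checked against actual crash times. The following is only a minimal sketch: the backup window and the crash timestamps are hypothetical placeholders, not data from this server, and would need to be filled in from your own records.

```python
from datetime import datetime, time


def in_window(ts: datetime, start: time, end: time) -> bool:
    """Return True if ts falls inside the daily window [start, end).

    The window may wrap past midnight (e.g. 23:00 to 01:00).
    """
    t = ts.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end


def fraction_in_window(crashes, start: time, end: time) -> float:
    """Return the share of crash timestamps that fall inside the window."""
    if not crashes:
        return 0.0
    hits = sum(in_window(c, start, end) for c in crashes)
    return hits / len(crashes)


# Hypothetical backup window: nightly dump assumed to run 01:00-03:00.
BACKUP_START = time(1, 0)
BACKUP_END = time(3, 0)

# Hypothetical crash times; replace with the real ones.
crashes = [
    datetime(2016, 8, 1, 1, 40),
    datetime(2016, 8, 4, 2, 10),
    datetime(2016, 8, 7, 14, 5),
    datetime(2016, 8, 10, 23, 15),
]

share = fraction_in_window(crashes, BACKUP_START, BACKUP_END)
print(f"{share:.0%} of crashes fall inside the backup window")
```

If the share stays high after the dump is moved to a different hour, that points at the dump itself (I/O and memory pressure) rather than the time of day.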