From owner-freebsd-questions@freebsd.org Fri Aug 12 18:07:55 2016
From: Ultima
Date: Fri, 12 Aug 2016 14:07:53 -0400
Subject: Re: Monitoring server for crashes
Cc: Robert Fitzpatrick, FreeBSD
In-Reply-To: <11590.128.135.52.6.1471018231.squirrel@cosmo.uchicago.edu>
References: <57ADDA5F.4000405@webtent.org> <61294.128.135.52.6.1471013465.squirrel@cosmo.uchicago.edu> <57ADF096.8010608@webtent.org> <11590.128.135.52.6.1471018231.squirrel@cosmo.uchicago.edu>

Please provide the exact version of FreeBSD. I recall an issue in 10.2
involving a cron job, with these exact symptoms, that was fixed by
updating. I doubt this is the problem, but more precise version
information can help narrow down software-related issues.

On Fri, Aug 12, 2016 at 12:10 PM, Valeri Galtsev wrote:
>
> On Fri, August 12, 2016 10:51 am, Robert Fitzpatrick wrote:
> > Valeri Galtsev wrote:
> >> Before doing such monitoring I would really do a good hardware test.
> >> Incidentally, who is the hardware manufacturer (just for my
> >> curiosity)? The usual suspects are: memory (poor/flaky memory, or a
> >> combination of memory with slightly different specs; even though
> >> such modules may work together, they can lead to failures that
> >> occur very rarely, like once every 6 months, which is really hard
> >> to troubleshoot: just avoid this). Another possibility: tripping
> >> the temperature threshold set in the BIOS. (These, BTW, will leave
> >> no tracks in crash dumps, logs etc.) Check this and raise the
> >> threshold by some 15-20 F (roughly 8-11 C). Incidentally: which
> >> CPU(s) do you have? (I'm used to thinking AMD will withstand any
> >> abuse without failing: you can almost boil water on these; Intels
> >> are not as robust.) What I would do is: open the box, leave minimal
> >> hardware (run with a minimal amount of RAM, remove all extra cards
> >> etc.) and see if you have the problem with this minimal hardware
> >> configuration. If not, start adding hardware: install all the RAM
> >> first and test that it doesn't crash. Run memtest86 at this point
> >> for at least 48 hours (or at the very minimum 2-3 full loops of the
> >> test). In this configuration try to run the system and create
> >> significant CPU load (several multi-threaded "build world" runs can
> >> help do that), and simultaneously try to use all the RAM. Things
> >> are slightly different under heavy load. And so on: add the rest of
> >> the hardware and test... One more thing: check that your PS
> >> provides at least 30% more power than all the hardware may need.
> >> Marginally insufficient power may lead to unpredictable behavior on
> >> the PCI bus. Incidentally, how old is the power supply (and the
> >> rest of the hardware)? Electrolytic capacitors may lose capacitance
> >> with age, and then no longer filter ripple well enough on the PS
> >> leads (capacitors inside the PS), on the CPU power leads and on the
> >> PCI bus power lines (capacitors on the system board: check that
> >> they show no traces of leakage).
> >>
> >
> > Thanks for all the suggestions, will check temp and other info in
> > the BIOS tonight. I really can't have the server down for a long
> > memory test, but will make sure all memory is the same. The server
> > is an IBM x3650 with 2 quad-core Xeon L5420 CPUs, a mixture of
> > drives on a hardware ServeRAID 8k controller, and 12GB of RAM.
>
> Sounds like memory under heavy load. I definitely would:
>
> 1. Re-seat all RAM modules.
>
> 2. While doing 1, check that all modules are the same brand and the
> same part number. I don't remember offhand whether your CPU has its
> own memory controller (as in AMD Opterons) or uses the older "memory
> bus" shared by all CPUs, with the memory controller sitting on the
> system board. In the latter case I would just stick an extra fan on
> that memory controller chip. If the memory controllers are on the CPU
> dies, then make sure that the memory modules connected to a given CPU
> are the same; they can be [somewhat] different from the ones connected
> to a different CPU. Basically: all RAM modules connected to the same
> memory controller should be the same.
>
> Do I understand correctly: this machine (purchased used) originally
> ran without problems for you (for multiple months), right?
>
> One more thing I wouldn't exclude: a used system board may have a
> fried PCI-express slot; if you have something in it, the machine will
> be flaky. I had it once ;-( If you can remove everything, or just move
> extra cards to different slots, this may help you to test this.
>
> Good luck!
>
> > I purchased it second hand in 2011. I have a screenshot of the
> > product data screen in the BIOS; it shows a diagnostics date of Aug
> > 2009, and all hardware should be original except drives and memory.
> > The load comes primarily from a PostgreSQL database; the server also
> > provides DNS and LDAP services. Not sure heat is the issue: it
> > mainly happens at the same general time at night, while the heaviest
> > load is definitely during the day.
> >
> > I see now, most of the time it happens during dumping of the db each
> > night, but it has happened once during the day and once a couple of
> > hours before backup. I'm leaning toward a memory issue and will
> > definitely visit the data center tonight and see the types. The db
> > size has not changed much over time and this just started recently.
> > It is a SpamAssassin/ClamAV db that purges and vacuums every night
> > after dumping. I will disable that and do the dump manually tonight;
> > 90% of the time it seems to go down during backup of the largest db.
> > Perhaps the crashes have caused a table to corrupt. I 'fsck -y' all
> > mounts in single user mode after every crash.
> >
> > --
> > Robert
> >
> > _______________________________________________
> > freebsd-questions@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> > To unsubscribe, send any mail to
> > "freebsd-questions-unsubscribe@freebsd.org"
> >
>
> ++++++++++++++++++++++++++++++++++++++++
> Valeri Galtsev
> Sr System Administrator
> Department of Astronomy and Astrophysics
> Kavli Institute for Cosmological Physics
> University of Chicago
> Phone: 773-702-4247
> ++++++++++++++++++++++++++++++++++++++++
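
P.S. On the version request at the top of this message: the two
commands I would normally run by hand are `freebsd-version -ku` and
`uname -a`. Below is a minimal sketch that bundles the same information
into something easy to paste into a reply. The function names
(collect_version_info, format_report) are made up for illustration, and
it assumes Python 3.7 or later is installed on the box. It leaves the
hostname out of the uname fields on purpose.

```python
import platform
import shutil
import subprocess


def collect_version_info():
    """Return a dict of version strings worth pasting into a report."""
    u = platform.uname()
    # Hostname (u.node) is deliberately omitted from the summary.
    info = {"uname": f"{u.system} {u.release} {u.version} {u.machine}"}

    # freebsd-version(1) ships with FreeBSD 10.0 and later. -k reports the
    # installed kernel, -u the installed userland, patch level included.
    tool = shutil.which("freebsd-version")
    if tool is not None:
        result = subprocess.run(
            [tool, "-ku"], capture_output=True, text=True, check=False
        )
        fields = result.stdout.split()
        # Expect exactly two values: kernel first, then userland.
        if result.returncode == 0 and len(fields) == 2:
            info["kernel"], info["userland"] = fields
    return info


def format_report(info):
    """Render the collected values as 'key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in info.items())


if __name__ == "__main__":
    print(format_report(collect_version_info()))
```

On a non-FreeBSD machine the freebsd-version step is simply skipped and
only the uname line is printed. The patch level (the -pN suffix) is the
part that matters most when checking whether a known fix is already in
place.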