Date: Fri, 12 Aug 2016 14:07:53 -0400
From: Ultima <ultima1252@gmail.com>
Cc: Robert Fitzpatrick <robert@webtent.org>, FreeBSD <freebsd-questions@freebsd.org>
Subject: Re: Monitoring server for crashes
Message-ID: <CANJ8om67oVywpw_YMhypawzFQjbAHZsvcVi1GD8J89R0g2vSYg@mail.gmail.com>
In-Reply-To: <11590.128.135.52.6.1471018231.squirrel@cosmo.uchicago.edu>
References: <57ADDA5F.4000405@webtent.org> <61294.128.135.52.6.1471013465.squirrel@cosmo.uchicago.edu> <57ADF096.8010608@webtent.org> <11590.128.135.52.6.1471018231.squirrel@cosmo.uchicago.edu>
Please provide the exact version of FreeBSD. I recall an issue in 10.2
involving a cron job that produced these exact symptoms; it was fixed by
updating. I doubt this is the problem; however, more precise version
information can help narrow down software-related issues.

On Fri, Aug 12, 2016 at 12:10 PM, Valeri Galtsev <galtsev@kicp.uchicago.edu> wrote:

>
> On Fri, August 12, 2016 10:51 am, Robert Fitzpatrick wrote:
> > Valeri Galtsev wrote:
> >> Before doing such monitoring I would really do a good hardware test.
> >> Incidentally, who is the hardware manufacturer (just for my curiosity)?
> >> The usual suspects are: memory (poor/flaky memory, or a combination of
> >> memory modules with slightly different specs; even though these may
> >> work together, they can lead to failures very rarely, like once every
> >> 6 months, which is really hard to troubleshoot: just avoid this).
> >> Another possibility: tripping the temperature threshold set in the
> >> BIOS. (These, BTW, will leave no tracks in crash dumps, logs, etc.)
> >> Check this and bring the threshold up some 15-20 F (8-11 C).
> >> Incidentally: which CPU(s) do you have? (I'm used to thinking AMD
> >> will withstand any abuse without failing: you can almost boil water
> >> on these; Intels are not as robust.) What I would do is: open the
> >> box, leave minimal hardware (run with a minimal amount of RAM, remove
> >> all extra cards, etc.) and see if you have the problem with this
> >> minimal hardware configuration. If not, start adding hardware:
> >> install all RAM first and test that it doesn't crash. Run memtest86
> >> at this point for at least 48 hours (or at the very minimum 2-3 full
> >> loops of the test). In this configuration try to run the system and
> >> create significant CPU load (several multi-threaded "build world"
> >> runs can help do that), and simultaneously try to use all the RAM.
> >> Things are slightly different under heavy load. And so on - add the
> >> rest of the hardware and test...
> >> One more thing: check if your PS provides at least 30% more power
> >> than all the hardware may need. Marginally insufficient power may
> >> lead to unpredictable things on the PCI bus. Incidentally, how old is
> >> the power supply (and the rest of the hardware)? Electrolytic
> >> capacitors may lose capacitance with age, thus not filtering ripple
> >> well enough on PS leads (capacitors inside the PS), on CPU power
> >> leads and on PCI bus power lines (capacitors on the system board -
> >> check that they are not showing traces of leakage).
> >>
> >
> > Thanks for all the suggestions, will check temp and other info in BIOS
> > tonight. I really can't have the server down for a long memory test;
> > will make sure all memory is the same. The server is an IBM x3650 with
> > 2 Quad Core Xeon L5420, a mixture of drives using hardware ServeRAID 8k
> > and 12GB of RAM.
>
> Sounds like memory under heavy load. I definitely would:
>
> 1. Re-seat all RAM modules.
>
> 2. While doing 1, check that all modules are the same brand and same
> part number. I don't remember offhand if your CPU has its own memory
> controller (like in AMD Opterons) or if it is the older "memory bus" used
> by all CPUs, with the memory controller sitting on the system board. In
> the latter case I would just stick an extra fan on that memory controller
> chip. If the memory controllers are on the CPU dies, then make sure that
> the memory modules connected to a given CPU are the same; they can be
> [somewhat] different from the ones connected to a different CPU.
> Basically: all RAM modules connected to the same memory controller should
> be the same.
>
> Do I get it correctly: this machine (purchased used) originally ran
> without problems for you (for multiple months), right?
>
> One more thing I wouldn't exclude: a used system board may have a fried
> PCI-express slot; if you have something in it, the machine will be flaky.
> I had it once ;-( If you can remove everything, or just move extra cards
> to different slots, this may help you to test this.
>
> Good luck!
>
> > I purchased second hand in 2011. I have a screenshot of the
> > product data screen in the BIOS, it has a diagnostics date of Aug 2009
> > in the BIOS, all hardware should be original except drives and memory.
> > The load comes from a PostgreSQL database primarily, also provides DNS
> > and LDAP services. Not sure heat is the issue, mainly happens at the
> > same general time at night, heaviest load is definitely during the day.
> >
> > I see now, most of the time it happens during dumping of the db each
> > night, but it has happened once during the day and once a couple of
> > hours before backup. I'm leaning toward a memory issue and will
> > definitely visit the data center tonight and see the types. The db size
> > has not changed much over time and this just started recently. It is a
> > SpamAssassin/ClamAV db and purges, vacuums every night after dumping. I
> > will disable and do dump manually tonight, 90% of the time it seems to
> > be going down during backup of the largest db. Perhaps the crashes have
> > caused a table to corrupt, I 'fsck -y' all mounts in single user mode
> > after every crash.
> >
> > --
> > Robert
> >
> > _______________________________________________
> > freebsd-questions@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> > To unsubscribe, send any mail to
> > "freebsd-questions-unsubscribe@freebsd.org"
> >
>
>
> ++++++++++++++++++++++++++++++++++++++++
> Valeri Galtsev
> Sr System Administrator
> Department of Astronomy and Astrophysics
> Kavli Institute for Cosmological Physics
> University of Chicago
> Phone: 773-702-4247
> ++++++++++++++++++++++++++++++++++++++++
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to "freebsd-questions-unsubscribe@freebsd.org"
>
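
The power-supply rule of thumb quoted in this thread (the PS should
provide at least 30% more power than all the hardware may need) is simple
arithmetic, so here is a rough sketch of it in Python. The function name
and the wattage figures are made up for illustration; they are not
measurements of the IBM x3650 discussed here, and real component draw
should come from the vendor's specifications.

```python
def psu_has_headroom(psu_watts, component_watts, margin=0.30):
    """Return (ok, required_watts) for the '30% more than needed' rule.

    psu_watts       -- rated output of the power supply, in watts
    component_watts -- estimated peak draw of each component, in watts
    margin          -- extra capacity wanted on top of the total (0.30 = 30%)
    """
    required = sum(component_watts) * (1 + margin)
    return psu_watts >= required, required


if __name__ == "__main__":
    # Hypothetical peak draws: CPUs, RAM, drives, RAID card, fans/board.
    estimated = [100, 200, 100]
    ok, required = psu_has_headroom(600, estimated)
    print("required watts with margin:", required)
    print("power supply is adequate:", ok)
```

An aged supply may deliver less than its rated output, so a result that
only just passes is probably worth treating as a failure.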