Date: Thu, 1 Jun 2017 10:03:27 -0500 (CDT)
From: "Valeri Galtsev" <galtsev@kicp.uchicago.edu>
To: "Ian Smith" <smithi@nimnet.asn.au>
Cc: "Raimo Niskanen" <raimo+freebsd@erix.ericsson.se>, freebsd-questions@freebsd.org
Subject: Re: Advice on kernel panics
Message-ID: <33501.128.135.52.6.1496329407.squirrel@cosmo.uchicago.edu>
In-Reply-To: <20170601235447.C98304@sola.nimnet.asn.au>
References: <mailman.103.1496318402.46813.freebsd-questions@freebsd.org> <20170601235447.C98304@sola.nimnet.asn.au>

On Thu, June 1, 2017 9:34 am, Ian Smith wrote:
> In freebsd-questions Digest, Vol 678, Issue 4, Message: 4
> On Thu, 1 Jun 2017 10:27:49 +0200 Raimo Niskanen
> <raimo+freebsd@erix.ericsson.se> wrote:
> > On Thu, Jun 01, 2017 at 12:10:30AM -0500, Doug McIntyre wrote:
> > > On Mon, May 29, 2017 at 11:20:43AM +0200, Raimo Niskanen wrote:
> > > > I have a server that panics about every 3 days and need some advice on how
> > > > to handle that.
> > >
> > > I'd expect it is some sort of hardware failure, as I would expect
> > > kernel panics more on the order of once a decade with FreeBSD. Ie.
> > > I've seen one or two on my hundred or so servers, but its pretty rare.
> > >
> > > Check and recheck your hardware items.
> >
> > I have removed one of four memory capsules - panicked again. Will rotate
> > through all of them...
> >
> > >
> > > Runup memtest86+. Check your drive hardware, turn on SMART checking.
> >
> > I have run memtest86+ over night - no errors found.
> >
> > I have installed smartmontools - no errors found, short and long self tests
> > on both disks run fine. zpool scrub repaired 0 errors and has no known data
> > errors.
>
> Everyone's suggesting hardware problems, and it's certainly worthwhile
> eliminating that possibility - but this could be a software/OS issue.

I would agree with Ian: it can be software, though that is less likely. A
few times I have seen a SCSI-attached external RAID (attached to an LSI SCSI
HBA) announce a change of its status (such as rebuild finished, or a drive
timed out/failed) which, coinciding with other traffic on the SCSI bus,
confused the adapter and led to a kernel panic.

That said, I would first check the hardware thoroughly. Andrea mentioned an
aged PS under heavy load, and these are prime suspects. Of all components,
electrolytic capacitors degrade the most; they may even leak, and once
degraded they don't filter ripple sufficiently, which leads to ripple beyond
tolerable levels at high currents. So:

1. Open the box and inspect the interior. System board ("motherboard" has
   been its jargon name for over 30 years): inspect the electrolytic
   capacitors around the CPU(s), and those that filter the PCI (or PCI-X,
   or PCI-E) bus power leads. If any of them are bulged, or even have
   traces of leaked electrolyte (usually a brown residue), throw away the
   system board. The model of your box falls into the time span when the
   worst electrolytic capacitors were used.

2. Re-seat all components (including expansion boards and memory; the CPU
   is less likely, but I would do that too), and disconnect and reconnect
   all connectors. Contacts, even gold plated, sometimes do oxidize.

3. Get a new power supply, not necessarily designed for this machine, but
   with the same connectors to the system board and with a higher power
   rating. Disconnect the box's own PS and power it from the new PS; see if
   it stops failing. (PSes do have electrolytic capacitors inside as well;
   other components do not degrade but do not die totally, except for ultra
   high frequency diodes and transistors, and very high voltage diodes.)

Good luck!

Valeri

>
> If it were me and hardware all checks out, I'd try posting the original
> report - plus other details about the box and setup that you've since
> mentioned - to freebsd-stable@, or maybe freebsd-fs@ since those fstat
> reports seem to point to possible FS/zfs issues? at a wild guess ..
>
> One other hardware tester you might try is sysutils/stress which can
> pound CPU, I/O, VM, disk as hard and for as long as you like, without
> having to bring the box down. I've used this lots to generate heavy
> loads. Keep a close eye on system temperatures during longer tests.
>
> Ah, just before posting, I see your latest with dmesg. Just on a quick
> scan, I wonder if these are a bad indication? Maybe just a side-issue,
> but powerd might not work, so again heat might be something to watch:
>
> est0: <Enhanced SpeedStep Frequency Control> on cpu0
> est: CPU supports Enhanced Speedstep, but is not recognized.
>
> cheers, Ian
> _______________________________________________
> freebsd-questions@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-questions
> To unsubscribe, send any mail to
> "freebsd-questions-unsubscribe@freebsd.org"
>

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++