FreeBSD Mail Archives

Date:      Thu, 1 Jun 2017 10:03:27 -0500 (CDT)
From:      "Valeri Galtsev" <galtsev@kicp.uchicago.edu>
To:        "Ian Smith" <smithi@nimnet.asn.au>
Cc:        "Raimo Niskanen" <raimo+freebsd@erix.ericsson.se>, freebsd-questions@freebsd.org
Subject:   Re: Advice on kernel panics
Message-ID:  <33501.128.135.52.6.1496329407.squirrel@cosmo.uchicago.edu>
In-Reply-To: <20170601235447.C98304@sola.nimnet.asn.au>
References:  <mailman.103.1496318402.46813.freebsd-questions@freebsd.org>    <20170601235447.C98304@sola.nimnet.asn.au>

On Thu, June 1, 2017 9:34 am, Ian Smith wrote:
> In freebsd-questions Digest, Vol 678, Issue 4, Message: 4
> On Thu, 1 Jun 2017 10:27:49 +0200 Raimo Niskanen
> <raimo+freebsd@erix.ericsson.se> wrote:
>  > On Thu, Jun 01, 2017 at 12:10:30AM -0500, Doug McIntyre wrote:
>  > > On Mon, May 29, 2017 at 11:20:43AM +0200, Raimo Niskanen wrote:
>  > > > I have a server that panics about every 3 days and need some
>  > > > advice on how to handle that.
>  > >
>  > > I'd expect it is some sort of hardware failure, as I would expect
>  > > kernel panics more on the order of once a decade with FreeBSD. I.e.
>  > > I've seen one or two on my hundred or so servers, but it's pretty
>  > > rare.
>  > >
>  > > Check and recheck your hardware items.
>  >
>  > I have removed one of four memory capsules - panicked again.  Will
>  > rotate through all of them...
>  >
>  > >
>  > > Runup memtest86+. Check your drive hardware, turn on SMART checking.
>  >
>  > I have run memtest86+ over night - no errors found.
>  >
>  > I have installed smartmontools - no errors found, short and long self
>  > tests on both disks run fine.  zpool scrub repaired 0 errors and has no
>  > known data errors.
>
> Everyone's suggesting hardware problems, and it's certainly worthwhile
> eliminating that possibility - but this could be a software/OS issue.

I would agree with Ian: it can be software, though that is less likely. I
have seen a few times that an external SCSI-attached RAID (attached to an
LSI SCSI HBA) announced a change of its status (such as a rebuild finishing
or a drive timing out/failing), which, coinciding with other traffic on the
SCSI bus, confused the adapter and led to a kernel panic.

That said, I would first check the hardware thoroughly. Andrea mentioned an
aged PS under heavy load, and those are prime suspects. Of all components,
electrolytic capacitors are the ones that degrade most and may even leak.
Once degraded, they don't filter ripple sufficiently, so ripple goes beyond
what is tolerable at high currents. So:

1. Open the box and inspect the interior. On the system board
("motherboard" has been its jargon name for over 30 years), inspect the
electrolytic capacitors around the CPU(s) and those that filter the PCI (or
PCI-X, or PCI-E) bus power leads. If any of them are bulged, or even show
traces of leaked electrolyte (usually a brown residue), throw away the
system board. The model of your box falls into the time span when the worst
electrolytic capacitors were used.

2. Re-seat all components (including expansion boards and memory; the CPU
is less likely, but I would do that too), and disconnect and reconnect all
connectors. Contacts, even gold plated ones, sometimes do oxidize.

3. Get a new power supply, not necessarily designed for this machine, but
with the same connectors to the system board and with a higher power
rating. Disconnect the box's own PS and power it from the new PS; see if it
stops failing. (PSes have electrolytic capacitors inside as well. Other
components generally do not degrade gradually, they fail outright, except
for ultra high frequency diodes and transistors, and very high voltage
diodes.)

Good luck!

Valeri

>
> If it were me and hardware all checks out, I'd try posting the original
> report - plus other details about the box and setup that you've since
> mentioned - to freebsd-stable@, or maybe freebsd-fs@ since those fstat
> reports seem to point to possible FS/zfs issues? at a wild guess ..
>
> One other hardware tester you might try is sysutils/stress which can
> pound CPU, I/O, VM, disk as hard and for as long as you like, without
> having to bring the box down.  I've used this lots to generate heavy
> loads.  Keep a close eye on system temperatures during longer tests.
>
> Ah, just before posting, I see your latest with dmesg.  Just on a quick
> scan, I wonder if these are a bad indication?  Maybe just a side-issue,
> but powerd might not work, so again heat might be something to watch:
>
>  est0: <Enhanced SpeedStep Frequency Control> on cpu0
>  est: CPU supports Enhanced Speedstep, but is not recognized.
>
> cheers, Ian
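
On Ian's point about keeping an eye on temperatures during longer stress
runs, here is a minimal sketch of one way to do it. It assumes that
coretemp(4) or amdtemp(4) is loaded so that sysctl exposes
dev.cpu.N.temperature values of the form "45.0C". The function names and the
80 C limit are hypothetical and chosen only for illustration; check the
limit against your CPU's specification.

```python
import re
import subprocess

# Assumed line format (with coretemp/amdtemp loaded):
#   dev.cpu.0.temperature: 45.0C
TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([0-9.]+)C\s*$")


def parse_cpu_temps(sysctl_output):
    """Return {cpu_index: degrees_C} parsed from `sysctl dev.cpu` output."""
    temps = {}
    for line in sysctl_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def hot_cpus(temps, limit_c=80.0):
    """Return the sorted CPU indexes at or above limit_c (illustrative limit)."""
    return sorted(cpu for cpu, temp in temps.items() if temp >= limit_c)


def read_cpu_temps():
    """Run `sysctl dev.cpu` and parse any temperature lines it reports."""
    result = subprocess.run(
        ["sysctl", "dev.cpu"], capture_output=True, text=True, check=True
    )
    return parse_cpu_temps(result.stdout)
```

Calling hot_cpus(read_cpu_temps()) from a loop with a sleep between
iterations, while the stress run is going, would report which cores (if any)
are at or above the limit.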

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?33501.128.135.52.6.1496329407.squirrel>
