FreeBSD Mail Archives

Date:      Wed, 06 Feb 2019 18:27:43 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 235559] 12.0-STABLE panics on mps drive problem (regression from 11.2 and double-regression from 11.1)
Message-ID:  <bug-235559-227@https.bugs.freebsd.org/bugzilla/>


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235559

            Bug ID: 235559
           Summary: 12.0-STABLE panics on mps drive problem (regression
                    from 11.2 and double-regression from 11.1)
           Product: Base System
           Version: 12.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: karl@denninger.net

Created attachment 201796
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=201796&action=edit
Core from latest kernel panic

On 11.1, this system was completely stable.

I upgraded to 11.2 and started getting CAM timeouts / retries, which I
described in a thread at
https://lists.freebsd.org/pipermail/freebsd-stable/2019-February/090520.html

Note that the card firmware is 19.00.00.00.  With 20.00.07.00 (the latest
available), instead of CAM problems with individual drives I get controller
resets, which are *far* worse because the impact is not local.  In no case,
however, has data been corrupted -- ZFS is happy with the data and shows no
pack errors of any sort, and smartctl reports no errors on the disks
themselves.  The retries are successful.

The configuration is an LSI 8-port HBA with a Lenovo 24-port expander attached
to one of the LSI connectors; the other has the boot drives on it, as the
system and card firmware cannot boot from the expander.  This configuration has
been stable for the last several years and was flawless up to 11.1-STABLE.  The
drives themselves, the backplanes to which they attach, the power supply, HBA,
SAS expander and cables have all been swapped out with spares here without any
change in behavior.  The motherboard is a Xeon board with ECC RAM, and no RAM
errors are being logged.  (It's thus reasonable to assume this isn't a hardware
problem....)

The stall and retry itself looks an awful lot like a queued command being
missed or an interrupt being lost under very heavy load.  It typically occurs
only when the drives in question are slammed at 100% utilization or nearly so
for an extended period of time (e.g. during a scrub or resilver).  I have seen
it on both HGST and Seagate drives of differing capacities, models and firmware
revisions; it does not appear to be related to the disk model or firmware
itself.

In an attempt to see if this was related to something in 11.2 I rolled the
machine forward to 12.0-STABLE.  On 12.0-STABLE, r343809, this same condition
no longer produces console logs and a successful retry; instead it results in a
kernel panic in the driver.  The disk I/O in progress at the time is a ZFS
scrub and the drive in question is pure data -- it has no executables on it,
and in fact the pool has no mounted filesystems at the time of the panic (it's
a backup pool that is imported to serve as a destination for zfs sends used as
a means of backup).

I have ordered a pair of HBA 16i cards in order to get the expander out of the
case in the hope that will stop the detach events, although I am at a loss as
to why 11.2 and 12.0 will not run with a configuration that was entirely
stable over the last several releases up through 11.1, with uptimes measured in
months; until 11.2 I had never seen even a single panic out of the disk
subsystem on this configuration.

Note that if you have all disks attached to the mps driver you can't take a
kernel core dump when it happens; any attempt to do so results in a
double-panic out of the driver.  I have temporarily attached a drive to the
onboard SATA ports and set it as dumpdev so as to be able to get the core file.
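
For anyone reproducing this workaround, a minimal sketch of the relevant
/etc/rc.conf settings follows.  The device name ada0p2 is an assumption for
illustration only (a swap-sized partition on a disk attached to the onboard
SATA controller rather than to mps); the report does not state which device
was actually used.

```sh
# /etc/rc.conf fragment: send kernel crash dumps to a disk that is NOT
# behind the mps HBA, so the dump path does not go through the driver
# that is panicking.
# ASSUMPTION: /dev/ada0p2 is a hypothetical partition on an onboard SATA
# disk, large enough to hold the dump.
dumpdev="/dev/ada0p2"
# Directory where savecore(8) writes the recovered dump on the next boot.
dumpdir="/var/crash"
```

The same device can be selected on a running system with dumpon(8), e.g.
`dumpon /dev/ada0p2`, and `dumpon -l` lists the dump device currently in use.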

The panic itself bodes poorly for the impact of real disk problems, where a
drive attached to the mps driver goes offline on 12.0; hence this bug report,
in an attempt to track down the regression.

-- 
You are receiving this mail because:
You are the assignee for the bug.
