Date:      Wed, 06 Feb 2019 18:27:43 +0000
From:      bugzilla-noreply@freebsd.org
To:        bugs@FreeBSD.org
Subject:   [Bug 235559] 12.0-STABLE panics on mps drive problem (regression from 11.2 and double-regression from 11.1)
Message-ID:  <bug-235559-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235559

            Bug ID: 235559
           Summary: 12.0-STABLE panics on mps drive problem (regression
                    from 11.2 and double-regression from 11.1)
           Product: Base System
           Version: 12.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: karl@denninger.net

Created attachment 201796
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=201796&action=edit
Core from latest kernel panic

On 11.1, this system was completely stable.

I upgraded to 11.2 and started getting CAM timeouts / retries, which I started
a thread on at
https://lists.freebsd.org/pipermail/freebsd-stable/2019-February/090520.html

Note that the card firmware is 19.00.00.00.  Running 20.00.07.00 (the latest
available), instead of CAM problems with individual drives I get controller
resets, which are *far* worse as the impact is not local.  In no case, however,
has data been corrupted -- ZFS is happy with the data and shows no pack errors
of any sort, nor do the disks themselves using smartctl.  The retries are
successful.

The configuration is an LSI 8-port HBA with a Lenovo 24-port expander attached
to one of the LSI connectors; the other has the boot drives on it, as the
system and card firmware cannot boot from the expander.  This configuration has
been stable for the last several years and up to 11.1-STABLE was flawless.  The
drives themselves, the backplanes to which they attach, power supply, HBA, SAS
expander and cables have all been swapped out with spares here without any
change in behavior.  The motherboard itself is a Xeon board with ECC, and no
RAM errors are being logged.  (It's thus reasonable to assume this isn't a
hardware problem....)

The stall and retry itself looks an awful lot like a queued command is being
missed or an interrupt lost, in either case under very heavy load.  This
typically occurs only when the drives in question are slammed at 100%
utilization or nearly so for an extended period of time (e.g. during a scrub
or resilver).  I have seen it on both HGST and Seagate drives of differing
capacities, models and firmware revision numbers; it does not appear to be
related to the disk model or firmware itself.

In an attempt to see if this was related to something in 11.2 I rolled the
machine forward to 12.0-STABLE.  On 12.0-STABLE, r343809, this same condition,
rather than producing console logs and a successful retry, instead results in
a kernel panic in the driver.  The disk I/O in progress at the time is a ZFS
scrub and the drive in question is pure data -- it has no executables on it,
and in fact the pool has no mounted filesystems at the time of the panic (it's
a backup pool that is imported to serve as a destination for zfs sends used as
a means of backup).

I have ordered a pair of HBA 16i cards in order to get the expander out of the
case in the hope that will stop the detach events, although I am completely
lost in terms of why 11.2 and 12.0 will not run with that configuration when
it was entirely stable over the last several releases up through 11.1 with
uptimes measured in months; until 11.2 I had never seen even a single panic out
of the disk subsystem on this configuration.

Note that if you have all disks attached to the mps driver you can't take a
kernel core dump when it happens; any attempt to do so results in a
double-panic out of the driver.  I have temporarily attached a drive to the
onboard SATA ports and set it as dumpdev so as to be able to get the core file.

The panic itself bodes poorly for the impact of potential disk problems (real
ones) where a drive goes offline when attached to the mps driver in 12.0, thus
this bug report in an attempt to figure out this regression.

-- 
You are receiving this mail because:
You are the assignee for the bug.


