Date: Wed, 06 Feb 2019 18:27:43 +0000
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 235559] 12.0-STABLE panics on mps drive problem (regression from 11.2 and double-regression from 11.1)
Message-ID: <bug-235559-227@https.bugs.freebsd.org/bugzilla/>

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235559

            Bug ID: 235559
           Summary: 12.0-STABLE panics on mps drive problem (regression from 11.2 and double-regression from 11.1)
           Product: Base System
           Version: 12.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: karl@denninger.net

Created attachment 201796
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=201796&action=edit
Core from latest kernel panic

On 11.1, this system was completely stable.

I upgraded to 11.2 and started getting CAM timeouts / retries, which I started a thread about at
https://lists.freebsd.org/pipermail/freebsd-stable/2019-February/090520.html

Note that the card firmware is 19.00.00.00; running 20.00.07.00 (latest available), instead of CAM problems with individual drives I get controller resets, which are *far* worse as the impact is not local. In no case, however, has data been corrupted -- ZFS is happy with the data and shows no pack errors of any sort, nor do the disks themselves using smartctl. The retries are successful.

The configuration is an LSI 8-port HBA with a Lenovo 24-port expander attached to one of the LSI connectors; the other has the boot drives on it, as the system and card firmware cannot boot from the expander. This configuration has been stable for the last several years and up to 11.1-STABLE was flawless. The drives themselves, the backplanes to which they attach, the power supply, HBA, SAS expander and cables have all been swapped out with spares here without any change in behavior. The motherboard itself is a XEON with ECC and no RAM errors are being logged. (It's thus reasonable to assume this isn't a hardware problem....)

The stall and retry itself looks an awful lot like a queued command is being missed or an interrupt lost, both under very heavy load.
This typically occurs only when the drives in question are slammed at 100% utilization or nearly so for an extended period of time (e.g. during a scrub or resilver). I have seen it on both HGST and Seagate drives of differing capacities, models and firmware revision numbers; it does not appear to be related to the disk model or firmware itself.

In an attempt to see if this was related to something in 11.2 I rolled the machine forward to 12.0-STABLE. On 12.0-STABLE, r343809, this same condition, rather than producing console logs and a successful retry, instead results in a kernel panic in the driver. The disk I/O in process at the time is a ZFS scrub and the drive in question is pure data -- it has no executables on it, and in fact the pool has no mounted filesystems at the time of the panic (it's a backup pool that is imported to serve as a destination for zfs sends used as a means of backup).

I have ordered a pair of HBA 16i cards in order to get the expander out of the case in the hope that will stop the detach events, although I am completely lost in terms of why 11.2 and 12.0 will not run with that configuration where it was entirely stable over the last several releases up through 11.1 with uptimes measured in months; until 11.2 I had never seen even a single panic out of the disk subsystem on this configuration.

Note that if you have all disks attached to the mps driver you can't take a kernel core dump when it happens; any attempt to do so results in a double-panic out of the driver. I have temporarily attached a drive to the onboard SATA ports and set it as dumpdev so as to be able to get the core file.

The panic itself bodes poorly for the impact of potential disk problems (real ones) where a drive goes offline when attached to the mps driver in 12.0, thus this bug report in an attempt to figure out this regression.

-- 
You are receiving this mail because:
You are the assignee for the bug.