Date: Tue, 10 May 2016 06:10:00 -0500 From: Josh Paetzel <josh@tcbug.org> To: Stephen McConnell <stephen.mcconnell@broadcom.com> Cc: freebsd-scsi@freebsd.org, Scott Long <scott4long@yahoo.com> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 Message-ID: <0BA358CB-779B-460E-9FFC-FA033C136825@tcbug.org> In-Reply-To: <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com> References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com>
next in thread | previous in thread | raw e-mail | index | archive | help
> On Apr 25, 2016, at 12:30 PM, Scott Long via freebsd-scsi <freebsd-scsi@fr= eebsd.org> wrote: >=20 >=20 >> On Apr 25, 2016, at 10:38 AM, Stephen McConnell via freebsd-scsi <freebsd= -scsi@freebsd.org> wrote: >>=20 >>=20 >>=20 >>> -----Original Message----- >>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- >>> scsi@freebsd.org] On Behalf Of Dan Langille >>> Sent: Monday, April 25, 2016 9:40 AM >>> To: freebsd-scsi@freebsd.org >>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 >>>=20 >>>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> wrote: >>>>>=20 >>>>>=20 >>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> wrote: >>>>>=20 >>>>> More of the pasted output is also at >>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae >>> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae> >>> and added smartctl output. >>>>>=20 >>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI SAS2008= >>> card. It's running a zfs root system. >>>>>=20 >>>>> This morning the system was unresponsive via ssh. Attempts to log in a= t >>> the console did not yield a password prompt. >>>>>=20 >>>>> A power cycle brought the system online. Inspecting /var/log/messages= , >> I >>> found about 63,000 entries similar to those which appear below. >>>>>=20 >>>>> zpool status of all are OK. A scrub is in progress for one pool (since= >> before >>> this issue arose). da7 is in that pool. >>>>>=20 >>>>>=20 >>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 >>>>> 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b scsi >>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): >>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 >>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew >>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 20 >>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0 >>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 >>>>> 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc 804b >>>>> scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): >>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 >>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew >>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 18 >>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer >>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: >>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc >>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel: >>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 >>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr >>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b >>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi 0 >>>>> state c xfer 0 >>>>>=20 >>>>> Is this a cabling issue? The drive is a SATA device (smartctl output >> in the >>> URL above). Anyone familiar with these errors? >>>>=20 >>>> This morning: >>>>=20 >>>> 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3 >>>>=20 >>>> At least I know i'm looking for Serial Number: 13Q8PNBYS >>>>=20 >>>> =46rom the logs: >>>>=20 >>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun 0 >>>> Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n >>> 13Q8PNBYS detached >>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 >>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0 >>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): >>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 >>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB= : >> 28 >>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:5= 1 >>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue >>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 2= 8 >>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scs= i >> 0 >>> state c xfer(da7: 0 >>>> Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). >> CDB: >>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b >> scsi 0 >>> state c xfer 17:0 >>>> Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). >> CDB: 2a >>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scs= i >> 0 >>> state c xfeError 5, Periph was invalidated >>>> Apr 25 05:34:51 knew kernel: r 0 >>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 >>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel: >>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length >>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer >>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25 >>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1: >>>> mps1:0:IOCStatus =3D 0x4b while resetting device 0xa Apr 25 05:34:51 >>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25 >>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51 >>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 >>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 >>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 >>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 >>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 >>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew >>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 >> Apr >>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally >> Re- >>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, >>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: Device >>> swap: provider da7p2 disconnected. >>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vde= v >> is >>> removed, pool_guid=3D15378250086669402288 >>> vdev_guid=3D13410079654596185797'' >>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr >>>> 25 05:34:51 knew ZFS: vdev is removed, pool_guid=3D15378250086669402288= >>>> vdev_guid=3D13410079654596185797 >>>=20 >>> Current status: after powering off the box, reseating the cables for tha= t >> drive, I >>> powered up the system and a resilver commenced which completed in 30 >>> minutes. >>>=20 >>> Seems OK now. I am not sure if the two events are related. >>=20 >> Recently, a bug was uncovered where a device is gets 'lost'. >>=20 >> Here's what happens: >> The first message in your "failure on Monday" log is for >> 'mpssas_prepare_remove'. This message is likely logged because the FW se= nds >> an event to the driver that the device is no longer responsive (pulled, >> cable issue, or something else). When the driver gets this event, it sen= ds >> a reset to the device to clear out any pending I/O. This is where the >> 'terminated ioc' messages come from. When the reset completes, the drive= r >> is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle for= >> that disk is removed from FW's list. Then, when the device comes back >> on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT >> message is sent to FW, the driver exits the function where that happens >> (mpssas_remove_device). This happens where you see the log message, >> "IOCStatus - 0x4b while resetting device 0x0a". The driver logs that >> message and then exits. What the driver should do is log that message an= d >> continue on to send the SAS_IO_UNIT message to FW. The fix is to remove t= he >> two lines in the driver shown here with '>>': >>=20 >> if (le16toh(reply->IOCStatus) !=3D MPI2_IOCSTATUS_SUCCESS) { >> mps_printf(sc, "IOCStatus =3D 0x%x while resetting device >> 0x%x\n", >> le16toh(reply->IOCStatus), handle); >>>> mpssas_free_tm(sc, tm); >>>> return; >> } >>=20 >> A reboot will solve the problem, as you saw, but the real fix is to remov= e >> the DevHandle as described above. This fix will go into the driver befor= e >> the next scheduled release and then MFC'd to 10.x. >=20 > Thanks for the diagnosis, Steve. I forgot about that case. We should als= o make > this chain of events more evident in the syslog, it=E2=80=99s very confusi= ng when it happens. > I=E2=80=99m not exactly sure yet what it should look like. >=20 > Scott >=20 Steve, We are seeing this same error case for mpr and mps in FreeNAS. We are decoup= led from FreeBSD releases and have an auto-update mechanism. I'd be very happy to test patches against stable/10 on our nightly users.=20=
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?0BA358CB-779B-460E-9FFC-FA033C136825>