Date: Tue, 31 May 2016 10:17:41 -0600 From: "Stephen McConnell" <stephen.mcconnell@broadcom.com> To: "'Dan Langille'" <dan@langille.org>, <freebsd-scsi@freebsd.org> Subject: RE: terminated ioc 804b scsi 0 state c xfer 0 Message-ID: <068601d1bb57$f675f710$e361e530$@broadcom.com> In-Reply-To: <F4C52A2E-AC8D-4A6D-BA93-0D96C9090251@langille.org> References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> <F4C52A2E-AC8D-4A6D-BA93-0D96C9090251@langille.org>
next in thread | previous in thread | raw e-mail | index | archive | help
> -----Original Message----- > From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- > scsi@freebsd.org] On Behalf Of Dan Langille > Sent: Monday, May 30, 2016 12:28 PM > To: freebsd-scsi@freebsd.org > Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 > > > On Apr 25, 2016, at 12:38 PM, Stephen McConnell > <stephen.mcconnell@broadcom.com> wrote: > > > > > > > >> -----Original Message----- > >> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- > >> scsi@freebsd.org] On Behalf Of Dan Langille > >> Sent: Monday, April 25, 2016 9:40 AM > >> To: freebsd-scsi@freebsd.org > >> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 > >> > >>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> wrote: > >>> > >>>> > >>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> wrote: > >>>> > >>>> More of the pasted output is also at > >> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae > >> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae> > >> and added smartctl output. > >>>> > >>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI > >>>> SAS2008 > >> card. It's running a zfs root system. > >>>> > >>>> This morning the system was unresponsive via ssh. Attempts to log > >>>> in at > >> the console did not yield a password prompt. > >>>> > >>>> A power cycle brought the system online. Inspecting > >>>> /var/log/messages, > > I > >> found about 63,000 entries similar to those which appear below. > >>>> > >>>> zpool status of all are OK. A scrub is in progress for one pool > >>>> (since > > before > >> this issue arose). da7 is in that pool. > >>>> > >>>> > >>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 > >>>> 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc 804b > >>>> scsi > >>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): > >>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614 > >>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew > >>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 > >>>> 20 > >>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0 > >>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 > >>>> 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc > >>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): > >>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 > >>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew > >>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00 > >>>> 18 > >>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer > >>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: > >>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc > >>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel: > >>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 > >>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 Apr > >>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b > >>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b scsi > >>>> 0 state c xfer 0 > >>>> > >>>> Is this a cabling issue? The drive is a SATA device (smartctl > >>>> output > > in the > >> URL above). Anyone familiar with these errors? > >>> > >>> This morning: > >>> > >>> 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3 > >>> > >>> At least I know i'm looking for Serial Number: 13Q8PNBYS > >>> > >>> From the logs: > >>> > >>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun > >>> 0 Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n > >> 13Q8PNBYS detached > > Just for the record, this happened again this morning. Fixed by power cycle. > > May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending reset > for target ID 17 May 30 03:22:10 knew kernel: da7 at mps1 bus 0 scbus1 target > 17 lun 0 > May 30 03:22:10 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n > 13Q8PNBYS detached > May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c > 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 state c xfer > 0 May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 6b > bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi 0 state c > xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 May 30 > 03:22:10 knew kernel: er 122880 > I just realized that you're using mps, not mpr. The fix went into the mpr driver, but not mps yet. It'll have to be ported over to mps. Steve > >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > >>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b scsi 0 > >>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): > >>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 > >>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB: > > 28 > >> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 05:34:51 > >> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue > >> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: > 28 > >> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b scsi > > 0 > >> state c xfer(da7: 0 > >>> Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): READ(10). > > CDB: > >> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc 804b > > scsi 0 > >> state c xfer 17:0 > >>> Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): WRITE(10). > > CDB: 2a > >> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b scsi > > 0 > >> state c xfeError 5, Periph was invalidated > >>> Apr 25 05:34:51 knew kernel: r 0 > >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 > >>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel: > >>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 length > >>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer > >>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr 25 > >>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1: > >>> mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51 > >>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25 > >>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 05:34:51 > >>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 00 00 > >>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 > >>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00 00 > >>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 > >>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 00 > >>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: > >> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: > >> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 knew > >> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 > > Apr > >> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally > > Re- > >> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error 5, > >> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: > Device > >> swap: provider da7p2 disconnected. > >>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS 'vdev > > is > >> removed, pool_guid=15378250086669402288 > >> vdev_guid=13410079654596185797'' > >>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed Apr > >>> 25 05:34:51 knew ZFS: vdev is removed, > pool_guid=15378250086669402288 > >>> vdev_guid=13410079654596185797 > >> > >> Current status: after powering off the box, reseating the cables for that > > drive, I > >> powered up the system and a resilver commenced which completed in 30 > >> minutes. > >> > >> Seems OK now. I am not sure if the two events are related. > > > > Recently, a bug was uncovered where a device is gets 'lost'. > > > > Here's what happens: > > The first message in your "failure on Monday" log is for > > 'mpssas_prepare_remove'. This message is likely logged because the FW > sends > > an event to the driver that the device is no longer responsive (pulled, > > cable issue, or something else). When the driver gets this event, it sends > > a reset to the device to clear out any pending I/O. This is where the > > 'terminated ioc' messages come from. When the reset completes, the driver > > is supposed to send a SAS_IO_UNIT message to FW so that the DevHandle > for > > that disk is removed from FW's list. Then, when the device comes back > > on-line, everything is fine. But, with this bug, before that SAS_IO_UNIT > > message is sent to FW, the driver exits the function where that happens > > (mpssas_remove_device). This happens where you see the log message, > > "IOCStatus - 0x4b while resetting device 0x0a". The driver logs that > > message and then exits. What the driver should do is log that message and > > continue on to send the SAS_IO_UNIT message to FW. The fix is to remove > the > > two lines in the driver shown here with '>>': > > > > if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) { > > mps_printf(sc, "IOCStatus = 0x%x while resetting device > > 0x%x\n", > > le16toh(reply->IOCStatus), handle); > >>> mpssas_free_tm(sc, tm); > >>> return; > > } > > > > A reboot will solve the problem, as you saw, but the real fix is to remove > > the DevHandle as described above. This fix will go into the driver before > > the next scheduled release and then MFC'd to 10.x. > > > > Steve > > > >> > >> -- > >> Dan Langille - BSDCan / PGCon > >> dan@langille.org > >> > >> > >> > >> > >> _______________________________________________ > >> freebsd-scsi@freebsd.org mailing list > >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?068601d1bb57$f675f710$e361e530$>