Date: Tue, 31 May 2016 12:20:07 -0400 From: Dan Langille <dan@langille.org> To: Stephen McConnell <stephen.mcconnell@broadcom.com> Cc: freebsd-scsi@freebsd.org Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 Message-ID: <DD765878-E988-4B11-B4E6-6E10FEC5B5BE@langille.org> In-Reply-To: <068601d1bb57$f675f710$e361e530$@broadcom.com> References: <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> <F4C52A2E-AC8D-4A6D-BA93-0D96C9090251@langille.org> <068601d1bb57$f675f710$e361e530$@broadcom.com>
next in thread | previous in thread | raw e-mail | index | archive | help
> On May 31, 2016, at 12:17 PM, Stephen McConnell = <stephen.mcconnell@broadcom.com> wrote: >=20 >=20 >=20 >> -----Original Message----- >> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- >> scsi@freebsd.org] On Behalf Of Dan Langille >> Sent: Monday, May 30, 2016 12:28 PM >> To: freebsd-scsi@freebsd.org >> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 >>=20 >>> On Apr 25, 2016, at 12:38 PM, Stephen McConnell >> <stephen.mcconnell@broadcom.com> wrote: >>>=20 >>>=20 >>>=20 >>>> -----Original Message----- >>>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd- >>>> scsi@freebsd.org] On Behalf Of Dan Langille >>>> Sent: Monday, April 25, 2016 9:40 AM >>>> To: freebsd-scsi@freebsd.org >>>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0 >>>>=20 >>>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> = wrote: >>>>>=20 >>>>>>=20 >>>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> = wrote: >>>>>>=20 >>>>>> More of the pasted output is also at >>>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae >>>> = <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae> >>>> and added smartctl output. >>>>>>=20 >>>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI >>>>>> SAS2008 >>>> card. It's running a zfs root system. >>>>>>=20 >>>>>> This morning the system was unresponsive via ssh. Attempts to log >>>>>> in at >>>> the console did not yield a password prompt. >>>>>>=20 >>>>>> A power cycle brought the system online. Inspecting >>>>>> /var/log/messages, >>> I >>>> found about 63,000 entries similar to those which appear below. >>>>>>=20 >>>>>> zpool status of all are OK. A scrub is in progress for one pool >>>>>> (since >>> before >>>> this issue arose). da7 is in that pool. >>>>>>=20 >>>>>>=20 >>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 >>>>>> 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc = 804b >>>>>> scsi >>>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): >>>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID = 614 >>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew >>>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00 >>>>>> 20 >>>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer = 0 >>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 >>>>>> 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc >>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel: > (da7:mps1:0:17:0): >>>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674 >>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew >>>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 = 00 >>>>>> 18 >>>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer >>>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). = CDB: >>>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc >>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel: >>>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 >>>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 = Apr >>>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 = 8b >>>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b = scsi >>>>>> 0 state c xfer 0 >>>>>>=20 >>>>>> Is this a cabling issue? The drive is a SATA device (smartctl >>>>>> output >>> in the >>>> URL above). Anyone familiar with these errors? >>>>>=20 >>>>> This morning: >>>>>=20 >>>>> 13410079654596185797 REMOVED 0 0 0 was /dev/da7p3 >>>>>=20 >>>>> At least I know i'm looking for Serial Number: 13Q8PNBYS >>>>>=20 >>>>> =46rom the logs: >>>>>=20 >>>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 = lun >>>>> 0 Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> = s/n >>>> 13Q8PNBYS detached >>=20 >> Just for the record, this happened again this morning. Fixed by power > cycle. >>=20 >> May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending = reset >> for target ID 17 May 30 03:22:10 knew kernel: da7 at mps1 bus 0 = scbus1 > target >> 17 lun 0 >> May 30 03:22:10 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n >> 13Q8PNBYS detached >> May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 = 8c 5c >> 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 = state c > xfer >> 0 May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a = 00 6b >> bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi = 0 > state c >> xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 May = 30 >> 03:22:10 knew kernel: er 122880 >>=20 > I just realized that you're using mps, not mpr. The fix went into the = mpr > driver, but not mps yet. It'll have to be ported over to mps. This hit me again last night. Same drive again. Power cycle cleared = it. Now I'm wondering if it's heat or dud drive related. >=20 > Steve >=20 >>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 = 00 >>>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b = scsi 0 >>>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): >>>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204 >>>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): = READ(10). > CDB: >>> 28 >>>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25 > 05:34:51 >>>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally = Re-queue >>>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). = CDB: >> 28 >>>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc = 804b > scsi >>> 0 >>>> state c xfer(da7: 0 >>>>> Apr 25 05:34:51 knew kernel: mps1:0: (da7:mps1:0:17:0): = READ(10). >>> CDB: >>>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc = 804b >>> scsi 0 >>>> state c xfer 17:0 >>>>> Apr 25 05:34:51 knew kernel: 0): (da7:mps1:0:17:0): = WRITE(10). >>> CDB: 2a >>>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc = 804b > scsi >>> 0 >>>> state c xfeError 5, Periph was invalidated >>>>> Apr 25 05:34:51 knew kernel: r 0 >>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 = 00 >>>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel: >>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 = length >>>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer >>>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request = Apr 25 >>>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1: >>>>> mps1:0:IOCStatus =3D 0x4b while resetting device 0xa Apr 25 = 05:34:51 >>>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25 >>>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 = 05:34:51 >>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 = 00 00 >>>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 >>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 = 00 00 >>>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: >>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 >>>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 = 00 >>>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM = status: >>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel: >>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 = knew >>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 = 08 00 >>> Apr >>>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: = Unconditionally >>> Re- >>>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error = 5, >>>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR: >> Device >>>> swap: provider da7p2 disconnected. >>>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS > 'vdev >>> is >>>> removed, pool_guid=3D15378250086669402288 >>>> vdev_guid=3D13410079654596185797'' >>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed = Apr >>>>> 25 05:34:51 knew ZFS: vdev is removed, >> pool_guid=3D15378250086669402288 >>>>> vdev_guid=3D13410079654596185797 >>>>=20 >>>> Current status: after powering off the box, reseating the cables = for > that >>> drive, I >>>> powered up the system and a resilver commenced which completed in = 30 >>>> minutes. >>>>=20 >>>> Seems OK now. I am not sure if the two events are related. >>>=20 >>> Recently, a bug was uncovered where a device is gets 'lost'. >>>=20 >>> Here's what happens: >>> The first message in your "failure on Monday" log is for >>> 'mpssas_prepare_remove'. This message is likely logged because the = FW >> sends >>> an event to the driver that the device is no longer responsive = (pulled, >>> cable issue, or something else). When the driver gets this event, = it > sends >>> a reset to the device to clear out any pending I/O. This is where = the >>> 'terminated ioc' messages come from. When the reset completes, the > driver >>> is supposed to send a SAS_IO_UNIT message to FW so that the = DevHandle >> for >>> that disk is removed from FW's list. Then, when the device comes = back >>> on-line, everything is fine. But, with this bug, before that = SAS_IO_UNIT >>> message is sent to FW, the driver exits the function where that = happens >>> (mpssas_remove_device). This happens where you see the log message, >>> "IOCStatus - 0x4b while resetting device 0x0a". The driver logs = that >>> message and then exits. What the driver should do is log that = message > and >>> continue on to send the SAS_IO_UNIT message to FW. The fix is to = remove >> the >>> two lines in the driver shown here with '>>': >>>=20 >>> if (le16toh(reply->IOCStatus) !=3D MPI2_IOCSTATUS_SUCCESS) { >>> mps_printf(sc, "IOCStatus =3D 0x%x while resetting = device >>> 0x%x\n", >>> le16toh(reply->IOCStatus), handle); >>>>> mpssas_free_tm(sc, tm); >>>>> return; >>> } >>>=20 >>> A reboot will solve the problem, as you saw, but the real fix is to > remove >>> the DevHandle as described above. This fix will go into the driver > before >>> the next scheduled release and then MFC'd to 10.x. >>>=20 >>> Steve >>>=20 >>>>=20 >>>> -- >>>> Dan Langille - BSDCan / PGCon >>>> dan@langille.org >>>>=20 >>>>=20 >>>>=20 >>>>=20 >>>> _______________________________________________ >>>> freebsd-scsi@freebsd.org mailing list >>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >>>> To unsubscribe, send any mail to = "freebsd-scsi-unsubscribe@freebsd.org" >>>=20 >>=20 >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to = "freebsd-scsi-unsubscribe@freebsd.org" >=20
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?DD765878-E988-4B11-B4E6-6E10FEC5B5BE>