Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 31 May 2016 12:20:07 -0400
From:      Dan Langille <dan@langille.org>
To:        Stephen McConnell <stephen.mcconnell@broadcom.com>
Cc:        freebsd-scsi@freebsd.org
Subject:   Re: terminated ioc 804b scsi 0 state c xfer 0
Message-ID:  <DD765878-E988-4B11-B4E6-6E10FEC5B5BE@langille.org>
In-Reply-To: <068601d1bb57$f675f710$e361e530$@broadcom.com>
References:  <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> <F4C52A2E-AC8D-4A6D-BA93-0D96C9090251@langille.org> <068601d1bb57$f675f710$e361e530$@broadcom.com>

next in thread | previous in thread | raw e-mail | index | archive | help
> On May 31, 2016, at 12:17 PM, Stephen McConnell =
<stephen.mcconnell@broadcom.com> wrote:
>=20
>=20
>=20
>> -----Original Message-----
>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
>> scsi@freebsd.org] On Behalf Of Dan Langille
>> Sent: Monday, May 30, 2016 12:28 PM
>> To: freebsd-scsi@freebsd.org
>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>>=20
>>> On Apr 25, 2016, at 12:38 PM, Stephen McConnell
>> <stephen.mcconnell@broadcom.com> wrote:
>>>=20
>>>=20
>>>=20
>>>> -----Original Message-----
>>>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
>>>> scsi@freebsd.org] On Behalf Of Dan Langille
>>>> Sent: Monday, April 25, 2016 9:40 AM
>>>> To: freebsd-scsi@freebsd.org
>>>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
>>>>=20
>>>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> =
wrote:
>>>>>=20
>>>>>>=20
>>>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> =
wrote:
>>>>>>=20
>>>>>> More of the pasted output is also at
>>>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
>>>> =
<https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>;
>>>> and added smartctl output.
>>>>>>=20
>>>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI
>>>>>> SAS2008
>>>> card.  It's running a zfs root system.
>>>>>>=20
>>>>>> This morning the system was unresponsive via ssh. Attempts to log
>>>>>> in at
>>>> the console did not yield a password prompt.
>>>>>>=20
>>>>>> A power cycle brought the system online.  Inspecting
>>>>>> /var/log/messages,
>>> I
>>>> found about 63,000 entries similar to those which appear below.
>>>>>>=20
>>>>>> zpool status of all are OK. A scrub is in progress for one pool
>>>>>> (since
>>> before
>>>> this issue arose). da7 is in that pool.
>>>>>>=20
>>>>>>=20
>>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
>>>>>> 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc =
804b
>>>>>> scsi
>>>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
>>>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID =
614
>>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00
>>>>>> 20
>>>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer =
0
>>>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
>>>>>> 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel:
> (da7:mps1:0:17:0):
>>>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
>>>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
>>>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 =
00
>>>>>> 18
>>>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
>>>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). =
CDB:
>>>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
>>>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
>>>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0 =
Apr
>>>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 =
8b
>>>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b =
scsi
>>>>>> 0 state c xfer 0
>>>>>>=20
>>>>>> Is this a cabling issue?  The drive is a SATA device (smartctl
>>>>>> output
>>> in the
>>>> URL above).  Anyone familiar with these errors?
>>>>>=20
>>>>> This morning:
>>>>>=20
>>>>> 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
>>>>>=20
>>>>> At least I know i'm looking for Serial Number: 13Q8PNBYS
>>>>>=20
>>>>> =46rom the logs:
>>>>>=20
>>>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 =
lun
>>>>> 0 Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> =
s/n
>>>> 13Q8PNBYS detached
>>=20
>> Just for the record, this happened again this morning. Fixed by power
> cycle.
>>=20
>> May 30 03:22:08 knew kernel: mps1: mpssas_prepare_remove: Sending =
reset
>> for target ID 17 May 30 03:22:10 knew kernel: da7 at mps1 bus 0 =
scbus1
> target
>> 17 lun 0
>> May 30 03:22:10 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0> s/n
>> 13Q8PNBYS detached
>> May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 =
8c 5c
>> 91 c0 00 00 08 00 length 4096 SMID 179 terminated ioc 804b scsi 0 =
state c
> xfer
>> 0 May 30 03:22:10 knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a =
00 6b
>> bf db a0 00 00 f0 00 length 122880 SMID 938 terminated ioc 804b scsi =
0
> state c
>> xf(da7:mps1:0:17:0): READ(10). CDB: 28 00 8c 5c 91 c0 00 00 08 00 May =
30
>> 03:22:10 knew kernel: er 122880
>>=20
> I just realized that you're using mps, not mpr.  The fix went into the =
mpr
> driver, but not mps yet.  It'll have to be ported over to mps.

This hit me again last night.  Same drive again.  Power cycle cleared =
it.

Now I'm wondering if it's heat or dud drive related.

>=20
> Steve
>=20
>>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 =
00
>>>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b =
scsi 0
>>>>> state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
>>>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
>>>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): =
READ(10).
> CDB:
>>> 28
>>>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25
> 05:34:51
>>>> knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally =
Re-queue
>>>> Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). =
CDB:
>> 28
>>>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc =
804b
> scsi
>>> 0
>>>> state c xfer(da7: 0
>>>>> Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): =
READ(10).
>>> CDB:
>>>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc =
804b
>>> scsi 0
>>>> state c xfer 17:0
>>>>> Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): =
WRITE(10).
>>> CDB: 2a
>>>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc =
804b
> scsi
>>> 0
>>>> state c xfeError 5, Periph was invalidated
>>>>> Apr 25 05:34:51 knew kernel: r 0
>>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 =
00
>>>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
>>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00 =
length
>>>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
>>>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request =
Apr 25
>>>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
>>>>> mps1:0:IOCStatus =3D 0x4b while resetting device 0xa Apr 25 =
05:34:51
>>>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
>>>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25 =
05:34:51
>>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33 26 d8 =
00 00
>>>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 =
00 00
>>>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
>>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
>>>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90 =
00
>>>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM =
status:
>>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
>>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51 =
knew
>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 =
08 00
>>> Apr
>>>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: =
Unconditionally
>>> Re-
>>>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error =
5,
>>>> Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR:
>> Device
>>>> swap: provider da7p2 disconnected.
>>>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS
> 'vdev
>>> is
>>>> removed, pool_guid=3D15378250086669402288
>>>> vdev_guid=3D13410079654596185797''
>>>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed =
Apr
>>>>> 25 05:34:51 knew ZFS: vdev is removed,
>> pool_guid=3D15378250086669402288
>>>>> vdev_guid=3D13410079654596185797
>>>>=20
>>>> Current status: after powering off the box, reseating the cables =
for
> that
>>> drive, I
>>>> powered up the system and a resilver commenced which completed in =
30
>>>> minutes.
>>>>=20
>>>> Seems OK now.  I am not sure if the two events are related.
>>>=20
>>> Recently, a bug was uncovered where a device is gets 'lost'.
>>>=20
>>> Here's what happens:
>>> The first message in your "failure on Monday"  log is for
>>> 'mpssas_prepare_remove'.  This message is likely logged because the =
FW
>> sends
>>> an event to the driver that the device is no longer responsive =
(pulled,
>>> cable issue, or something else).  When the driver gets this event, =
it
> sends
>>> a reset to the device to clear out any pending I/O.  This is where =
the
>>> 'terminated ioc' messages come from.  When the reset completes, the
> driver
>>> is supposed to send a SAS_IO_UNIT message to FW so that the =
DevHandle
>> for
>>> that disk is removed from FW's list.  Then, when the device comes =
back
>>> on-line, everything is fine. But, with this bug, before that =
SAS_IO_UNIT
>>> message is sent to FW, the driver exits the function where that =
happens
>>> (mpssas_remove_device).  This happens where you see the log message,
>>> "IOCStatus - 0x4b while resetting device 0x0a".  The driver logs =
that
>>> message and then exits.  What the driver should do is log that =
message
> and
>>> continue on to send the SAS_IO_UNIT message to FW.  The fix is to =
remove
>> the
>>> two lines in the driver shown here with '>>':
>>>=20
>>> 	if (le16toh(reply->IOCStatus) !=3D MPI2_IOCSTATUS_SUCCESS) {
>>> 		mps_printf(sc, "IOCStatus =3D 0x%x while resetting =
device
>>> 0x%x\n",
>>> 		   le16toh(reply->IOCStatus), handle);
>>>>> 		mpssas_free_tm(sc, tm);
>>>>> 		return;
>>> 	}
>>>=20
>>> A reboot will solve the problem, as you saw, but the real fix is to
> remove
>>> the DevHandle as described above.  This fix will go into the driver
> before
>>> the next scheduled release and then MFC'd to 10.x.
>>>=20
>>> Steve
>>>=20
>>>>=20
>>>> --
>>>> Dan Langille - BSDCan / PGCon
>>>> dan@langille.org
>>>>=20
>>>>=20
>>>>=20
>>>>=20
>>>> _______________________________________________
>>>> freebsd-scsi@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>> To unsubscribe, send any mail to =
"freebsd-scsi-unsubscribe@freebsd.org"
>>>=20
>>=20
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to =
"freebsd-scsi-unsubscribe@freebsd.org"
>=20




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?DD765878-E988-4B11-B4E6-6E10FEC5B5BE>