FreeBSD Mail Archives

Date:      Tue, 10 May 2016 09:15:18 -0600
From:      "Stephen McConnell" <stephen.mcconnell@broadcom.com>
To:        "'Josh Paetzel'" <josh@tcbug.org>
Cc:        <freebsd-scsi@freebsd.org>, "'Scott Long'" <scott4long@yahoo.com>
Subject:   RE: terminated ioc 804b scsi 0 state c xfer 0
Message-ID:  <22c001d1aace$c41ffa70$4c5fef50$@broadcom.com>
In-Reply-To: <0BA358CB-779B-460E-9FFC-FA033C136825@tcbug.org>
References:  <2E8752E5-76AF-4042-86D9-8C6733658A80@langille.org> <5EEF0794-B06E-4A72-89DA-7DCD94AE1FC6@langille.org> <072CEC8B-9392-4378-8DF5-63D05901850B@langille.org> <0d7401d19f10$ee329300$ca97b900$@broadcom.com> <3D06CE25-4159-4F30-A8C5-8A188144681B@yahoo.com> <0BA358CB-779B-460E-9FFC-FA033C136825@tcbug.org>

index | next in thread | previous in thread | raw e-mail




> -----Original Message-----
> From: Josh Paetzel [mailto:josh@tcbug.org]
> Sent: Tuesday, May 10, 2016 5:10 AM
> To: Stephen McConnell
> Cc: freebsd-scsi@freebsd.org; Scott Long
> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
> 
> 
> 
> > On Apr 25, 2016, at 12:30 PM, Scott Long via freebsd-scsi <freebsd-
> scsi@freebsd.org> wrote:
> >
> >
> >> On Apr 25, 2016, at 10:38 AM, Stephen McConnell via freebsd-scsi
> <freebsd-scsi@freebsd.org> wrote:
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
> >>> scsi@freebsd.org] On Behalf Of Dan Langille
> >>> Sent: Monday, April 25, 2016 9:40 AM
> >>> To: freebsd-scsi@freebsd.org
> >>> Subject: Re: terminated ioc 804b scsi 0 state c xfer 0
> >>>
> >>>>> On Apr 25, 2016, at 8:17 AM, Dan Langille <dan@langille.org> wrote:
> >>>>>
> >>>>>
> >>>>> On Apr 24, 2016, at 9:35 AM, Dan Langille <dan@langille.org> wrote:
> >>>>>
> >>>>> More of the pasted output is also at
> >>> https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae
> >>>
> <https://gist.github.com/dlangille/1fa3135334089c6603e2ec5da946d9ae>;
> >>> and added smartctl output.
> >>>>>
> >>>>> I have a FreeBSD 10.2-RELEASE-p14 box in which there is an LSI
> >>>>> SAS2008
> >>> card.  It's running a zfs root system.
> >>>>>
> >>>>> This morning the system was unresponsive via ssh. Attempts to log
> >>>>> in at
> >>> the console did not yield a password prompt.
> >>>>>
> >>>>> A power cycle brought the system online.  Inspecting
> >>>>> /var/log/messages,
> >> I
> >>> found about 63,000 entries similar to those which appear below.
> >>>>>
> >>>>> zpool status of all are OK. A scrub is in progress for one pool
> >>>>> (since
> >> before
> >>> this issue arose). da7 is in that pool.
> >>>>>
> >>>>>
> >>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> >>>>> 00 8d 90 c6 18 00 00 10 00 length 8192 SMID 774 terminated ioc
> >>>>> 804b scsi
> >>>>> 0 state c xfer 0 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0):
> >>>>> READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00 length 16384 SMID 614
> >>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
> >>>>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 50 00 00
> >>>>> 20
> >>>>> 00 length 16384 SMID 792 terminated ioc 804b scsi 0 state c xfer 0
> >>>>> Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> >>>>> 00 8b d9 97 08 00 00 20 00 length 16384 SMID 974 terminated ioc
> >>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew kernel:
> (da7:mps1:0:17:0):
> >>>>> READ(10). CDB: 28 00 8b 6f ef 50 00 00 08 00 length 4096 SMID 674
> >>>>> terminated ioc 804b scsi 0 state c xfer 0 Apr 24 11:25:55 knew
> >>>>> kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 8b 0f a2 48 00 00
> >>>>> 18
> >>>>> 00 length 12288 SMID 177 terminated ioc 804b scsi 0 state c xfer
> >>>>> 12288 Apr 24 11:25:55 knew kernel: (da7:mps1:0:17:0): READ(10). CDB:
> >>>>> 28 00 ab 8f a1 38 00 00 08 00 length 4096 SMID 908 terminated ioc
> >>>>> 804b scsi 0 state c xfer 0 Apr 24 11:25:56 knew kernel:
> >>>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 8b d9 97 70 00 00 20 00
> >>>>> length 16384 SMID 376 terminated ioc 804b scsi 0 state c xfer 0
> >>>>> Apr
> >>>>> 24 11:25:56 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00
> >>>>> 8b
> >>>>> d9 97 50 00 00 20 00 length 16384 SMID 172 terminated ioc 804b
> >>>>> scsi 0 state c xfer 0
> >>>>>
> >>>>> Is this a cabling issue?  The drive is a SATA device (smartctl
> >>>>> output
> >> in the
> >>> URL above).  Anyone familiar with these errors?
> >>>>
> >>>> This morning:
> >>>>
> >>>> 13410079654596185797  REMOVED      0     0     0  was /dev/da7p3
> >>>>
> >>>> At least I know i'm looking for Serial Number: 13Q8PNBYS
> >>>>
> >>>> From the logs:
> >>>>
> >>>> Apr 25 05:34:50 knew kernel: da7 at mps1 bus 0 scbus1 target 17 lun
> >>>> 0 Apr 25 05:34:50 knew kernel: da7: <ATA TOSHIBA DT01ACA3 ABB0>
> s/n
> >>> 13Q8PNBYS detached
> >>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> >>>> 00
> >>>> d8 33 53 e0 00 00 08 00 length 4096 SMID 88 terminated ioc 804b
> >>>> scsi 0 state c xfer 0 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
> >>>> READ(10). CDB: 28 00 d8 33 26 f8 00 00 20 00 length 16384 SMID 204
> >>>> terminated ioc 804b scsi 0 state c xfer(da7:mps1:0:17:0): READ(10). CDB:
> >> 28
> >>> 00 d8 33 53 e0 00 00 08 00 Apr 25 05:34:51 knew kernel: 0 Apr 25
> >>> 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status: Unconditionally
> >>> Re-queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0):
> >>> READ(10). CDB: 28
> >>> 00 d8 33 26 d8 00 00 20 00 length 16384 SMID 260 terminated ioc 804b
> >>> scsi
> >> 0
> >>> state c xfer(da7: 0
> >>>> Apr 25 05:34:51 knew kernel: mps1:0:    (da7:mps1:0:17:0): READ(10).
> >> CDB:
> >>> 28 00 e6 6c 42 40 00 00 10 00 length 8192 SMID 484 terminated ioc
> >>> 804b
> >> scsi 0
> >>> state c xfer 17:0
> >>>> Apr 25 05:34:51 knew kernel: 0):        (da7:mps1:0:17:0): WRITE(10).
> >> CDB: 2a
> >>> 00 e4 d8 2a 90 00 00 90 00 length 73728 SMID 548 terminated ioc 804b
> >>> scsi
> >> 0
> >>> state c xfeError 5, Periph was invalidated
> >>>> Apr 25 05:34:51 knew kernel: r 0
> >>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28
> >>>> 00
> >>>> d8 33 26 f8 00 00 20 00 Apr 25 05:34:51 knew kernel:
> >>>> (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08 00
> >>>> length
> >>>> 4096 SMID 435 terminated ioc 804b scsi 0 state c xfer
> >>>> (da7:mps1:0:17:0): CAM status: Unconditionally Re-queue Request Apr
> >>>> 25
> >>>> 05:34:51 knew kernel: 0 Apr 25 05:34:51 knew kernel: (da7:mps1:
> >>>> mps1:0:IOCStatus = 0x4b while resetting device 0xa Apr 25 05:34:51
> >>>> knew kernel: 17:mps1: 0): Unfreezing devq for target ID 17 Apr 25
> >>>> 05:34:51 knew kernel: Error 5, Periph was invalidated Apr 25
> >>>> 05:34:51 knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 d8 33
> >>>> 26 d8 00 00
> >>>> 20 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> >>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> >>>> knew kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 e6 6c 42 40 00
> >>>> 00
> >>>> 10 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >>>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> >>>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> >>>> knew kernel: (da7:mps1:0:17:0): WRITE(10). CDB: 2a 00 e4 d8 2a 90
> >>>> 00
> >>>> 00 90 00 Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >>> Unconditionally Re-queue Request Apr 25 05:34:51 knew kernel:
> >>> (da7:mps1:0:17:0): Error 5, Periph was invalidated Apr 25 05:34:51
> >>> knew
> >>> kernel: (da7:mps1:0:17:0): READ(10). CDB: 28 00 4d ac ed b8 00 00 08
> >>> 00
> >> Apr
> >>> 25 05:34:51 knew kernel: (da7:mps1:0:17:0): CAM status:
> >>> Unconditionally
> >> Re-
> >>> queue Request Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Error
> >>> 5, Periph was invalidated Apr 25 05:34:51 knew kernel: GEOM_MIRROR:
> >>> Device
> >>> swap: provider da7p2 disconnected.
> >>>> Apr 25 05:34:51 knew devd: Executing 'logger -p kern.notice -t ZFS
> >>>> 'vdev
> >> is
> >>> removed, pool_guid=15378250086669402288
> >>> vdev_guid=13410079654596185797''
> >>>> Apr 25 05:34:51 knew kernel: (da7:mps1:0:17:0): Periph destroyed
> >>>> Apr
> >>>> 25 05:34:51 knew ZFS: vdev is removed,
> >>>> pool_guid=15378250086669402288
> >>>> vdev_guid=13410079654596185797
> >>>
> >>> Current status: after powering off the box, reseating the cables for
> >>> that
> >> drive, I
> >>> powered up the system and a resilver commenced which completed in 30
> >>> minutes.
> >>>
> >>> Seems OK now.  I am not sure if the two events are related.
> >>
> >> Recently, a bug was uncovered where a device is gets 'lost'.
> >>
> >> Here's what happens:
> >> The first message in your "failure on Monday"  log is for
> >> 'mpssas_prepare_remove'.  This message is likely logged because the
> >> FW sends an event to the driver that the device is no longer
> >> responsive (pulled, cable issue, or something else).  When the driver
> >> gets this event, it sends a reset to the device to clear out any
> >> pending I/O.  This is where the 'terminated ioc' messages come from.
> >> When the reset completes, the driver is supposed to send a
> >> SAS_IO_UNIT message to FW so that the DevHandle for that disk is
> >> removed from FW's list.  Then, when the device comes back on-line,
> >> everything is fine. But, with this bug, before that SAS_IO_UNIT
> >> message is sent to FW, the driver exits the function where that
> >> happens (mpssas_remove_device).  This happens where you see the log
> >> message, "IOCStatus - 0x4b while resetting device 0x0a".  The driver
> >> logs that message and then exits.  What the driver should do is log
> >> that message and continue on to send the SAS_IO_UNIT message to FW.
> The fix is to remove the two lines in the driver shown here with '>>':
> >>
> >>    if (le16toh(reply->IOCStatus) != MPI2_IOCSTATUS_SUCCESS) {
> >>        mps_printf(sc, "IOCStatus = 0x%x while resetting device
> >> 0x%x\n",
> >>           le16toh(reply->IOCStatus), handle);
> >>>>        mpssas_free_tm(sc, tm);
> >>>>        return;
> >>    }
> >>
> >> A reboot will solve the problem, as you saw, but the real fix is to
> >> remove the DevHandle as described above.  This fix will go into the
> >> driver before the next scheduled release and then MFC'd to 10.x.
> >
> > Thanks for the diagnosis, Steve.  I forgot about that case.  We should
> > also make this chain of events more evident in the syslog, it’s very confusing
> when it happens.
> > I’m not exactly sure yet what it should look like.
> >
> > Scott
> >
> 
> Steve,
> 
> We are seeing this same error case for mpr and mps in FreeNAS. We are
> decoupled from FreeBSD releases and have an auto-update mechanism.
> 
> I'd be very happy to test patches against stable/10 on our nightly users.

Hi Josh, this fix just went into head with commit r299274.  I'll try to get this into stable/10 next week.  Do you need anything else?

Steve

help

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?22c001d1aace$c41ffa70$4c5fef50$>

Header And Logo

Peripheral Links

Site Navigation

Header And Logo

Peripheral Links

Search

Site Navigation