From owner-freebsd-bugs@freebsd.org  Wed May  6 20:21:59 2020
Return-Path: <owner-freebsd-bugs@freebsd.org>
Delivered-To: freebsd-bugs@mailman.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.nyi.freebsd.org (Postfix) with ESMTP id 521DB2DEDE7
 for <freebsd-bugs@mailman.nyi.freebsd.org>;
 Wed,  6 May 2020 20:21:59 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from mailman.nyi.freebsd.org (unknown [127.0.1.3])
 by mx1.freebsd.org (Postfix) with ESMTP id 49HSgC1Q86z48cY
 for <freebsd-bugs@freebsd.org>; Wed,  6 May 2020 20:21:59 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: by mailman.nyi.freebsd.org (Postfix)
 id 305B02DEDE1; Wed,  6 May 2020 20:21:59 +0000 (UTC)
Delivered-To: bugs@mailman.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
 by mailman.nyi.freebsd.org (Postfix) with ESMTP id 2FEB42DEDE0
 for <bugs@mailman.nyi.freebsd.org>; Wed,  6 May 2020 20:21:59 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from mxrelay.nyi.freebsd.org (mxrelay.nyi.freebsd.org
 [IPv6:2610:1c1:1:606c::19:3])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 server-signature RSA-PSS (4096 bits)
 client-signature RSA-PSS (4096 bits) client-digest SHA256)
 (Client CN "mxrelay.nyi.freebsd.org",
 Issuer "Let's Encrypt Authority X3" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 49HSgC0jyXz48cX
 for <bugs@FreeBSD.org>; Wed,  6 May 2020 20:21:59 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2610:1c1:1:606c::50:1d])
 (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
 server-signature RSA-PSS (4096 bits))
 (Client did not present a certificate)
 by mxrelay.nyi.freebsd.org (Postfix) with ESMTPS id 13CF05358
 for <bugs@FreeBSD.org>; Wed,  6 May 2020 20:21:59 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org ([127.0.1.5])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id 046KLw8Q044134
 for <bugs@FreeBSD.org>; Wed, 6 May 2020 20:21:58 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
Received: (from www@localhost)
 by kenobi.freebsd.org (8.15.2/8.15.2/Submit) id 046KLwQE044112
 for bugs@FreeBSD.org; Wed, 6 May 2020 20:21:58 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
X-Authentication-Warning: kenobi.freebsd.org: www set sender to
 bugzilla-noreply@freebsd.org using -f
From: bugzilla-noreply@freebsd.org
To: bugs@FreeBSD.org
Subject: [Bug 224496] mpr and mps drivers seems to have issues with large
 seagate drives
Date: Wed, 06 May 2020 20:21:57 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 11.1-STABLE
X-Bugzilla-Keywords: 
X-Bugzilla-Severity: Affects Some People
X-Bugzilla-Who: freebsd@diyit.org
X-Bugzilla-Status: New
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: bugs@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: 
Message-ID: <bug-224496-227-gfJzjX1g5r@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-224496-227@https.bugs.freebsd.org/bugzilla/>
References: <bug-224496-227@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-bugs@freebsd.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Bug reports <freebsd-bugs.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-bugs>,
 <mailto:freebsd-bugs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-bugs/>
List-Post: <mailto:freebsd-bugs@freebsd.org>
List-Help: <mailto:freebsd-bugs-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-bugs>,
 <mailto:freebsd-bugs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 06 May 2020 20:21:59 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D224496

--- Comment #41 from Sharad Ahlawat <freebsd@diyit.org> ---
(In reply to Sharad Ahlawat from comment #28)

Symptom:
CAM retry and timeout messages leading to controller aborts and resets

Cause:
slow disks or using SMR disks

Workaround:
Increase the CAM timeout defaults

=E2=9D=AF sysctl kern.cam.da.default_timeout=3D90
kern.cam.da.default_timeout: 60 -> 90
=E2=9D=AF sysctl kern.cam.ada.default_timeout=3D60
kern.cam.ada.default_timeout: 30 -> 60

And disable NCQ on SMR Seagates:
=E2=9D=AF cat cam-tags.sh
#!/usr/local/bin/bash
#shrinking the command Native Command Queue down to 1 effectively disabling
queuing
for Disk in `camcontrol devlist | grep "ST8000DM" | cut -d"," -f2 | cut -d"=
)"
-f1`;
do
    camcontrol tags $Disk -N 1 ;
    # camcontrol tags $Disk -v
done

If you only have SMRs in your setup and use an UPS you could also:
=E2=9D=AF sysctl vfs.zfs.cache_flush_disable=3D1

Solution:
don't use slow disks and SMRs disks with ZFS


The long version:

I am obliged to post this update given the driver downgrade workaround I
previously posted on this thread before getting to the root cause for these
logs in my messages file after upgrading to 12.x

Jan 18 17:29:28 nas kernel: ahcich6: Timeout on slot 8 port 0
Jan 18 17:29:28 nas kernel: ahcich6: is 00000000 cs 00000100 ss 00000000 rs
00000100 tfd c0 serr 00000000 cmd 0000c817
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): FLUSHCACHE48. ACB: ea 00 =
00
00 00 40 00 00 00 00 00 00
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeo=
ut
Jan 18 17:29:28 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 0 more
tries remain
Jan 18 17:30:00 nas kernel: ahcich6: AHCI reset: device not ready after 310=
00ms
(tfd =3D 00000080)
Jan 18 17:30:30 nas kernel: ahcich6: Timeout on slot 9 port 0
Jan 18 17:30:30 nas kernel: ahcich6: is 00000000 cs 00000200 ss 00000000 rs
00000200 tfd 80 serr 00000000 cmd 0000c917
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): ATA_IDENTIFY. ACB: ec =
00
00 00 00 40 00 00 00 00 00 00
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): CAM status: Command
timeout
Jan 18 17:30:30 nas kernel: (aprobe0:ahcich6:0:0:0): Retrying command, 0 mo=
re
tries remain

Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 1039 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 1357 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 1933 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0=
0 01
45 fb 37 c8 00 00 00 b0 00 00
Apr 25 22:28:12 nas kernel: mps0: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0=
0 01
45 fb 38 78 00 00 00 58 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a1 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: Controller reported scsi ioc terminated tgt 11 =
SMID
621 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 476 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 321 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 1873 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 1852 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 1742 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 387 loginfo 31080000
Apr 25 22:28:12 nas kernel: mps0: Controller reported scsi ioc terminated t=
gt
11 SMID 2104 loginfo 31080000
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a2 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a3 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a4 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a5 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a6 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a7 d8 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): WRITE(16). CDB: 8a 00 00 00 =
00
01 b4 c0 a8 d8 00 00 00 c0 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0=
0 01
45 fb 37 c0 00 00 00 08 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: CCB request
completed with an error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command, 3 more tri=
es
remain
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): READ(16). CDB: 88 00 00 00 0=
0 01
45 fb 36 c0 00 00 01 00 00 00
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): CAM status: SCSI Status Error
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND
asc:47,3 (Information unit iuCRC error detected)
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): Retrying command (per sense
data)

Apr 29 14:07:38 nas kernel: ahcich6: Timeout on slot 22 port 0
Apr 29 14:07:38 nas kernel: ahcich6: is 00000000 cs 00000000 ss 00400000 rs
00400000 tfd 40 serr 00000000 cmd 0000d617
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): READ_FPDMA_QUEUED. ACB: 6=
0 80
80 44 e8 40 c9 01 00 00 00 00
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): CAM status: Command timeo=
ut
Apr 29 14:07:38 nas kernel: (ada6:ahcich6:0:0:0): Retrying command, 3 more
tries remain
Apr 29 14:08:09 nas kernel: ahcich6: AHCI reset: device not ready after 310=
00ms
(tfd =3D 00000080)

=E2=9D=AF smartctl -a /dev/da4
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED=20
WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always=20=
=20=20=20=20=20
-       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always=20=
=20=20=20=20=20
-       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline=
=20=20=20=20=20
-       0
199 UDMA_CRC_Error_Count=C2=A0 =C2=A0 0x003e =C2=A0 200 =C2=A0 178 =C2=A0 0=
00=C2=A0 =C2=A0 Old_age =C2=A0 Always =C2=A0 =C2=A0 =C2=A0
- =C2=A0 =C2=A0 =C2=A0 419
Note: Every controller ABORTED COMMAND increases the ID# 199 count

Messages are correlated with sustained disk operations events:
        scrubs
        large file transfers
        cron jobs kicking off on multiple jails accessing the same pool, et=
c.

Ruled out any issues with:
        loose cable
        cable quality
        controller hardware, bios and firmware
        power supply

=E2=9D=AF mpsutil show all
Adapter:
mps0 Adapter:
       Board Name: SAS9207-4i4e
   Board Assembly: H3-25434-00H
        Chip Name: LSISAS2308
    Chip Revision: ALL
    BIOS Revision: 7.39.02.00
Firmware Revision: 20.00.07.00
  Integrated RAID: no

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0009       N         6.0     1.5    6.0    SAS Initiator
1       0002        000a       N         6.0     1.5    6.0    SAS Initiator
2       0003        000b       N         6.0     1.5    6.0    SAS Initiator
3       0004        000c       N         6.0     1.5    6.0    SAS Initiator
4       0005        000d       N         6.0     1.5    6.0    SAS Initiator
5       0006        000e       N         6.0     1.5    6.0    SAS Initiator
6       0007        000f       N         6.0     1.5    6.0    SAS Initiator
7       0008        0010       N         6.0     1.5    6.0    SAS Initiator

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot =
 Wdt
00   10   4433221100000000 0009    0001      SATA Target   6.0   0001 00   =
 1
00   09   4433221101000000 000a    0002      SATA Target   6.0   0001 01   =
 1
00   11   4433221102000000 000b    0003      SATA Target   6.0   0001 02   =
 1
00   07   4433221103000000 000c    0004      SATA Target   6.0   0001 03   =
 1
00   13   4433221104000000 000d    0005      SATA Target   6.0   0001 07   =
 1
00   04   4433221105000000 000e    0006      SATA Target   6.0   0001 06   =
 1
00   14   4433221106000000 000f    0007      SATA Target   6.0   0001 05   =
 1
00   12   4433221107000000 0010    0008      SATA Target   6.0   0001 04   =
 1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  08    500605b0073400a0               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level


=E2=9D=AF camcontrol devlist

<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus0 target 0 lun 0 (ada0,pass0)
<Samsung SSD 850 PRO 256GB EXM02B6Q>  at scbus1 target 0 lun 0 (ada1,pass1)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus2 target 0 lun 0 (ada2,pass2)
<Samsung SSD 860 EVO 500GB RVT03B6Q>  at scbus3 target 0 lun 0 (ada3,pass3)
<ST8000DM004-2CX188 0001>          at scbus4 target 0 lun 0 (ada4,pass4)
<ST8000DM004-2CX188 0001>          at scbus5 target 0 lun 0 (ada5,pass5)
<ST8000DM004-2CX188 0001>          at scbus6 target 0 lun 0 (ada6,pass6)
<ST8000DM004-2CX188 0001>          at scbus7 target 0 lun 0 (ada7,pass7)
<WDC WD20EFRX-68EUZN0 80.00A80>    at scbus8 target 0 lun 0 (ada8,pass8)
<WDC WD20EFRX-68EUZN0 80.00A80>    at scbus9 target 0 lun 0 (ada9,pass9)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 4 lun 0 (pass10,da0)
<ATA ST4000NM0024-1HT SN05>        at scbus10 target 7 lun 0 (pass11,da1)
<ATA ST4000NM0024-1HT SN05>        at scbus10 target 9 lun 0 (pass12,da2)
<ATA ST4000NM0024-1HT SN02>        at scbus10 target 10 lun 0 (pass13,da3)
<ATA ST4000NM0024-1HT SN06>        at scbus10 target 11 lun 0 (pass14,da4)
<ATA ST4000DX001-1CE1 CC44>        at scbus10 target 12 lun 0 (pass15,da5)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 13 lun 0 (pass16,da6)
<ATA ST4000DX001-1CE1 CC45>        at scbus10 target 14 lun 0 (pass17,da7)

Notice: da4 has a different firmware revision and most probably the reason =
that
I see relatively fewer timeouts errors on da1, da2 and da3, the other disks=
 in
the same ZFS pool.


To understand these errors messages:

https://en.wikipedia.org/wiki/SCSI_command
https://en.wikipedia.org/wiki/SCSI_CDB

=E2=9D=AF cat /var/log/messages | grep "SCSI status"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI status: Check Condition

https://en.wikipedia.org/wiki/SCSI_Status_Code#02h_Check_Condition
https://en.wikipedia.org/wiki/SCSI_contingent_allegiance_condition

=E2=9D=AF cat /var/log/messages | grep "SCSI sense"
Apr 25 22:28:12 nas kernel: (da4:mps0:0:11:0): SCSI sense: ABORTED COMMAND
asc:47,3 (Information unit iuCRC error detected)

https://en.wikipedia.org/wiki/Key_Code_Qualifier
B       47      00      Aborted Command - SCSI parity error

https://www.t10.org/lists/asc-num.htm#ASC_47
47h/03h  DZTPROMAEBKVF  INFORMATION UNIT iuCRC ERROR DETECTED

CPB:
        - Byte 0-1 : other
        - Byte 2-9 : 8 bytes is the logical block address
        - Byte 10-13 : 4 bytes is the transfer length
        - Byte 14-5 : other

So in my logs we are writing 0x100 =3D 256 bytes of data and throwing error=
s,
writing to consecutive blocks. The retries generally succeed and subsequent
scrubs don't find any errors, though you may see CKSUM errors during a scrub
which I associated with the abort and controller reset events.


Summary:
12.x FreeBSD and mps driver code is more efficient than 11.x, pushing more =
data
back and forth from the disks and that has led to these timeout events

I also experimented and found that reducing the memory clock speed in the b=
ios
is another workaround to stop these timeout errors with 12.x. Basically tha=
t is
what my earlier posted workaround of using 11.x mps drivers with 12.x was
effectively doing, slowing down the data transfer rates so the disks could =
keep
up.


Disclaimer:
These observations, analysis and workarounds are specific to my setup, plea=
se
adapt and incorporate accordingly.

--=20
You are receiving this mail because:
You are the assignee for the bug.=