From owner-freebsd-scsi@freebsd.org  Mon Jun  6 08:51:43 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7F4CDB6D900
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Mon,  6 Jun 2016 08:51:43 +0000 (UTC)
 (envelope-from borjam@sarenet.es)
Received: from cu01176b.smtpx.saremail.com (cu01176b.smtpx.saremail.com
 [195.16.151.151])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 486991798
 for <freebsd-scsi@freebsd.org>; Mon,  6 Jun 2016 08:51:42 +0000 (UTC)
 (envelope-from borjam@sarenet.es)
Received: from [172.16.8.36] (izaro.sarenet.es [192.148.167.11])
 by proxypop01.sare.net (Postfix) with ESMTPSA id A72EA9DD7CD;
 Mon,  6 Jun 2016 10:42:32 +0200 (CEST)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
From: Borja Marcos <borjam@sarenet.es>
In-Reply-To: <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
Date: Mon, 6 Jun 2016 10:42:32 +0200
Cc: freebsd-scsi@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
To: Steven Hartland <killing@multiplay.co.uk>
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 06 Jun 2016 08:51:43 -0000


> On 03 Jun 2016, at 23:49, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> First thing would be to run gstat with -d to see if you're actually stacking up deletes, a symptom of which can be r/w dropping to zero.
>
> If you are seeing significant deletes it could be a FW issue on the drives.

Hmm. I’ve suffered that badly with Intel P3500 NVMe drives, which suffer at least from a driver problem: trims are not coalesced.
However, I didn’t experience command timeouts. Reads and, especially, writes stalled badly.

A quick test for trim-related trouble is setting the sysctl variable vfs.zfs.vdev.bio_delete_disable to 1. It doesn’t require
a reboot and you can quickly compare results.
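
A minimal sketch of that A/B test, assuming a FreeBSD host where this sysctl is writable at runtime as described above. The function names and the RUN dry-run switch are illustrative only; by default the commands are printed, not executed.

```shell
#!/bin/sh
# Dry run by default: RUN=echo only prints each command.
# Run with RUN= (empty) as root on the FreeBSD host to execute them.
RUN=${RUN-echo}

# Stop ZFS from issuing BIO_DELETE (TRIM) requests to the vdevs.
trim_disable() { $RUN sysctl vfs.zfs.vdev.bio_delete_disable=1; }

# Restore the default behaviour once the comparison is done.
trim_enable() { $RUN sysctl vfs.zfs.vdev.bio_delete_disable=0; }

# Watch the delete columns (d/s, ms/d) while the workload runs.
watch_deletes() { $RUN gstat -d; }

trim_disable
```

Run the workload once after trim_disable and once after trim_enable, and compare the error logs from both runs.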

In my case, a somewhat similar problem in an IBM server seems to have been caused by a faulty LSI3008 card. As I didn’t have spare LSI3008 cards
at the time, I replaced it with an LSI2008 and everything works perfectly. Before anyone chimes in suggesting card incompatibility of some sort,
I have a twin system with an LSI3008 working like a charm. ;)


Borja.


From owner-freebsd-scsi@freebsd.org  Mon Jun  6 22:19:12 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id E40F3B63C72
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Mon,  6 Jun 2016 22:19:12 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id C196317B5
 for <freebsd-scsi@freebsd.org>; Mon,  6 Jun 2016 22:19:11 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10])
 by mail.furymx.com (Postfix) with ESMTP id D084921AA6C
 for <freebsd-scsi@freebsd.org>; Mon,  6 Jun 2016 17:19:04 -0500 (CDT)
X-Virus-Scanned: amavisd-new at furymx.com
Received: from mail.furymx.com ([10.10.1.10])
 by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024)
 with ESMTP id QU7in_cUxwyD for <freebsd-scsi@freebsd.org>;
 Mon,  6 Jun 2016 17:19:02 -0500 (CDT)
Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net
 [98.215.180.176])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: kyle@mindpackstudios.com)
 by mail.furymx.com (Postfix) with ESMTPSA id C90D221AA58
 for <freebsd-scsi@freebsd.org>; Mon,  6 Jun 2016 17:19:02 -0500 (CDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
From: list-news <list-news@mindpackstudios.com>
Message-ID: <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
Date: Mon, 6 Jun 2016 17:19:02 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 06 Jun 2016 22:19:13 -0000

System was running solid all weekend with camcontrol tags set to 1 for 
each device, zero errors.

Last week I did try
# sysctl kern.cam.da.X.delete_method=DISABLE
for each drive, but it still threw errors.

Also, I did try out bio_delete_disable earlier today:
# camcontrol tags daX -N 255
(First resetting tags back to 255 for each device, as they are 
currently at 1.)

# sysctl vfs.zfs.vdev.bio_delete_disable=1
(a few minutes later)

Jun  6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 577 command timeout cm 0xfffffe0001351550 ccb 0xfffff804e78e3800 target 12, handle(0x000c)
Jun  6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name (    )
Jun  6 12:28:36 s18 kernel: mpr0: timedout cm 0xfffffe0001351550 allocated tm 0xfffffe0001322150
Jun  6 12:28:36 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001351550
Jun  6 12:28:36 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
Jun  6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 18 45 1c c0 00 00 08 00 length 4096 SMID 583 command timeout cm 0xfffffe0001351d30 ccb 0xfffff806b9556800 target 12, handle(0x000c)
Jun  6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name (    )
Jun  6 12:28:36 s18 kernel: mpr0: queued timedout cm 0xfffffe0001351d30 for processing by tm 0xfffffe0001322150
...
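
For reading the CDB bytes in these log lines, here is a small sketch that decodes a 10-byte READ(10)/WRITE(10) CDB into its logical block address and transfer length. The function name is illustrative; the byte layout used is the standard 10-byte CDB (opcode, flags, 4-byte big-endian LBA, group, 2-byte big-endian length, control).

```shell
#!/bin/sh
# Decode a READ(10)/WRITE(10) CDB as printed by the kernel.
# Arguments: the ten CDB bytes in hex, e.g. 28 00 18 45 1c c0 00 00 08 00
decode_cdb10() {
    # Bytes 3-6 (1-based) hold the big-endian LBA.
    lba=$(printf '%d' "0x$3$4$5$6")
    # Bytes 8-9 hold the big-endian transfer length in logical blocks.
    blocks=$(printf '%d' "0x$8$9")
    echo "opcode=0x$1 lba=$lba blocks=$blocks"
}

# The timed-out READ(10) from the log above.
decode_cdb10 28 00 18 45 1c c0 00 00 08 00
# prints: opcode=0x28 lba=407182528 blocks=8
```

Eight blocks against the logged "length 4096" is consistent with 512-byte logical blocks.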

During the 60 second hang:
# gstat -do
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
   70      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da7
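
The pattern in the gstat output above (a non-zero queue length with zero completed operations) can be picked out mechanically. A small sketch, assuming unwrapped gstat output on stdin; the function name is illustrative.

```shell
#!/bin/sh
# Print devices that have queued I/O (L(q) > 0) but completed nothing (ops/s == 0).
stalled_devices() {
    # Column 1 is L(q), column 2 is ops/s, the last column is the device name.
    # Header and "dT:" lines are skipped because column 1 is not a number.
    awk '$1 ~ /^[0-9]+$/ && $1 > 0 && $2 == 0 { print $NF, "queue=" $1 }'
}

# Two rows taken from the output above.
stalled_devices <<'EOF'
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
   70      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
EOF
# prints: da2 queue=70
```

On a live system the input could come from gstat's batch mode, for example "gstat -bd | stalled_devices".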


Also during the 60 second hang:
# camcontrol tags da3 -v
(pass2:mpr0:0:12:0): dev_openings  248
(pass2:mpr0:0:12:0): dev_active    7
(pass2:mpr0:0:12:0): allocated     7
(pass2:mpr0:0:12:0): queued        0
(pass2:mpr0:0:12:0): held          0
(pass2:mpr0:0:12:0): mintags       2
(pass2:mpr0:0:12:0): maxtags       255

Also during the 60 second hang:
# sysctl dev.mpr
dev.mpr.0.spinup_wait_time: 3
dev.mpr.0.chain_alloc_fail: 0
dev.mpr.0.enable_ssu: 1
dev.mpr.0.max_chains: 2048
dev.mpr.0.chain_free_lowwater: 2022
dev.mpr.0.chain_free: 2048
dev.mpr.0.io_cmds_highwater: 71
dev.mpr.0.io_cmds_active: 4
dev.mpr.0.driver_version: 09.255.01.00-fbsd
dev.mpr.0.firmware_version: 10.00.03.00
dev.mpr.0.disable_msi: 0
dev.mpr.0.disable_msix: 0
dev.mpr.0.debug_level: 895
dev.mpr.0.%parent: pci1
dev.mpr.0.%pnpinfo: vendor=0x1000 device=0x0097 subvendor=0x15d9 subdevice=0x0808 class=0x010700
dev.mpr.0.%location: pci0:1:0:0 handle=\_SB_.PCI0.BR1A.H000
dev.mpr.0.%driver: mpr
dev.mpr.0.%desc: Avago Technologies (LSI) SAS3008
dev.mpr.%parent:

Something else that may be worth considering: I ran fio & bonnie++ for 
about an hour of heavy I/O (with tags still set to 255 and drive busy 
showing 90-100%).  No errors.  Then I fired up my application (a 
threaded Java/Postgres application), and within minutes:

# gstat -do
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
   71      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da7

*Error:*
Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30 65 13 90 00 00 10 00
Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
...

*And again 2 minutes later:*

Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21 66 63 58 00 00 10 00
Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per sense data)
...

*And again 3 minutes later:*

Jun  6 13:41:29 s18 kernel: (da7:mpr0:0:18:0): WRITE(10). CDB: 2a 00 33 44 b5 b8 00 00 10 00
...

# camcontrol tags daX -N 1
(And now, after 15 minutes, zero errors.)

Thinking this over (I may be off base, so please feel free to correct 
me), I've noticed the following:

1) There doesn't seem to be any indication as to what causes the drive 
to time out.  The command that fails in the error log is one of the 
following: READ(10), WRITE(10), ATA COMMAND PASS THROUGH(16), or 
SYNCHRONIZE CACHE(10).  As I understand it, that is the command that 
was executing when the timeout hit and was then retried, not 
necessarily the one that caused the drive to lock up.

2) When my application runs, it hammers postgres pretty hard, and it is 
while postgres is running that I get the errors.  fio and bonnie++ 
don't give me errors; daily use of the system doesn't give me errors.  
I'm assuming postgresql is sending far more of a certain type of 
command to the I/O subsystem than those other applications, and the 
first one that comes to mind is fsync.

3) I turned fsync off in postgresql.conf (I'm brave for science!!) then 
ran my application again with tags at 255, 100% CPU load, 70-80% drive 
busy.

*1.5 hours later at full load - finally, a single timeout:*
Jun  6 16:31:33 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 2d 50 1b 78 00 00 08 00 length 4096 SMID 556 command timeout cm 0xfffffe000134f9c0 ccb 0xfffff83aa5b25000 target 12, handle(0x000c)

I ran it for another 20 minutes with no additional timeouts.
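
To compare runs like these, the timeouts can be tallied per device and per SCSI command. A minimal sketch, assuming unwrapped syslog lines on stdin in the mpr format shown in this thread; the function name is illustrative.

```shell
#!/bin/sh
# Count mpr "command timeout" records per device and SCSI command.
tally_timeouts() {
    awk '/command timeout cm/ {
        # Device: the text between the first "(" and the following ":".
        dev = $0; sub(/^[^(]*\(/, "", dev); sub(/:.*/, "", dev)
        # Command: the text between "): " and ". CDB:".
        cmd = $0; sub(/^[^)]*\): /, "", cmd); sub(/\. CDB:.*/, "", cmd)
        count[dev " " cmd]++
    }
    END { for (k in count) print count[k], k }' | sort -k2
}

# Two records from the logs in this message.
tally_timeouts <<'EOF'
Jun  6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 577 command timeout cm 0xfffffe0001351550 ccb 0xfffff804e78e3800 target 12, handle(0x000c)
Jun  6 16:31:33 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 2d 50 1b 78 00 00 08 00 length 4096 SMID 556 command timeout cm 0xfffffe000134f9c0 ccb 0xfffff83aa5b25000 target 12, handle(0x000c)
EOF
```

On the affected host the same function could be fed from the system log, for example "tally_timeouts < /var/log/messages".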

I assume each fsync call turns into a zfs -> cam -> SYNCHRONIZE CACHE 
command for each device, and that postgres issues it considerably more 
often than a typical application (at least with fsync turned on in 
postgresql.conf).  That would explain why the error is rare when fsync 
is turned off or few fsyncs are being sent (i.e. typical system usage), 
yet shows up every few minutes when fsync is being sent repeatedly.  
The only reason I can think of for tags set to 1 eliminating the errors 
entirely is that the Intel drives don't handle parallel commands from 
cam well when one (or more) of them is SYNCHRONIZE CACHE.  Thoughts?
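
One way to probe that hypothesis without postgres is an fsync-heavy synthetic load. The following is only a sketch: the test directory is a hypothetical path, the job parameters (8k blocks, 16 jobs, 10 minutes) are guesses at a postgres-like pattern, and whether it reproduces the timeouts is untested. It prints the fio command by default instead of running it.

```shell
#!/bin/sh
# Dry run by default: RUN=echo only prints the command.
# Run with RUN= (empty) on the affected host to execute it.
RUN=${RUN-echo}

# Hypothetical scratch directory on the affected pool; adjust before use.
TESTDIR=${TESTDIR-/tank/fsync-test}

# Random 8k writes with an fsync after every write (--fsync=1),
# spread over 16 parallel jobs for 600 seconds.
fsync_load() {
    $RUN fio --name=fsync-heavy --directory="$TESTDIR" \
        --rw=randwrite --bs=8k --size=1g --numjobs=16 \
        --fsync=1 --time_based --runtime=600
}

fsync_load
```

If timeouts appear with --fsync=1 but not with the option removed, that would point at the cache flushes rather than at postgres itself.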

Thanks,

-Kyle




From owner-freebsd-scsi@freebsd.org  Tue Jun  7 06:35:09 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 477A1B6D2DE
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 06:35:09 +0000 (UTC)
 (envelope-from borjam@sarenet.es)
Received: from cu01176a.smtpx.saremail.com (cu01176a.smtpx.saremail.com
 [195.16.150.151])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 08F6612E0
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 06:35:08 +0000 (UTC)
 (envelope-from borjam@sarenet.es)
Received: from [172.16.8.36] (izaro.sarenet.es [192.148.167.11])
 by proxypop03.sare.net (Postfix) with ESMTPSA id 1E7E89DE019;
 Tue,  7 Jun 2016 08:25:19 +0200 (CEST)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
From: Borja Marcos <borjam@sarenet.es>
In-Reply-To: <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
Date: Tue, 7 Jun 2016 08:25:19 +0200
Cc: freebsd-scsi@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
To: list-news <list-news@mindpackstudios.com>
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 06:35:09 -0000


> On 07 Jun 2016, at 00:19, list-news <list-news@mindpackstudios.com> wrote:
>
> # sysctl vfs.zfs.vdev.bio_delete_disable=1
> (a few minutes later)

So trim is not causing it.

>
> *Error:*
> Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30 65 13 90 00 00 10 00
> Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
> Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
> Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
> ...
>
> *And again 2 minutes later:*
>
> Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21 66 63 58 00 00 10 00
> Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
> Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
> Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per sense data)


I suffered this particular symptom because of, it seems, a broken LSI3008 card. In the end I replaced it with an LSI2008 (I didn’t have a spare
LSI3008 handy) and the errors vanished. In my case it is an NFS storage server based on ZFS and Samsung SSD disks, serving several Xen
hosts.

In my case the disks are SATA.

I know that it was a defective card and not a problem with LSI3008 cards or the driver, because I have a twin system that has been working like a charm
from day zero.

I would try, if possible, to swap the controller.


Borja.


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 17:09:12 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0A487B6EDC2
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 17:09:12 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id DE4B11FBC
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 17:09:11 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10])
 by mail.furymx.com (Postfix) with ESMTP id 38327219A73;
 Tue,  7 Jun 2016 12:09:10 -0500 (CDT)
X-Virus-Scanned: amavisd-new at furymx.com
Received: from mail.furymx.com ([10.10.1.10])
 by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024)
 with ESMTP id t1SO_uXu9uP1; Tue,  7 Jun 2016 12:09:08 -0500 (CDT)
Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net
 [98.215.180.176])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: kyle@mindpackstudios.com)
 by mail.furymx.com (Postfix) with ESMTPSA id 59F36219A69;
 Tue,  7 Jun 2016 12:09:08 -0500 (CDT)
From: list-news <list-news@mindpackstudios.com>
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: Borja Marcos <borjam@sarenet.es>
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
Cc: freebsd-scsi@freebsd.org
Message-ID: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
Date: Tue, 7 Jun 2016 12:09:08 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 17:09:12 -0000

The system is a Twin.  In the first post I mentioned this but I probably 
wasn't clear.

The twin unit is this one:
https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm

I've used all components from twin node A and B (cpu / memory / 
mainboard / controller).  I still get the errors.  The backplane was the 
original suspect, and it has been RMA'd and replaced; the errors 
continue.  I've even swapped out power supplies with another identical 
unit I have here.

In every case the errors continue, until I do this:
# camcontrol tags daX -N 1
(for each drive in the zpool)

Then the errors stop.
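
Applied to every drive in the pool, that step could look like the sketch below. The device list is an assumption taken from the gstat output earlier in the thread (da2, da4, da6, da7), and the function name and RUN dry-run switch are illustrative; by default the commands are only printed.

```shell
#!/bin/sh
# Dry run by default: RUN=echo only prints each command.
# Run with RUN= (empty) as root on the FreeBSD host to execute them.
RUN=${RUN-echo}

# Assumed pool members; replace with the drives in your zpool.
DISKS=${DISKS-"da2 da4 da6 da7"}

# Set the number of tagged openings on every listed drive.
# $1 = tag count: 1 serialises commands, 255 is the value that showed errors.
set_tags() {
    for d in $DISKS; do
        $RUN camcontrol tags "$d" -N "$1"
    done
}

set_tags 1
```

Calling "set_tags 255" afterwards restores the setting under which the errors were seen.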

The system throws errors every few minutes while my application is 
running.  Set tags to -N 1, and everything goes quiet: 16 cores at 100% 
CPU and drives 80% busy at ~15k IOPS for about 5 hours solid until it 
finishes a batch, with no errors reported.  If I set tags with -N 255 
for each device, errors start again within 5 minutes and continue every 
2-5 minutes until the batch is finished.

-Kyle

> I would try, if possible, to swap the controller.
>
> Borja.
>
>


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 19:02:30 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 47F30B6ED10
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 19:02:30 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com
 [IPv6:2a00:1450:400c:c09::233])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id D92401D83
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 19:02:29 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x233.google.com with SMTP id k204so81978649wmk.0
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 12:02:29 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:from:message-id:date:user-agent:mime-version
 :in-reply-to; bh=5hIHKHVFrNrVSoO0S5WyI7gKopINdCIOgNvFVUELoo8=;
 b=GgR2wayFVgk1GC0X8Yzh/b0BpJfV88RG6dnq1mILcEsMpuHFqiEJ7H9oG+q844eAyW
 eVzDC88cf6K1U5Ikt+imlehkL0aHyxOR3B0ub/mlDZpvvlcy+R75T9N5xHQbmEfD3r4J
 WBGY070iWd+hkH5fFyKrv64+sXPGXh96CyjO1dxgz2JjdK56/4Z7Mx10QwaFSFZbO2MV
 cpgZNKnyl86ao7RlvrWkAfq7OmgkZ7B+silRJrMC70RQ7/OUSLtaoKVsPzuoxFl/A3f5
 EBKKfGv4pnwt5UZ6T1z/3Pxrn3CyJQj0fjwXi26aYFros9umqHzzua/LtYN1bc4PSwRe
 S0zQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=5hIHKHVFrNrVSoO0S5WyI7gKopINdCIOgNvFVUELoo8=;
 b=b2GUulLhPgqXzuuLsY/W8AjS/eSA400Hs83XbHkvNY6fxD/wiiY6MlTljJF7+zJgxA
 KEDXew3Y/46mjsrJX29qCSXQqNIna5k6qFl85DJTiwF/xn93306IhQXtEBlIjfUjSoHG
 CHuc/pCvG/UJdoc4AGHt5UsGMikiAsHxXtC/YuQz04072avOg17Nm4Jy/d54L9u24Nux
 oScXHePgqsi5HMXVi862BZdcNJmul+M5a2/bjjjMPKL48sek7PAhmhvMRPf15tpGGihO
 pGczbg/SRJcCff6SCeSIrkdv/bcWXEpmW0s2j3FymaVRABnqiHaqtgRB/ofB0qKnf0PG
 iA8w==
X-Gm-Message-State: ALyK8tJoMj1GKIs6yLNDdNR8XNxs35Y+aQXSEEpOfeMlwRMNwku6na1ULw8ldhUziLXMe9vB
X-Received: by 10.28.73.198 with SMTP id w189mr1164262wma.32.1465326147557;
 Tue, 07 Jun 2016 12:02:27 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171])
 by smtp.gmail.com with ESMTPSA id d7sm20832323wmd.11.2016.06.07.12.02.26
 for <freebsd-scsi@freebsd.org> (version=TLSv1/SSLv3 cipher=OTHER);
 Tue, 07 Jun 2016 12:02:26 -0700 (PDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk>
Date: Tue, 7 Jun 2016 20:02:31 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 19:02:30 -0000

Have you tried direct attaching the drives?



From owner-freebsd-scsi@freebsd.org  Tue Jun  7 19:24:40 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 08AB6B6D148
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 19:24:40 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id DB11D16F5
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 19:24:39 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10])
 by mail.furymx.com (Postfix) with ESMTP id 021FA21B7D5
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 14:24:38 -0500 (CDT)
X-Virus-Scanned: amavisd-new at furymx.com
Received: from mail.furymx.com ([10.10.1.10])
 by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024)
 with ESMTP id GdMerZY5S3fG for <freebsd-scsi@freebsd.org>;
 Tue,  7 Jun 2016 14:24:36 -0500 (CDT)
Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net
 [98.215.180.176])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: kyle@mindpackstudios.com)
 by mail.furymx.com (Postfix) with ESMTPSA id 0DC2D21B7CE
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 14:24:36 -0500 (CDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
From: list-news <list-news@mindpackstudios.com>
Message-ID: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
Date: Tue, 7 Jun 2016 14:24:35 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 19:24:40 -0000

I have additional confirmation that it's not faulty hardware.

I moved the 4 disks that carry the postgresql database over to another 
server (same model - TWIN 2028-DECR).  Mounted the zpool and fired up my 
application.

This server is using a much earlier firmware on the SAS controller.  
Different CPU / Memory / etc.

Errors happen within the first couple of minutes and continue every few 
minutes (note the time-stamps of each drive timeout):

Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 
74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 
74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS 
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 
512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 33 
91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 
f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 804b 
scsi 0 state c xfer 0
Jun  7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 
ea dc 60 00 00 08 00
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command timeout
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 
74 79 e0 00 00 08 00
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI Status 
Error
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check Condition
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT 
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per 
sense data)
Jun  7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 
Aborting command 0xfffffe0000be0140
Jun  7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
for target ID 10
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS 
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 
512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 
be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 804b 
scsi 0 state c xfer 0
Jun  7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35 
26 ca f0 00 00 08 00
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout
Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f6 ee f0 00 00 08 00
Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition
Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT 
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per 
sense data)
Jun  7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 
Aborting command 0xfffffe0000bfcca0
Jun  7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
for target ID 10
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS 
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 
512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 
8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 
6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 804b 
scsi 0 state c xfer 0
Jun  7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e 
d6 de f0 00 00 08 00
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout
Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): originator(PL), 
code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS 
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
completed with an error
Jun  7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): 
originator(PL), code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): 
originator(PL), code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: 0): mpr0: Retrying command
Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00
Jun  7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB 
request completed with an error
Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): 
originator(PL), code(0x12), sub_code(0x0440)
Jun  7 13:13:05 s17 kernel: 10:0): Retrying command
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 
8d 99 48 00 00 08 00
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
completed with an error
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 
6b df b8 00 00 10 00
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
completed with an error
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f7 cd a8 00 00 08 00
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
completed with an error
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
f7 cd b0 00 00 08 00
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
completed with an error
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e 
d6 de f0 00 00 08 00
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
completed with an error
Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00
Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition
Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT 
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted
Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack
Jun  7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 
Aborting command 0xfffffe0000c1e960
Jun  7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
for target ID 11
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS 
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 
512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 
7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 
bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 
80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 
state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 10 00
Jun  7 13:15:12 s17 kernel: 0
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command timeout
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 
7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 
state c xfer (da3:0
Jun  7 13:15:12 s17 kernel: mpr0:0:    (da3:mpr0:0:11:0): SYNCHRONIZE 
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 
terminated ioc 804b scsi 0 sta11:te c xfer 0
Jun  7 13:15:12 s17 kernel: 0): mpr0: Retrying command
Jun  7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status Error
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check Condition
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT 
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per 
sense data)
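
To put numbers on "every few minutes", here is a minimal Python sketch that 
measures the spacing between reset cycles.  It assumes the four 
"Unfreezing devq" lines from the log above are a fair marker for the end of 
each abort/reset cycle; the helper names are hypothetical and the year is 
supplied by hand because syslog does not record it.

```python
import re
from datetime import datetime

# Four "Unfreezing devq" lines copied from the log above; each one marks the
# end of an abort/reset cycle on a target.
LOG = """\
Jun  7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
Jun  7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun  7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun  7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
"""

EVENT = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*Unfreezing devq for target ID (\d+)"
)


def reset_events(text, year=2016):
    """Return a list of (timestamp, target_id), one per devq unfreeze line."""
    events = []
    for line in text.splitlines():
        m = EVENT.match(line)
        if m:
            # syslog omits the year, so the caller provides it.
            stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((stamp, int(m.group(2))))
    return events


def gaps_seconds(events):
    """Seconds between consecutive events, in log order."""
    return [int((b[0] - a[0]).total_seconds()) for a, b in zip(events, events[1:])]


events = reset_events(LOG)
for stamp, target in events:
    print(stamp.time(), "target", target)
print("gaps (s):", gaps_seconds(events))
```

On this excerpt the cycles are roughly two to two and a half minutes apart, 
spread over targets 14, 10 and 11.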

gstat output:
(I'm guessing I caught this during the da2 error)

#gstat -do
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
   70      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da3
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da10
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da11


I then set the tags down to 1 for each device:

#camcontrol tags da2 -N 1
#camcontrol tags da3 -N 1
#camcontrol tags da10 -N 1
#camcontrol tags da11 -N 1
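
For a pool with more disks, the same per-device commands could be generated 
rather than typed.  This is only a sketch: the function name and the device 
list are illustrative, and it prints the commands instead of running them.

```python
import shlex


def tag_commands(devices, depth):
    """Build one 'camcontrol tags <dev> -N <depth>' argv list per device."""
    if depth < 1:
        raise ValueError("tag depth must be at least 1")
    return [["camcontrol", "tags", dev, "-N", str(depth)] for dev in devices]


POOL_DISKS = ["da2", "da3", "da10", "da11"]  # the four disks named above

for argv in tag_commands(POOL_DISKS, 1):
    # Printed only.  To apply a command, pass argv to
    # subprocess.run(argv, check=True) as root on the FreeBSD host.
    print(shlex.join(argv))
```

Note that a depth set this way is a runtime setting, so it would have to be 
reapplied after a reboot or after the device is re-attached.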

And, no errors for the last hour, system still running at full load.

Everything feels like an NCQ firmware issue.  The Intel S3610 reports 
that it supports NCQ with 32 tags, but I've reproduced the errors with 
tags set to 8 plenty of times.

(See NCQ line below.)

# camcontrol identify da2

pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device
pass2: 1200.000MB/s transfers, Command Queueing Enabled
protocol              ATA/ATAPI-9 SATA 3.x
device model          INTEL SSDSC2BX480G4
firmware revision     G2010150
serial number         [redacted]
WWN [redacted]
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       937703088 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             non-rotating

Feature                      Support  Enabled   Value Vendor
read ahead                     yes    yes
write cache                    yes    yes
flush cache                    yes    yes
overlap                        no
Tagged Command Queuing (TCQ)   no     no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes    yes
microcode download             yes    yes
security                       yes    no
power management               yes    yes
advanced power management      no     no
automatic acoustic management  no     no
media status notification      no     no
power-up in Standby            no     no
write-read-verify              no     no
unload                         yes    yes
general purpose logging        yes    yes
free-fall                      no     no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks       yes              4
DSM - deterministic read       yes              zeroed
Host Protected Area (HPA)      yes      no 937703088/937703088
HPA - Security                 no

And it doesn't appear that I have any way to deactivate NCQ in firmware, 
which would be a nice test.  I did attempt this, with no luck:
# camcontrol negotiate da2 -T disable
(pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
(pass2:mpr0:0:10:0): tagged queueing: enabled
camcontrol: XPT_SET_TRANS_SETTINGS CCB failed

-Kyle


On 6/7/16 12:09 PM, list-news wrote:
> The system is a Twin.  In the first post I mentioned this but I 
> probably wasn't clear.
>
> The twin unit is this one:
> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>
> I've used all components from twin node A and B (cpu / memory / 
> mainboard / controller).  I still get the errors.  The backplane was 
> the original thought of concern, and that has been RMA'd and replaced 
> - errors continue.  I've even swapped out power supplies with another 
> identical unit I have here.
>
> In every case the errors continue, until I do this:
> #camcontrol tags daX -N 1
> (for each drive in the zpool)
>
> Then the errors stop.
>
> The system errors every few minutes while my application is running.  
> Set tags to -N 1, and everything goes quiet.  16 cores at 100% cpu and 
> drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it 
> finishes a batch, no errors are reported with -N set to 1.  If I set 
> tags with -N 255 for each device, errors start again within 5 minutes, 
> and continue every 2-5 minutes, until the batch is finished.
>
> -Kyle
>
>> I would try, if possible, to swap the controller.
>>
>>
>>
>>
>>
>>
>> Borja.
>>
>>
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 19:53:08 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8DFFEB6D843
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 19:53:08 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x232.google.com (mail-wm0-x232.google.com
 [IPv6:2a00:1450:400c:c09::232])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 087DB1528
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 19:53:08 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x232.google.com with SMTP id k204so83570160wmk.0
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 12:53:07 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:from:message-id:date:user-agent:mime-version
 :in-reply-to; bh=OyWE/LDvMML2a8/vBR7DS72Be3W4ajYMeIMfKmubKFc=;
 b=qaoRJQJk4pOrwFH6RMlcq465W+/CNyRbatsuAaD2N6mN43z/pcZhXJQ0GtIm5FC95A
 l2arOrHD9T8JlN6MI2oB+nFOo94W3EbxP2ZjhZpaufw1LewsvRFq6H3OPC3MOUiA9ha+
 FVt7552OWKtfF7TYvMzFAJnDnBZnzZQoaILFR8WmQzf1i8FGWRbQ1+y7WerAh/msB1G+
 ZwFkT0PqiN9Z4ZQRNjDdlzwn8AmitDYxSjv2+5YaoA4fol7PpnBTn3gesRCTjublXD78
 5AGSQjJWt+v5vQNaKeodmPWHuwc7jJN51cOnNXtvmQqRYedKOT7jzNIklhRrCoJYZ6oS
 qBNQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=OyWE/LDvMML2a8/vBR7DS72Be3W4ajYMeIMfKmubKFc=;
 b=kgoRWam3kz2/jpeAhpbCccVOnbwtUh6sEwPnkI9uEVe7GqSvnTebL/CuBf50OBmREP
 iQbQsS3xI8AP/qDHua/BKL5sz1DJ7ybEIva8SIGHO/jGv/1+EHnvudR0IB3LzGZxgFrX
 LMdaxd8s+zYcKR7Mb7iT3onBO1d0J6vIAQRjDLQHLI9QWp64JMdHewbNVORu3Ue2sWVT
 Kc9MsLCOf7VgvropWsi9EcEIBLuPYbsBmYVZtsecBieOxFPp35yqRz0W1bJdybZ8HW31
 TUCCBeceozJO7iw7OylJTcm1lhiPHkKCy3FMO6Ugtdu+6ovWA+mp5CoKD7dQqiZM/CRv
 7Q3Q==
X-Gm-Message-State: ALyK8tJJlke8H9vvETKopxIL3Dl+FjGE+B4tJFUeFza43ENMIGTio0ezcU6oJimIUkyNFX6P
X-Received: by 10.28.26.138 with SMTP id a132mr4425191wma.82.1465329186240;
 Tue, 07 Jun 2016 12:53:06 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171])
 by smtp.gmail.com with ESMTPSA id c62sm20884456wmd.1.2016.06.07.12.53.04
 for <freebsd-scsi@freebsd.org> (version=TLSv1/SSLv3 cipher=OTHER);
 Tue, 07 Jun 2016 12:53:05 -0700 (PDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
Date: Tue, 7 Jun 2016 20:53:10 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 19:53:08 -0000

CDB: 85 is an ATA PASS-THROUGH(16) carrying a TRIM command IIRC (ATA 
command 06, DATA SET MANAGEMENT).  I know you tried it before using BIO 
delete, but assuming you're running ZFS, can you set the following in 
loader.conf and see how you get on?
vfs.zfs.trim.enabled=0
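
The CDB from the log can be decoded by hand to check this.  The Python 
sketch below assumes the ATA PASS-THROUGH(16) byte layout from the SAT 
specification (features in bytes 3-4, sector count in bytes 5-6, ATA 
command in byte 14).  The function names are made up for illustration, and 
the LBA bytes are skipped because they are all zero in this CDB.

```python
# CDB copied from the "ATA COMMAND PASS THROUGH(16)" lines in the log.
CDB = "85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00"

ATA_DATA_SET_MANAGEMENT = 0x06  # ATA command code
DSM_FEATURE_TRIM = 0x0001       # TRIM bit in the FEATURES field


def decode_ata_pt16(cdb_hex):
    """Decode selected ATA register fields of an ATA PASS-THROUGH(16) CDB."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,    # 6 means DMA
        "extend": cdb[1] & 0x01,             # 1 means a 48-bit command
        "features": (cdb[3] << 8) | cdb[4],
        "count": (cdb[5] << 8) | cdb[6],     # 512-byte blocks of range entries
        "device": cdb[13],
        "command": cdb[14],
    }


def is_trim(fields):
    """True if the decoded fields describe DATA SET MANAGEMENT with TRIM set."""
    return (fields["command"] == ATA_DATA_SET_MANAGEMENT
            and bool(fields["features"] & DSM_FEATURE_TRIM))


fields = decode_ata_pt16(CDB)
print(fields)
print("TRIM:", is_trim(fields))
```

With the CDB above, the command byte is 0x06 and the TRIM feature bit is 
set, with a count of one 512-byte block, which matches the "length 512" in 
the log lines.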

     Regards
     Steve


On 07/06/2016 20:24, list-news wrote:
> I have additional confirmation that it's not faulty hardware.
>
> I moved the 4 disks that carry the postgresql database over to another 
> server (same model - TWIN 2028-DECR).  Mounted the zpool and fired up 
> my application.
>
> This server is using a much earlier firmware on the SAS controller.  
> Different CPU / Memory / etc.
>
> Errors happen within the first couple minutes, and continue every few 
> minutes (notice time-stamps for each drive timeout every few minutes):
>
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 
> 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 
> 804b scsi 0 state c xfer 0
> Jun  7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 36 ea dc 60 00 00 08 00
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command 
> timeout
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 
> 0e 74 79 e0 00 00 08 00
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI 
> Status Error
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check 
> Condition
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per 
> sense data)
> Jun  7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 
> Aborting command 0xfffffe0000be0140
> Jun  7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 10
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 
> be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
> 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
> 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 
> 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 
> 804b scsi 0 state c xfer 0
> Jun  7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35 
> 26 ca f0 00 00 08 00
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command 
> timeout
> Jun  7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f6 ee f0 00 00 08 00
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status 
> Error
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check 
> Condition
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per 
> sense data)
> Jun  7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 
> Aborting command 0xfffffe0000bfcca0
> Jun  7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 10
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 
> 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 
> 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 
> 804b scsi 0 state c xfer 0
> Jun  7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e 
> d6 de f0 00 00 08 00
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command 
> timeout
> Jun  7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
> code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: 0): mpr0: Retrying command
> Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
> code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB 
> request completed with an error
> Jun  7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), 
> code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): 
> originator(PL), code(0x12), sub_code(0x0440)
> Jun  7 13:13:05 s17 kernel: 10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 
> 8d 99 48 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 
> 6b df b8 00 00 10 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd a8 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d 
> f7 cd b0 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e 
> d6 de f0 00 00 08 00
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request 
> completed with an error
> Jun  7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status 
> Error
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check 
> Condition
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted
> Jun  7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack
> Jun  7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 
> Aborting command 0xfffffe0000c1e960
> Jun  7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 11
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS 
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 
> length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 
> 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 
> bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 
> 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 
> state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 
> 10 00
> Jun  7 13:15:12 s17 kernel: 0
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command 
> timeout
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 
> 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 
> state c xfer (da3:0
> Jun  7 13:15:12 s17 kernel: mpr0:0:    (da3:mpr0:0:11:0): SYNCHRONIZE 
> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 
> terminated ioc 804b scsi 0 sta11:te c xfer 0
> Jun  7 13:15:12 s17 kernel: 0): mpr0: Retrying command
> Jun  7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status 
> Error
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check 
> Condition
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per 
> sense data)
>
> gstat output:
> (I'm guessing I caught this during the da2 error)
>
> #gstat -do
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w d/s kBps   
> ms/d    o/s   ms/o   %busy Name
>    70      0      0      0    0.0      0      0 0.0      0 0    
> 0.0      0    0.0    0.0| da2
>     0      0      0      0    0.0      0      0    0.0      0 0    
> 0.0      0 0.0   0.0| da3
>     0      0      0      0    0.0      0      0    0.0      0 0    
> 0.0      0 0.0   0.0| da10
>     0      0      0      0    0.0      0      0    0.0      0 0    
> 0.0      0 0.0    0.0| da11
>
>
> I then set the tags down to 1 for each device:
>
> #camcontrol tags da2 -N 1
> #camcontrol tags da3 -N 1
> #camcontrol tags da10 -N 1
> #camcontrol tags da11 -N 1
>
> And, no errors for the last hour, system still running at full load.
>
> Everything feels like an NCQ firmware issue.  Intel says the S3610 
> supports NCQ in its SSDs with 32 tags.  But I've reproduced the errors 
> with tags set to 8 plenty of times.
>
> (See NCQ line below.)
>
> # camcontrol identify da2
>
> pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device
> pass2: 1200.000MB/s transfers, Command Queueing Enabled
> protocol              ATA/ATAPI-9 SATA 3.x
> device model          INTEL SSDSC2BX480G4
> firmware revision     G2010150
> serial number         [redacted]
> WWN [redacted]
> cylinders             16383
> heads                 16
> sectors/track         63
> sector size           logical 512, physical 4096, offset 0
> LBA supported         268435455 sectors
> LBA48 supported       937703088 sectors
> PIO supported         PIO4
> DMA supported         WDMA2 UDMA6
> media RPM             non-rotating
>
> Feature                      Support  Enabled   Value Vendor
> read ahead                     yes    yes
> write cache                    yes    yes
> flush cache                    yes    yes
> overlap                        no
> Tagged Command Queuing (TCQ)   no     no
> Native Command Queuing (NCQ)   yes              32 tags
> NCQ Queue Management           no
> NCQ Streaming                  no
> Receive & Send FPDMA Queued    no
> SMART                          yes    yes
> microcode download             yes    yes
> security                       yes    no
> power management               yes    yes
> advanced power management      no     no
> automatic acoustic management  no     no
> media status notification      no     no
> power-up in Standby            no     no
> write-read-verify              no     no
> unload                         yes    yes
> general purpose logging        yes    yes
> free-fall                      no     no
> Data Set Management (DSM/TRIM) yes
> DSM - max 512byte blocks       yes              4
> DSM - deterministic read       yes              zeroed
> Host Protected Area (HPA)      yes      no 937703088/937703088
> HPA - Security                 no
>
> And it doesn't appear I have any way to deactivate it in firmware.  
> Which would be a nice test.  I did attempt this with no luck:
> # camcontrol negotiate da2 -T disable
> (pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
> (pass2:mpr0:0:10:0): tagged queueing: enabled
> camcontrol: XPT_SET_TRANS_SETTINGS CCB failed
>
> -Kyle
>
>
> On 6/7/16 12:09 PM, list-news wrote:
>> The system is a Twin.  In the first post I mentioned this but I 
>> probably wasn't clear.
>>
>> The twin unit is this one:
>> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>>
>> I've used all components from twin node A and B (cpu / memory / 
>> mainboard / controller).  I still get the errors.  The backplane was 
>> the original thought of concern, and that has been RMA'd and replaced 
>> - errors continue.  I've even swapped out power supplies with another 
>> identical unit I have here.
>>
>> In every case the errors continue, until I do this:
>> #camcontrol tags daX -N 1
>> (for each drive in the zpool)
>>
>> Then the errors stop.
>>
>> The system errors every few minutes while my application is running.  
>> Set tags to -N 1, and everything goes quiet.  16 cores at 100% cpu 
>> and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it 
>> finishes a batch, no errors are reported with -N set to 1.  If I set 
>> tags with -N 255 for each device, errors start again within 5 
>> minutes, and continue every 2-5 minutes, until the batch is finished.
>>
>> -Kyle
>>
>>> I would try, if possible, to swap the controller.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Borja.
>>>
>>>
>>
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>
>


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 19:53:27 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4468FB6D89A
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 19:53:27 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 0DC4B167E
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 19:53:26 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10])
 by mail.furymx.com (Postfix) with ESMTP id 8E6561ED4C5
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 14:53:25 -0500 (CDT)
X-Virus-Scanned: amavisd-new at furymx.com
Received: from mail.furymx.com ([10.10.1.10])
 by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024)
 with ESMTP id dQxViKVTZADo for <freebsd-scsi@freebsd.org>;
 Tue,  7 Jun 2016 14:53:24 -0500 (CDT)
Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net
 [98.215.180.176])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: kyle@mindpackstudios.com)
 by mail.furymx.com (Postfix) with ESMTPSA id 56DDA1ED4BD
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 14:53:24 -0500 (CDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk>
From: list-news <list-news@mindpackstudios.com>
Message-ID: <d9fb93a6-d3ad-7009-3301-d6bd29be376b@mindpackstudios.com>
Date: Tue, 7 Jun 2016 14:53:23 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 19:53:27 -0000

I don't believe the mainboard has any SATA ports.  It does have a PCIe 
slot IIRC though, and I may be able to rig something up with another LSI 
adapter I have lying around, if I can get it to fit and find a way to 
power the drives.

Although, this seems unlikely unless you are seeing something I'm not?

With that last test: If it's the SAS controller, 3 different ones 
running two different firmware versions are all causing the issue.  If 
it's the backplane, I have now tested 3 of them as well, two of which I 
can confirm have different revision numbers.

Errors never appear with tags set to 1 for each drive (effectively 
eliminating NCQ, as I understand it).  My rough understanding is that a 
higher tag count allows the SAS adapter to send more commands to the 
drive in parallel, letting the drive decide on command ordering.  If 
that is accurate, and the controller firmware were bad, I assume this 
would be a far more common bug that would have been fixed already.

On the other hand, if it only happens during heavy parallel SYNCHRONIZE 
CACHE commands on certain Intel SSDs, and only on controllers (maybe 
12 Gbps?) that can outrun the drive firmware or cause a race condition 
(my suspicion here), it seems far more likely that this would have gone 
unnoticed by Intel.

-Kyle


On 6/7/16 2:02 PM, Steven Hartland wrote:
> Have you tried direct attaching the drives?
>
> On 07/06/2016 18:09, list-news wrote:
>> The system is a Twin.  In the first post I mentioned this but I 
>> probably wasn't clear.
>>
>> The twin unit is this one:
>> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm
>>
>> I've used all components from twin node A and B (cpu / memory / 
>> mainboard / controller).  I still get the errors.  The backplane was 
>> the original thought of concern, and that has been RMA'd and replaced 
>> - errors continue.  I've even swapped out power supplies with another 
>> identical unit I have here.
>>
>> In every case the errors continue, until I do this:
>> #camcontrol tags daX -N 1
>> (for each drive in the zpool)
>>
>> Then the errors stop.
>>
>> The system errors every few minutes while my application is running.  
>> Set tags to -N 1, and everything goes quiet.  16 cores at 100% cpu 
>> and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it 
>> finishes a batch, no errors are reported with -N set to 1.  If I set 
>> tags with -N 255 for each device, errors start again within 5 
>> minutes, and continue every 2-5 minutes, until the batch is finished.
>>
>> -Kyle
>>
>>> I would try, if possible, to swap the controller.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Borja.
>>>
>>>
>>


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 20:19:30 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 234AAB6E246
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 20:19:30 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id F400B14EA
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 20:19:29 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10])
 by mail.furymx.com (Postfix) with ESMTP id 9578221A1C0
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 15:19:27 -0500 (CDT)
X-Virus-Scanned: amavisd-new at furymx.com
Received: from mail.furymx.com ([10.10.1.10])
 by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024)
 with ESMTP id QWKEPzpXs5o9 for <freebsd-scsi@freebsd.org>;
 Tue,  7 Jun 2016 15:19:26 -0500 (CDT)
Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net
 [98.215.180.176])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: kyle@mindpackstudios.com)
 by mail.furymx.com (Postfix) with ESMTPSA id 4C54D21A1B9
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 15:19:26 -0500 (CDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
 <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
From: list-news <list-news@mindpackstudios.com>
Message-ID: <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
Date: Tue, 7 Jun 2016 15:19:25 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 20:19:30 -0000

Sure Steve:

# cat /boot/loader.conf | grep trim
vfs.zfs.trim.enabled=0

# sysctl vfs.zfs.trim.enabled
vfs.zfs.trim.enabled: 0

# uptime
3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07

# tail -f /var/log/messages
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b 
d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 
0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
connector name (    )
Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
allocated tm 0xfffffe0001322150
Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 
Aborting command 0xfffffe0001375580
Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort 
for target ID 16
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout cm 
0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
connector name (    )
Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 0xfffffe00013627a0 
for processing by tm 0xfffffe0001322150
Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
Jun  7 15:13:50 s18 kernel: EventDataLength: 2
Jun  7 15:13:50 s18 kernel: AckRequired: 0
Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
Jun  7 15:13:50 s18 kernel: EventContext: 0x0
Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi 
0 state c xfer 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi 
0 state c xfer 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 
25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi 
0 state c xfer 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 
25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 
state c xfer 0
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b 
d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 
0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi 
0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 
00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 
0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during 
recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b 
scsi 0 sta(da6:te c xfer 0
Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort 
TaskMID 1016 status 0x0 code 0x0 count 5
Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 finished 
recovery after aborting TaskMID 1016
Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
Jun  7 15:13:50 s18 kernel: EventDataLength: 4
Jun  7 15:13:50 s18 kernel: AckRequired: 0
Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
Jun  7 15:13:50 s18 kernel: EventContext: 0x0
Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
Jun  7 15:13:50 s18 kernel: NumPhys: 31
Jun  7 15:13:50 s18 kernel: NumEntries: 1
Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  
Event: [16]
Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  
Event: [1c]
Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
Jun  7 15:13:50 s18 kernel: EventDataLength: 2
Jun  7 15:13:50 s18 kernel: AckRequired: 0
Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
Jun  7 15:13:50 s18 kernel: EventContext: 0x0
Jun  7 15:13:50 s18 kernel: Flags: 0
Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  
Event: [16]
Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00
Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per 
sense data)
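
For anyone reading the CDBs in the log above by hand, here is a small
sketch of how the 10-byte READ(10) / WRITE(10) / SYNCHRONIZE CACHE(10)
fields break down.  The function name decode_cdb10 is made up for this
example, and the 512-byte block size is an assumption taken from the
logical sector size the drives report.  It only shows that the
"length" in each log line is the transfer length times the block size.

```python
def decode_cdb10(cdb_hex, block_size=512):
    """Decode a 10-byte SCSI CDB given as space-separated hex bytes.

    block_size is an assumption (logical sector size of the drive).
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    # Only the opcodes that show up in this thread's logs.
    names = {
        0x28: "READ(10)",
        0x2A: "WRITE(10)",
        0x35: "SYNCHRONIZE CACHE(10)",
    }
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: number of blocks
    return {
        "op": names.get(cdb[0], hex(cdb[0])),
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


# The WRITE(10) that timed out on da6:
info = decode_cdb10("2a 00 2b d8 86 50 00 00 b0 00")
print(info["op"], hex(info["lba"]), info["blocks"], info["bytes"])
```

For that WRITE(10), 0x00b0 is 176 blocks, and 176 * 512 = 90112, which
matches the "length 90112" in the log line.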

-Kyle

On 6/7/16 2:53 PM, Steven Hartland wrote:
> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO 
> delete but assuming you're running ZFS can you set the following in 
> loader.conf and see how you get on.
> vfs.zfs.trim.enabled=0
>
>     Regards
>     Steve


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 20:44:16 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3902EB6E9DA
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 20:44:16 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x22f.google.com (mail-wm0-x22f.google.com
 [IPv6:2a00:1450:400c:c09::22f])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id C98BD1662
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 20:44:15 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x22f.google.com with SMTP id k204so85149655wmk.0
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 13:44:15 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:from:message-id:date:user-agent:mime-version
 :in-reply-to; bh=VxnzCPJU+EWUN9c/LC81z/fsnyEdDuKlZlKEV9H87Zg=;
 b=KZz621TVswdkOVWViE/i1qs/Bb3CHyCUHhW6VCTBzKpfsS5tT9QUX1eafVthOI4a71
 64zp7IX58Io0I3s4ZrJC3C8kYXejTTzCB7eWlRa/CQZRxQdMOAaEyndohOYdaCcsIjm+
 KqrTeZUB7dZMxtY2VKKSQA95kHVD7lLGN2W1p/M/lL/GMX+XkdTf86nun77rqK97AxmX
 RAlUqeho91AjaX3x907igDP8uXy8zbY3YWfMa86NBa+x8OgjfyUlmfENk28u72x1V45w
 XxI0pkXL+QMXemmVx0vPLH8yzpNZCl1EJpu9I/CwObPEfoYZybFcMM07l2Jo2/kobvE5
 GfZQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=VxnzCPJU+EWUN9c/LC81z/fsnyEdDuKlZlKEV9H87Zg=;
 b=Kuyw0QYuD8M9tFb2lLSLih342MQuLtMroHjMZrn9kPE/xVR4hD+Ex7X0WK9VkPlD8C
 hZJAMwY8nFa4gfUNovRST/zp1xzyhPudqbOzrqK0TbwfZcs+RqdbIHHSIXnFhmm64x+Z
 p2dXTIXwvvtgaVTV/lGdfkD8o0UYfAXVk1fe5MyZYA9zL/1o9SEJ1wtpCP3NHFE0n+N7
 y+FOf0X2MYaE3vV56JfKxrfAUTvWytcssO+yHYU3LCrBPIJe3ejxYUkZ6La8OafSWo+L
 JiEfApJLGaRXNJp5UvB/Z7IvjXquwAKZgX6R0QFlHIjr36R/wcGghk84vaX3q4Y1WtVl
 weuA==
X-Gm-Message-State: ALyK8tKJlhdBYotY5jRiKon90vL385K+9oanXqSlAkFiQAhyL9jnK+oSpnxc027KrIhcVCG7
X-Received: by 10.195.9.97 with SMTP id dr1mr1148827wjd.69.1465332254196;
 Tue, 07 Jun 2016 13:44:14 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171])
 by smtp.gmail.com with ESMTPSA id kd7sm27119494wjc.33.2016.06.07.13.44.11
 for <freebsd-scsi@freebsd.org> (version=TLSv1/SSLv3 cipher=OTHER);
 Tue, 07 Jun 2016 13:44:12 -0700 (PDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk>
 <d9fb93a6-d3ad-7009-3301-d6bd29be376b@mindpackstudios.com>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <782184e7-0e99-63a3-8f40-8d2452d344ac@multiplay.co.uk>
Date: Tue, 7 Jun 2016 21:44:17 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <d9fb93a6-d3ad-7009-3301-d6bd29be376b@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 20:44:16 -0000

On 07/06/2016 20:53, list-news wrote:
> I don't believe the mainboard has any SATA ports.  It does have a PCIe 
> slot IIRC though, and I may be able to rig something up with another 
> LSI adapter I have laying around.  If I can get it to fit and find a 
> way to power the drives.
>
> Although, this seems unlikely unless you are seeing something I'm not?
Nope, but you're assuming that the backplane doesn't have a design 
issue, and unfortunately that's more common than most people know, so my 
process is to always fall back to the lowest common denominator and 
directly attach the disks to the controller.
>
> With that last test: If it's the SAS controller, 3 different ones 
> running two different firmware versions are all causing the issue.  If 
> it's the backplane, I have now tested 3 of them as well, two of which 
> I can confirm have different revision numbers.
>
> Errors never appear with tags set to 1 for each drive (effectively 
> eliminating NCQ as I understand it).  My brief understanding is that a 
> higher tag count allows the SAS adapter to send more commands to the 
> drive in parallel, allowing the drive to make the decisions about 
> command ordering.  If that is accurate, and the controller firmware 
> was bad, I assume this would be a far more common bug that would have 
> been fixed already.
>
> On the other hand, if it only happens during heavy SYNCHRONIZE CACHE 
> commands in parallel on certain Intel SSD's and only on controllers 
> (maybe 12gbps?) that can outrun the drive firmware or cause a race 
> condition (my suspicions here).  It seems far more likely this would 
> have gone unnoticed by Intel.

All possible, but discount the easy things first. If you have access to 
a 2008-based controller, try that; they have always been solid here. I 
have not used the 3008 yet.


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 21:22:42 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4F8A8B6E2BE
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 21:22:42 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x234.google.com (mail-wm0-x234.google.com
 [IPv6:2a00:1450:400c:c09::234])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 1C95017BE
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 21:22:41 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x234.google.com with SMTP id n184so154899384wmn.1
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 14:22:41 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:from:message-id:date:user-agent:mime-version
 :in-reply-to; bh=teMwACyi1sgW4KGvdaiLfbxdDrX5H/SsmN1iAphjsm4=;
 b=LJNiCRfsFHt9J9wg9bjIYML9Ydb4tgFeieyQHD/YWCOQHjKvbjODj7TjnZKga+GIjo
 5SBd27Sg49uGXYYxRo9R5Vcn7g5BA+TvrZhYmzXb/w4phGEXWsG0RbfWNFcNMwQ+OgNk
 7MnqOosy4H8lj9gmePO/XslE2GKw7TRazhHs24isbkgBquTYERmxx7/SDUxZE/rjtpDv
 pLBLWvpoUStKfOZJ33LsOBLlVk/js2i3Z55tNUa5IRAw0SqiDldG7wpvQp5T++8KxNsv
 kcFZ3PnSNdI5acLAiMptz0Nz+K5ttIGh1KN53NvUlwV9T2MxNXSyTUeU4jnHlBfOb0tU
 v9OA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=teMwACyi1sgW4KGvdaiLfbxdDrX5H/SsmN1iAphjsm4=;
 b=YFZg8j1wqlu01uVHE4FuCsKDhHFuQCItBXz9h6Uw3FzJycDz1FHsj9//ZkFf1Ng8CY
 cXTmMRrxJh9HYa+2AIGNd7X22Dsz6lmDWyUPb8wbTbZCR6z+FcB/75B6Vd5n94mYg6/a
 LTKbZPjjP6/yO8Z40ufT0gIZ6FbaDuHRH0oIJTd6KpYNt8JuEtbqJ9U9N7gw3nExiClU
 C8ciVFYjHTsl6SnKprhOv6CgikXwxDlPXWM6Ba0qL6c1KBGPlNJRduZ8OLn03BTeJVlf
 Eo2C0iBPxsdD1wbGcGAGCokIHmSUwnzRakbL2BNhMCoJ/XN4Lcr5t4tVAS8nBCSCzXhX
 q3BQ==
X-Gm-Message-State: ALyK8tKOQDmSWikzyqk54oyJBbygwwmTPt7wZdZftHdjd238RF1smiQtw8wgTplk35GoQNni
X-Received: by 10.195.9.97 with SMTP id dr1mr1255631wjd.69.1465334553598;
 Tue, 07 Jun 2016 14:22:33 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171])
 by smtp.gmail.com with ESMTPSA id o4sm27258721wjx.45.2016.06.07.14.22.31
 for <freebsd-scsi@freebsd.org> (version=TLSv1/SSLv3 cipher=OTHER);
 Tue, 07 Jun 2016 14:22:31 -0700 (PDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
 <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
 <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
Date: Tue, 7 Jun 2016 22:22:37 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 21:22:42 -0000

Always da6?
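
A quick way to answer that over a longer log is to tally the timeouts
per device.  The sketch below is only an illustration: it assumes the
entries are read one per line (as in /var/log/messages, not wrapped as
in these mails) and that the driver prints the lowercase text "command
timeout" as in the excerpts in this thread.  The regex and the function
name count_timeouts are made up for the example.

```python
import re
from collections import Counter

# Matches e.g. "(da6:mpr0:0:16:0): WRITE(10). ... command timeout cm ..."
# The lowercase "command timeout" is the mpr driver message; the CAM
# "Command timeout" status line is deliberately not counted twice.
TIMEOUT_RE = re.compile(r"\((da\d+):mpr\d+:[^)]*\):.*?command timeout")


def count_timeouts(lines):
    """Return a Counter of timed-out commands keyed by device name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example use against a log file (path is an assumption):
#   with open("/var/log/messages") as f:
#       for dev, n in count_timeouts(f).most_common():
#           print(dev, n)
```

If one device dominates the tally, that points at the drive or its
slot; an even spread across devices points back at something shared.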

On 07/06/2016 21:19, list-news wrote:
> Sure Steve:
>
> # cat /boot/loader.conf | grep trim
> vfs.zfs.trim.enabled=0
>
> # sysctl vfs.zfs.trim.enabled
> vfs.zfs.trim.enabled: 0
>
> # uptime
> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>
> # tail -f /var/log/messages:
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 
> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
> connector name (    )
> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
> allocated tm 0xfffffe0001322150
> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 
> Aborting command 0xfffffe0001375580
> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort 
> for target ID 16
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout 
> cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
> connector name (    )
> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 
> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
> Jun  7 15:13:50 s18 kernel: AckRequired: 0
> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
> 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b 
> scsi 0 state c xfer 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
> 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
> 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b 
> scsi 0 state c xfer 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 
> 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 
> 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b 
> scsi 0 state c xfer 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 
> 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 
> state c xfer 0
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 
> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 
> scsi 0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 
> 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 
> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during 
> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 
> 804b scsi 0 sta(da6:te c xfer 0
> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort 
> TaskMID 1016 status 0x0 code 0x0 count 5
> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 finished 
> recovery after aborting TaskMID 1016
> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
> Jun  7 15:13:50 s18 kernel: AckRequired: 0
> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
> Jun  7 15:13:50 s18 kernel: NumPhys: 31
> Jun  7 15:13:50 s18 kernel: NumEntries: 1
> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  
> Event: [16]
> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  
> Event: [1c]
> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
> Jun  7 15:13:50 s18 kernel: AckRequired: 0
> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
> Jun  7 15:13:50 s18 kernel: Flags: 0
> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  
> Event: [16]
> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00
> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status 
> Error
> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check 
> Condition
> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per 
> sense data)
>
> -Kyle
>
> On 6/7/16 2:53 PM, Steven Hartland wrote:
>> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO 
>> delete but assuming your running ZFS can you set the following in 
>> loader.conf and see how you get on.
>> vfs.zfs.trim.enabled=0
>>
>>     Regards
>>     Steve
>
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 22:43:21 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 27494B6D741
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 22:43:21 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id E938C1BB9
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 22:43:19 +0000 (UTC)
 (envelope-from list-news@mindpackstudios.com)
Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10])
 by mail.furymx.com (Postfix) with ESMTP id 014061ED6B9
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 17:43:13 -0500 (CDT)
X-Virus-Scanned: amavisd-new at furymx.com
Received: from mail.furymx.com ([10.10.1.10])
 by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024)
 with ESMTP id 6iyAs-CUw2iI for <freebsd-scsi@freebsd.org>;
 Tue,  7 Jun 2016 17:43:11 -0500 (CDT)
Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net
 [98.215.180.176])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (No client certificate requested)
 (Authenticated sender: kyle@mindpackstudios.com)
 by mail.furymx.com (Postfix) with ESMTPSA id 2C2091ED6AE
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 17:43:11 -0500 (CDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
 <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
 <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
 <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
From: list-news <list-news@mindpackstudios.com>
Message-ID: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
Date: Tue, 7 Jun 2016 17:43:10 -0500
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0)
 Gecko/20100101 Thunderbird/45.1.1
MIME-Version: 1.0
In-Reply-To: <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 22:43:21 -0000

No, it threw errors on both da6 and da7, and then I stopped it.

Your last e-mail got me thinking, though.  I have a server with 2008 
controllers (entirely different backplane design, CPU, memory, etc.).  
I've moved the 4 drives to that server and I'm running the test now.

# uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
# sysctl dev.mps.0
dev.mps.0.spinup_wait_time: 3
dev.mps.0.chain_alloc_fail: 0
dev.mps.0.enable_ssu: 1
dev.mps.0.max_chains: 2048
dev.mps.0.chain_free_lowwater: 1176
dev.mps.0.chain_free: 2048
dev.mps.0.io_cmds_highwater: 510
dev.mps.0.io_cmds_active: 0
dev.mps.0.driver_version: 20.00.00.00-fbsd
dev.mps.0.firmware_version: 17.00.01.00
dev.mps.0.disable_msi: 0
dev.mps.0.disable_msix: 0
dev.mps.0.debug_level: 3
dev.mps.0.%parent: pci5
dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
subdevice=0x3020 class=0x010700
dev.mps.0.%location: slot=0 function=0
dev.mps.0.%driver: mps
dev.mps.0.%desc: Avago Technologies (LSI) SAS2008

About 1.5 hours have passed at full load with no errors.

gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  
Overall throughput seems to be 20-30% less with my rough benchmarks.

I'm not sure whether this gets us closer to the answer.  If it doesn't 
time out on the 2008 controller, it looks like one of these:
1) The Intel drive firmware is somehow being overloaded when connected 
to the 3008.
or
2) The 3008 firmware or driver has an issue reading drive responses, 
sporadically thinking a command timed out (when maybe it really didn't).

Puzzle pieces:
A) Why does setting tags to 1 on drives connected to the 3008 fix the 
problem?
B) With tags at 255, why does postgres (presumably because of its large 
fsync count) seem to cause the problem within minutes, while other 
heavy I/O workloads (zpool scrub, bonnie++, fio), all of which show 
similarly high or higher IOPS, take hours to cause the problem (if ever)?
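
For anyone wanting to repeat the tags comparison in (A) and (B), here is a minimal sketch. It assumes the queue depth is changed with the tags subcommand of camcontrol(8); the helper name set_tags and the DRYRUN variable are hypothetical, invented here for illustration.

```sh
# Set the number of tagged openings (queue depth) on each listed device.
# set_tags and DRYRUN are illustrative names, not part of any tool.
# Usage: set_tags DEPTH DEVICE...
set_tags() {
    depth=$1
    shift
    for dev in "$@"; do
        # With DRYRUN set to a non-empty value, print the command
        # instead of running it.
        ${DRYRUN:+echo} camcontrol tags "$dev" -N "$depth"
    done
}
```

For example, DRYRUN=1 set_tags 1 da6 da7 only prints the two camcontrol commands, while set_tags 255 da6 da7 (as root) would apply the setting. camcontrol tags da6 -v shows the values currently in effect.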

I'll let this continue to run to further test.

Thanks again for all the help.

-Kyle

On 6/7/16 4:22 PM, Steven Hartland wrote:
> Always da6?
>
> On 07/06/2016 21:19, list-news wrote:
>> Sure Steve:
>>
>> # cat /boot/loader.conf | grep trim
>> vfs.zfs.trim.enabled=0
>>
>> # sysctl vfs.zfs.trim.enabled
>> vfs.zfs.trim.enabled: 0
>>
>> # uptime
>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>
>> # tail -f /var/log/messages:
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 
>> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>> connector name (    )
>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
>> allocated tm 0xfffffe0001322150
>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 
>> Aborting command 0xfffffe0001375580
>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from 
>> mprsas_send_abort for target ID 16
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout 
>> cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>> connector name (    )
>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 
>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b 
>> scsi 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 
>> 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b 
>> scsi 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 
>> 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b 
>> scsi 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 
>> 0 state c xfer 0
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 
>> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 
>> scsi 0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 
>> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 
>> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during 
>> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 
>> 804b scsi 0 sta(da6:te c xfer 0
>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort 
>> TaskMID 1016 status 0x0 code 0x0 count 5
>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 
>> finished recovery after aborting TaskMID 1016
>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  
>> Event: [16]
>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  
>> Event: [1c]
>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>> Jun  7 15:13:50 s18 kernel: Flags: 0
>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  
>> Event: [16]
>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI 
>> Status Error
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check 
>> Condition
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per 
>> sense data)
>>
>> -Kyle
>>
>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO 
>>> delete but assuming your running ZFS can you set the following in 
>>> loader.conf and see how you get on.
>>> vfs.zfs.trim.enabled=0
>>>
>>>     Regards
>>>     Steve
>>
>>


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 23:28:37 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0E6B2B6DEE3
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 23:28:37 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x231.google.com (mail-wm0-x231.google.com
 [IPv6:2a00:1450:400c:c09::231])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 9AEDF1C3F
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 23:28:36 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x231.google.com with SMTP id k204so89100473wmk.0
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 16:28:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:from:message-id:date:user-agent:mime-version
 :in-reply-to; bh=tnFN9BhE4FbQKpErinXYONtIG6CCIoCOSzMUFwLt9Ws=;
 b=YWoqU0qk7MNPVPY1UvaNEC8GBhkfgcOLWxp4OHHqyGspro+A8cvJAeEIGo4aTJkyjR
 OdX2Q4/FwoUzn5cycizRKNQZn6qQ/+pltWK3nrXXjxWjWgmwrLNs/sfyrryxkWkrjU8A
 RnOoGY78B3mVOaa2/a5VhFe1qz5UHq2AdSxSHqQFH18yHNsWisKoLQBdPwyNm7tAlpjI
 w6Nm2MXaNukfUPZFdKKbzz7z6lRLRHZvrwBOJlllSM24nkfhx9ncgU2ZsRbgaewe0W3g
 4AOQTfoYU38rMlayNt81xq8tcGbuS1lXkzSUDpxbWnQu70UBLm61cc4ErWRdPAgVGyk2
 KPBg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=tnFN9BhE4FbQKpErinXYONtIG6CCIoCOSzMUFwLt9Ws=;
 b=V1Ot6IpBzDFJsqo8mze4QzwiGb3i7JEVFYtOWORPXRafWBSYZ8IzqFVyyy7nhne35E
 w6pKzmUG0YY+DmuZu8uYPYrTjHdB9Hs/krI1X4ivqAcpitNB7//DWoAT2a7/m6ghvm31
 F7cG0HdCBJJwr5Q18sNUjOp0MKdHHS8grvgaboFS1FKsp+EIWaaQcnl+rj9jv0CO2wV4
 ZO44BQ+PiIEI09qP43hRBYdlYvF0OLJNeDU5QbuQMTtDmI+NXqo03yzVICVy1mYdccjP
 RfRhX/1EycrRifmo4rMh2zmzUv8MOMsRHnCPrjlEYSh5vcbXkh0OJgGtpRAcSlmqOSep
 MiYQ==
X-Gm-Message-State: ALyK8tKo9+yR8ga10ypevPmevWO2irjWT0B2YeulMdLPxQCONIV7gJ17K63FY4Jmz8dJggjv
X-Received: by 10.28.132.144 with SMTP id g138mr4836615wmd.47.1465342113847;
 Tue, 07 Jun 2016 16:28:33 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171])
 by smtp.gmail.com with ESMTPSA id d195sm21730589wmd.12.2016.06.07.16.28.32
 for <freebsd-scsi@freebsd.org> (version=TLSv1/SSLv3 cipher=OTHER);
 Tue, 07 Jun 2016 16:28:32 -0700 (PDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
 <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
 <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
 <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
 <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <73dd23bd-7989-6dde-f3ff-e6e51610390a@multiplay.co.uk>
Date: Wed, 8 Jun 2016 00:28:38 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 23:28:37 -0000

If that works, I'd move the 3008 into the machine that currently has 
the 2008 and retest.  That will help confirm whether or not the 3008 
card and driver are a potential issue.

On 07/06/2016 23:43, list-news wrote:
> No, it threw errors on both da6 and da7 and then I stopped it.
>
> Your last e-mail gave me thoughts though.  I have a server with 2008 
> controllers (entirely different backplane design, cpu, memory, etc).  
> I've moved the 4 drives to that and I'm running the test now.
>
> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
> # sysctl dev.mps.0
> dev.mps.0.spinup_wait_time: 3
> dev.mps.0.chain_alloc_fail: 0
> dev.mps.0.enable_ssu: 1
> dev.mps.0.max_chains: 2048
> dev.mps.0.chain_free_lowwater: 1176
> dev.mps.0.chain_free: 2048
> dev.mps.0.io_cmds_highwater: 510
> dev.mps.0.io_cmds_active: 0
> dev.mps.0.driver_version: 20.00.00.00-fbsd
> dev.mps.0.firmware_version: 17.00.01.00
> dev.mps.0.disable_msi: 0
> dev.mps.0.disable_msix: 0
> dev.mps.0.debug_level: 3
> dev.mps.0.%parent: pci5
> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
> subdevice=0x3020 class=0x010700
> dev.mps.0.%location: slot=0 function=0
> dev.mps.0.%driver: mps
> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>
> About 1.5 hours has passed at full load, no errors.
>
> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  
> Overall throughput seems to be 20-30% less with my rough benchmarks.
>
> I'm not sure if this gets us closer to the answer, if this doesn't 
> time-out on the 2008 controller, it looks like one of these:
> 1) The Intel drive firmware is being overloaded somehow when connected 
> to the 3008.
> or
> 2) The 3008 firmware or driver has an issue reading drive responses, 
> sporadically thinking the command timed-out (when maybe it really 
> didn't).
>
> Puzzle pieces:
> A) Why does setting tags of 1 on drives connected to the 3008 fix the 
> problem?
> B) With tags of 255.  Why does postgres (and assuming a large fsync 
> count), seem to cause the problem within minutes?  While running other 
> heavy i/o commands (zpool scrub, bonnie++, fio), all of which show 
> similarly high or higher iops take hours to cause the problem (if ever).
>
> I'll let this continue to run to further test.
>
> Thanks again for all the help.


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 23:30:17 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id DEAFFB6DF5E
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 23:30:17 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x22c.google.com (mail-wm0-x22c.google.com
 [IPv6:2a00:1450:400c:c09::22c])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 60CC11CA2
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 23:30:17 +0000 (UTC)
 (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x22c.google.com with SMTP id v199so40150582wmv.0
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 16:30:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623;
 h=subject:to:references:from:message-id:date:user-agent:mime-version
 :in-reply-to; bh=1TfW7jFNyOrL2O7rvTiGr+aqoGaypK6Tz4q1/+c1r2w=;
 b=mwFWGq6LF1yQl3jaiYSEVO3msN5OeZAZTkhrRiDY9ufKoJY0Y2GgSUXg6xliw5xg0N
 5413v2GmCjt8+wUMgBWyy3+WhMIpH/05zVo80z8OmpzDiclm16IpO9sYVspvj5JrActo
 K/HdTE6uwtfX+FMOekoUJXXe8QMfP1m8vLylbtClaaIyshtUKBmPv5kKLi7Z78Uzqmti
 orHdB030+tkQG0sruvd3gjcBFX5g62LOenMBtpyRDpqM8HilHHAHkBo9c8i3G6uv/Iqu
 jqCiX/cQfyWbQxI2ECj7bDaQ2xRQjdXv+1VFyE02OuNkCmd9C2hOhzVpGRkFQ1ABLDoo
 cFaw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:subject:to:references:from:message-id:date
 :user-agent:mime-version:in-reply-to;
 bh=1TfW7jFNyOrL2O7rvTiGr+aqoGaypK6Tz4q1/+c1r2w=;
 b=HZtbFtC0yYzgCvD/Heu8TylN0EV8fpRG+NPBDpPx7J1Gluq+C6XRpvswpCxIRYNIiD
 X9AwvhmjogVAWURUlj4xEWTRaYWBN4vhQcEmG0jekPRExTfHMJ5XFNHAcCYunNhqm6Fd
 4dXlNwI8sQqWUQBPQKpzB5VBJB4/yvsav2FPpnbcm/fVULuerkWFbNghwChlLkLR03aE
 v+ax9ornPICEFNH0PQ9S0XtZtD+7DWo/tGmgUx6kifbYw5DPexe/eOYORygbjwWvF4qh
 H5Bj4Tk4g5P2XlTm1MHcuTR6Xm1CdsKq6tSaQqkqYHZ/YyMJg3YsGY4fuo+thHzo7YJ9
 isCg==
X-Gm-Message-State: ALyK8tLAsBGlvMGzqvBbrEXdUjCU5S+Pfe+V9HaWQlS6PzbXcQ1o3c4oWAMXOhjcLHCo2xig
X-Received: by 10.194.123.9 with SMTP id lw9mr1713992wjb.53.1465342215759;
 Tue, 07 Jun 2016 16:30:15 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171])
 by smtp.gmail.com with ESMTPSA id o76sm21768597wme.0.2016.06.07.16.30.14
 for <freebsd-scsi@freebsd.org> (version=TLSv1/SSLv3 cipher=OTHER);
 Tue, 07 Jun 2016 16:30:14 -0700 (PDT)
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
 <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
 <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
 <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
 <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk>
Date: Wed, 8 Jun 2016 00:30:19 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101
 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 23:30:18 -0000

Oh, another thing to test: IIRC the 3008 is supported by mrsas, so you 
might want to try adding the following to loader.conf to switch drivers:
hw.mfi.mrsas_enable="1"

On 07/06/2016 23:43, list-news wrote:
> No, it threw errors on both da6 and da7 and then I stopped it.
>
> Your last e-mail gave me thoughts though.  I have a server with 2008 
> controllers (entirely different backplane design, cpu, memory, etc).  
> I've moved the 4 drives to that and I'm running the test now.
>
> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
> # sysctl dev.mps.0
> dev.mps.0.spinup_wait_time: 3
> dev.mps.0.chain_alloc_fail: 0
> dev.mps.0.enable_ssu: 1
> dev.mps.0.max_chains: 2048
> dev.mps.0.chain_free_lowwater: 1176
> dev.mps.0.chain_free: 2048
> dev.mps.0.io_cmds_highwater: 510
> dev.mps.0.io_cmds_active: 0
> dev.mps.0.driver_version: 20.00.00.00-fbsd
> dev.mps.0.firmware_version: 17.00.01.00
> dev.mps.0.disable_msi: 0
> dev.mps.0.disable_msix: 0
> dev.mps.0.debug_level: 3
> dev.mps.0.%parent: pci5
> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 
> subdevice=0x3020 class=0x010700
> dev.mps.0.%location: slot=0 function=0
> dev.mps.0.%driver: mps
> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>
> About 1.5 hours has passed at full load, no errors.
>
> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  
> Overall throughput seems to be 20-30% less with my rough benchmarks.
>
> I'm not sure if this gets us closer to the answer, if this doesn't 
> time-out on the 2008 controller, it looks like one of these:
> 1) The Intel drive firmware is being overloaded somehow when connected 
> to the 3008.
> or
> 2) The 3008 firmware or driver has an issue reading drive responses, 
> sporadically thinking the command timed-out (when maybe it really 
> didn't).
>
> Puzzle pieces:
> A) Why does setting tags of 1 on drives connected to the 3008 fix the 
> problem?
> B) With tags of 255.  Why does postgres (and assuming a large fsync 
> count), seem to cause the problem within minutes?  While running other 
> heavy i/o commands (zpool scrub, bonnie++, fio), all of which show 
> similarly high or higher iops take hours to cause the problem (if ever).
>
> I'll let this continue to run to further test.
>
> Thanks again for all the help.
>
> -Kyle
>
> On 6/7/16 4:22 PM, Steven Hartland wrote:
>> Always da6?
>>
>> On 07/06/2016 21:19, list-news wrote:
>>> Sure Steve:
>>>
>>> # cat /boot/loader.conf | grep trim
>>> vfs.zfs.trim.enabled=0
>>>
>>> # sysctl vfs.zfs.trim.enabled
>>> vfs.zfs.trim.enabled: 0
>>>
>>> # uptime
>>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>
>>> # tail -f /var/log/messages:
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 
>>> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>>> connector name (    )
>>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 
>>> allocated tm 0xfffffe0001322150
>>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 
>>> Aborting command 0xfffffe0001375580
>>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from 
>>> mprsas_send_abort for target ID 16
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 
>>> command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 
>>> 16, handle(0x0010)
>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, 
>>> connector name (    )
>>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm 
>>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 
>>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 
>>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 
>>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 
>>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b 
>>> scsi 0 state c xfer 0
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 
>>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 
>>> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 
>>> scsi 0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 
>>> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 
>>> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 
>>> b0 00
>>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during 
>>> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 
>>> terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort 
>>> TaskMID 1016 status 0x0 code 0x0 count 5
>>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 
>>> finished recovery after aborting TaskMID 1016
>>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  
>>> Event: [16]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: 
>>> [16]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  
>>> Event: [1c]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: 
>>> [1c]
>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>> Jun  7 15:13:50 s18 kernel: Flags: 0
>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  
>>> Event: [16]
>>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: 
>>> [16]
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE 
>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI 
>>> Status Error
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check 
>>> Condition
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT 
>>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per 
>>> sense data)
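
For anyone reading the log above: the CDB bytes can be decoded by hand to
check that they agree with the "length" field the driver prints. Below is
a minimal Python sketch, assuming 512-byte logical blocks (an assumption,
the log does not state the block size) and the standard 10-byte
READ(10)/WRITE(10) layout. The function name decode_rw10 is made up for
illustration.

```python
def decode_rw10(cdb_hex, block_size=512):
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    block_size is an assumption (512 bytes), not something the log states.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("not a READ(10)/WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return {
        "op": "READ(10)" if cdb[0] == 0x28 else "WRITE(10)",
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


# CDBs copied from the log lines above.
write = decode_rw10("2a 00 2b d8 86 50 00 00 b0 00")
read = decode_rw10("28 00 0b 43 a8 00 00 00 10 00")
print(write)
print(read)
```

With 512-byte blocks this gives 176 blocks (90112 bytes) for the WRITE(10)
and 16 blocks (8192 bytes) for the READ(10), which matches the lengths
printed in the log.
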
>>>
>>> -Kyle
>>>
>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>> CDB: 85 is a TRIM command IIRC. I know you tried it before using
>>>> BIO delete, but assuming you're running ZFS, can you set the
>>>> following in loader.conf and see how you get on?
>>>> vfs.zfs.trim.enabled=0
>>>>
>>>>     Regards
>>>>     Steve
>>>
>>>
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"


From owner-freebsd-scsi@freebsd.org  Tue Jun  7 23:39:45 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id CB703B6E239
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Tue,  7 Jun 2016 23:39:45 +0000 (UTC)
 (envelope-from david@gwynne.id.au)
Received: from mail-pf0-x22f.google.com (mail-pf0-x22f.google.com
 [IPv6:2607:f8b0:400e:c00::22f])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id A102C11E5
 for <freebsd-scsi@freebsd.org>; Tue,  7 Jun 2016 23:39:45 +0000 (UTC)
 (envelope-from david@gwynne.id.au)
Received: by mail-pf0-x22f.google.com with SMTP id 62so82376782pfd.1
 for <freebsd-scsi@freebsd.org>; Tue, 07 Jun 2016 16:39:45 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=gwynne-id-au.20150623.gappssmtp.com; s=20150623;
 h=mime-version:subject:from:in-reply-to:date:cc
 :content-transfer-encoding:message-id:references:to;
 bh=rk2IlZGiblIVZdL+OB/GGS15X2/fOyMyQ88p45n9X1g=;
 b=BjDuoJ+r+OGah2hKqNbwxOCS0FKsHmfHKrNFC7E+fJpxPkLljZmNZul2ADXGGPdazK
 tPGrnDQ6oP4DZmx0so3Sm0F5u3FVMganJX22vV95lkORgn25SFmaRzLdQyrAOoFhRL4o
 AEZSEahq2GdzmatQkowHa1DV6uWwLIoXLvH4lj4eaLMnZRWm30Y1D/Jos2r00k8VN6XV
 mLHV+vHgx4xKNy1mSpolCkECzxdLU35UxwFJCHR3YgD1XK68bBBd0VXb8juWt71xAcxh
 kG5SqCAK81GSoOLrlQzEMQOBhL7Ki4Ad6vxv0kBdfQrJwOVeTiScrJ752zw8hbkO0cOe
 CLBw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc
 :content-transfer-encoding:message-id:references:to;
 bh=rk2IlZGiblIVZdL+OB/GGS15X2/fOyMyQ88p45n9X1g=;
 b=fUfxmwZ05Aple/EgmHkxDAJgC0Sx3GZDkgGhimHmKFQHD9Bqbxf4NRHfexMulRt8Fl
 OZuDx/+DPKqBiQcYxiPpHm64zzvkbGJKeUt787zxPuuYcCv83fz5jnW0BSyAcg6s/e06
 h085kK7sXqkXIDD4HY/YlWLshhqtPOmrvoMCmIiviOEbJEqLckvJxuj5YncN2DZF2rsT
 R1ypIYupii7QT8cnINL3L7wT9YKk+3icVsOCEfTpkOlQJe141PDE/CV7DjCmGtPWK5bG
 /fgsj7dZIHKFh5vej1A0LNBQl9vkKPo7jQFnLBklh44KnkYYMLHx51r6SYf4I+nRcfs7
 v+lA==
X-Gm-Message-State: ALyK8tLJDWZzlkEA/pxwzsdASEJC+Uc4Eavflo46O8HnS2ZdeJgXf/A63dAdleX5bcnwuw==
X-Received: by 10.98.58.77 with SMTP id h74mr2154346pfa.156.1465342784506;
 Tue, 07 Jun 2016 16:39:44 -0700 (PDT)
Received: from opiate.eait.uq.edu.au (a82-177.nat.uq.edu.au. [130.102.82.177])
 by smtp.gmail.com with ESMTPSA id
 129sm11832387pfe.3.2016.06.07.16.39.41
 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128);
 Tue, 07 Jun 2016 16:39:43 -0700 (PDT)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
From: David Gwynne <david@gwynne.id.au>
In-Reply-To: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk>
Date: Wed, 8 Jun 2016 09:39:38 +1000
Cc: freebsd-scsi@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <4C234AB2-80E5-49A3-B5BB-24F425AFF067@gwynne.id.au>
References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com>
 <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk>
 <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es>
 <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com>
 <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es>
 <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
 <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com>
 <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk>
 <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com>
 <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk>
 <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com>
 <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk>
To: Steven Hartland <killing@multiplay.co.uk>
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 07 Jun 2016 23:39:45 -0000


> On 8 Jun 2016, at 09:30, Steven Hartland <killing@multiplay.co.uk> wrote:
>
> Oh another thing to test is iirc 3008 is supported by mrsas so you
> might want to try adding the following into loader.conf to switch
> drivers:
> hw.mfi.mrsas_enable="1"

i believe the 3008s can run two different firmwares, one that provides
the mpt2 interface and the other that provides the megaraid sas fusion
interface. you have to flash them to switch though, you can't just point
a driver at it and hope for the best.

each fw presents different pci ids. eg, in
http://pciids.sourceforge.net/v2.2/pci.ids you can see:

	005f  MegaRAID SAS-3 3008 [Fury]
	0097  SAS3008 PCI-Express Fusion-MPT SAS-3
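
As a rough illustration of the point above (not something taken from the
pci.ids page itself), here is a small Python sketch that takes a %pnpinfo
string like the one sysctl prints and reports which of the two 3008
entries it matches. The table holds only the two IDs listed above; the
helper names and the short 0x0097 sample string are made up for the
example.

```python
# The two 3008 device IDs quoted above from pci.ids (vendor 0x1000).
# Any other device ID is simply reported as "not listed" (None).
KNOWN_3008_IDS = {
    0x005F: "MegaRAID SAS-3 3008 [Fury]",
    0x0097: "SAS3008 PCI-Express Fusion-MPT SAS-3",
}


def parse_pnpinfo(pnpinfo):
    """Turn 'vendor=0x1000 device=0x0072 ...' into a dict of integers."""
    fields = {}
    for item in pnpinfo.split():
        key, sep, value = item.partition("=")
        if not sep:
            continue  # skip anything that is not key=value
        fields[key] = int(value, 16)
    return fields


def firmware_personality(pnpinfo):
    """Return the pci.ids name for a listed 3008 ID, or None otherwise."""
    info = parse_pnpinfo(pnpinfo)
    if info.get("vendor") != 0x1000:
        return None
    return KNOWN_3008_IDS.get(info.get("device"))


# Hypothetical pnpinfo for a 3008 running the Fusion-MPT firmware.
print(firmware_personality("vendor=0x1000 device=0x0097"))
# pnpinfo of the SAS2008 box from this thread: not a 3008, so not listed.
print(firmware_personality(
    "vendor=0x1000 device=0x0072 subvendor=0x1000 "
    "subdevice=0x3020 class=0x010700"))
```

The first call prints the Fusion-MPT name and the second prints None,
since 0x0072 is the SAS2008 and is not in the table.
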

dlg

>
> On 07/06/2016 23:43, list-news wrote:
>> No, it threw errors on both da6 and da7 and then I stopped it.
>>
>> Your last e-mail gave me thoughts though.  I have a server with 2008
>> controllers (entirely different backplane design, cpu, memory, etc).
>> I've moved the 4 drives to that and I'm running the test now.
>>
>> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
>> # sysctl dev.mps.0
>> dev.mps.0.spinup_wait_time: 3
>> dev.mps.0.chain_alloc_fail: 0
>> dev.mps.0.enable_ssu: 1
>> dev.mps.0.max_chains: 2048
>> dev.mps.0.chain_free_lowwater: 1176
>> dev.mps.0.chain_free: 2048
>> dev.mps.0.io_cmds_highwater: 510
>> dev.mps.0.io_cmds_active: 0
>> dev.mps.0.driver_version: 20.00.00.00-fbsd
>> dev.mps.0.firmware_version: 17.00.01.00
>> dev.mps.0.disable_msi: 0
>> dev.mps.0.disable_msix: 0
>> dev.mps.0.debug_level: 3
>> dev.mps.0.%parent: pci5
>> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000
>> subdevice=0x3020 class=0x010700
>> dev.mps.0.%location: slot=0 function=0
>> dev.mps.0.%driver: mps
>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>>
>> About 1.5 hours has passed at full load, no errors.
>>
>> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.
>> Overall throughput seems to be 20-30% less with my rough benchmarks.
>>
>> I'm not sure if this gets us closer to the answer. If this doesn't
>> time out on the 2008 controller, it looks like one of these:
>> 1) The Intel drive firmware is being overloaded somehow when
>> connected to the 3008.
>> or
>> 2) The 3008 firmware or driver has an issue reading drive responses,
>> sporadically thinking the command timed out (when maybe it really
>> didn't).
>>
>> Puzzle pieces:
>> A) Why does setting tags of 1 on drives connected to the 3008 fix the
>> problem?
>> B) With tags of 255, why does postgres (presumably with a large fsync
>> count) seem to cause the problem within minutes, while other heavy
>> I/O commands (zpool scrub, bonnie++, fio), all of which show similarly
>> high or higher IOPS, take hours to cause the problem (if ever)?
>>
>> I'll let this continue to run to further test.
>>
>> Thanks again for all the help.
>>
>> -Kyle
>>
>> On 6/7/16 4:22 PM, Steven Hartland wrote:
>>> Always da6?
>>>
>>> On 07/06/2016 21:19, list-news wrote:
>>>> Sure Steve:
>>>>
>>>> # cat /boot/loader.conf | grep trim
>>>> vfs.zfs.trim.enabled=0
>>>>
>>>> # sysctl vfs.zfs.trim.enabled
>>>> vfs.zfs.trim.enabled: 0
>>>>
>>>> # uptime
>>>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>>
>>>> # tail -f /var/log/messages:
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a =
00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm =
0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, =
connector name (    )
>>>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 =
allocated tm 0xfffffe0001322150
>>>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 =
Aborting command 0xfffffe0001375580
>>>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from =
mprsas_send_abort for target ID 16
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE =
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command =
timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, =
handle(0x0010)
>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, =
connector name (    )
>>>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm =
0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm =
0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi =
0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 =
state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm =
0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi =
0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 =
state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm =
0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi =
0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 =
state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a =
00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm =
0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi =
0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 =
00 00 00 00 00 00 length 0 SMID 786 completed timedout cm =
0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 =
during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command =
timeout
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE =
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 =
terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 =
abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 =
finished recovery after aborting TaskMID 1016
>>>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>>>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>>>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>>>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  =
Event: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: =
[16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  =
Event: [1c]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: =
[1c]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: Flags: 0
>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  =
Event: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: =
[16]
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE =
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI =
Status Error
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check =
Condition
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT =
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command =
(per sense data)
>>>>
>>>> -Kyle
>>>>
>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>> CDB: 85 is a TRIM command IIRC. I know you tried it before using
>>>>> BIO delete, but assuming you're running ZFS, can you set the
>>>>> following in loader.conf and see how you get on?
>>>>> vfs.zfs.trim.enabled=0
>>>>>
>>>>>    Regards
>>>>>    Steve


From owner-freebsd-scsi@freebsd.org  Fri Jun 10 09:33:24 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0D8D7B703B6
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Fri, 10 Jun 2016 09:33:24 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id F21BC1A7B
 for <freebsd-scsi@FreeBSD.org>; Fri, 10 Jun 2016 09:33:23 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5A9XNsJ066035
 for <freebsd-scsi@FreeBSD.org>; Fri, 10 Jun 2016 09:33:23 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-scsi@FreeBSD.org
Subject: [Bug 202625] [cam][libcam][patch] PERSISTENT RESERVE OUT needs
 scsi_cmd->length to be populated
Date: Fri, 10 Jun 2016 09:33:23 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.2-RELEASE
X-Bugzilla-Keywords: patch
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: andrew.hotlab@hotmail.com
X-Bugzilla-Status: New
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc
Message-ID: <bug-202625-5312-8ezSeA5Yic@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-202625-5312@https.bugs.freebsd.org/bugzilla/>
References: <bug-202625-5312@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 10 Jun 2016 09:33:24 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202625

Andrew <andrew.hotlab@hotmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |freebsd-scsi@FreeBSD.org

--- Comment #2 from Andrew <andrew.hotlab@hotmail.com> ---
Adding the freebsd-scsi list to the discussion, hoping that a committer
could notice it and commit this patch. Thanks!

--Andrew

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-scsi@freebsd.org  Fri Jun 10 16:36:34 2016
Return-Path: <owner-freebsd-scsi@freebsd.org>
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 13034AD92C7
 for <freebsd-scsi@mailman.ysv.freebsd.org>;
 Fri, 10 Jun 2016 16:36:34 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org
 [IPv6:2001:1900:2254:206a::16:76])
 (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
 (Client did not present a certificate)
 by mx1.freebsd.org (Postfix) with ESMTPS id 0372B2F6A
 for <freebsd-scsi@FreeBSD.org>; Fri, 10 Jun 2016 16:36:34 +0000 (UTC)
 (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118])
 by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u5AGaXkl095038
 for <freebsd-scsi@FreeBSD.org>; Fri, 10 Jun 2016 16:36:33 GMT
 (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-scsi@FreeBSD.org
Subject: [Bug 202625] [cam][libcam][patch] PERSISTENT RESERVE OUT needs
 scsi_cmd->length to be populated
Date: Fri, 10 Jun 2016 16:36:34 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.2-RELEASE
X-Bugzilla-Keywords: patch
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: asomers@FreeBSD.org
X-Bugzilla-Status: New
X-Bugzilla-Resolution: 
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: ken@FreeBSD.org
X-Bugzilla-Flags: 
X-Bugzilla-Changed-Fields: cc assigned_to
Message-ID: <bug-202625-5312-caW45Jj2iu@https.bugs.freebsd.org/bugzilla/>
In-Reply-To: <bug-202625-5312@https.bugs.freebsd.org/bugzilla/>
References: <bug-202625-5312@https.bugs.freebsd.org/bugzilla/>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: SCSI subsystem <freebsd-scsi.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-scsi/>
List-Post: <mailto:freebsd-scsi@freebsd.org>
List-Help: <mailto:freebsd-scsi-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-scsi>,
 <mailto:freebsd-scsi-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 10 Jun 2016 16:36:34 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202625

Alan Somers <asomers@FreeBSD.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |asomers@FreeBSD.org
           Assignee|freebsd-bugs@FreeBSD.org    |ken@FreeBSD.org
