From owner-freebsd-scsi@freebsd.org Mon Jun 6 08:51:43 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7F4CDB6D900 for ; Mon, 6 Jun 2016 08:51:43 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from cu01176b.smtpx.saremail.com (cu01176b.smtpx.saremail.com [195.16.151.151]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 486991798 for ; Mon, 6 Jun 2016 08:51:42 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from [172.16.8.36] (izaro.sarenet.es [192.148.167.11]) by proxypop01.sare.net (Postfix) with ESMTPSA id A72EA9DD7CD; Mon, 6 Jun 2016 10:42:32 +0200 (CEST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts From: Borja Marcos In-Reply-To: Date: Mon, 6 Jun 2016 10:42:32 +0200 Cc: freebsd-scsi@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> To: Steven Hartland X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 06 Jun 2016 08:51:43 -0000

> On 03 Jun 2016, at 23:49, Steven Hartland wrote:
>
> First thing would be to run gstat with -d to see if you're actually stacking up deletes, a symptom of which can be r/w dropping to zero.
>
> If you are seeing significant deletes it could be a FW issue on the drives.

Hmm.
I’ve suffered that badly with Intel P3500 NVMe drives, which suffer at least from a driver problem: trims are not coalesced.
However I didn’t experience command timeouts. Reads and, especially, writes, stalled badly.

A quick test for trim related trouble is setting the sysctl variable vfs.zfs.vdev.bio_delete_disable to 1. It doesn’t require
a reboot and you can quickly compare results.

In my case, a somewhat similar problem in an IBM server was caused by a faulty LSI3008 card it seems. As I didn’t have spare LSI3008 cards
at the time I replaced it by a LSI2008 and everything works perfectly. Before anyone chimes in suggesting card incompatibility of some sort,
I have a twin system with a LSI3008 working like a charm. ;)

Borja.

From owner-freebsd-scsi@freebsd.org Mon Jun 6 22:19:12 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E40F3B63C72 for ; Mon, 6 Jun 2016 22:19:12 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id C196317B5 for ; Mon, 6 Jun 2016 22:19:11 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id D084921AA6C for ; Mon, 6 Jun 2016 17:19:04 -0500 (CDT) X-Virus-Scanned: amavisd-new at furymx.com Received: from mail.furymx.com ([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id QU7in_cUxwyD for ; Mon, 6 Jun 2016 17:19:02 -0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher
ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id C90D221AA58 for ; Mon, 6 Jun 2016 17:19:02 -0500 (CDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> From: list-news Message-ID: <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> Date: Mon, 6 Jun 2016 17:19:02 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 06 Jun 2016 22:19:13 -0000

System was running solid all weekend with camcontrol tags set to 1 for each device, zero errors.

Last week I did try
*# sysctl kern.cam.da.X.delete_method=DISABLE*
for each drive, but it still threw errors.

Also, I did try out bio_delete_disable earlier today:
*# camcontrol tags daX -N 255*
(Firstly resetting tags back to 255 for each device, as they are currently at 1.)
*# sysctl vfs.zfs.vdev.bio_delete_disable=1*

(a few minutes later)

Jun 6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 577 command timeout cm 0xfffffe0001351550 ccb 0xfffff804e78e3800 target 12, handle(0x000c)
Jun 6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name ( )
Jun 6 12:28:36 s18 kernel: mpr0: timedout cm 0xfffffe0001351550 allocated tm 0xfffffe0001322150
Jun 6 12:28:36 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001351550
Jun 6 12:28:36 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 12
Jun 6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 18 45 1c c0 00 00 08 00 length 4096 SMID 583 command timeout cm 0xfffffe0001351d30 ccb 0xfffff806b9556800 target 12, handle(0x000c)
Jun 6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4, connector name ( )
Jun 6 12:28:36 s18 kernel: mpr0: queued timedout cm 0xfffffe0001351d30 for processing by tm 0xfffffe0001322150
...

During the 60 second hang:
*# gstat -do*
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
   70      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da7

Also during the 60 second hang:
*# camcontrol tags da3 -v*
(pass2:mpr0:0:12:0): dev_openings 248
(pass2:mpr0:0:12:0): dev_active 7
(pass2:mpr0:0:12:0): allocated 7
(pass2:mpr0:0:12:0): queued 0
(pass2:mpr0:0:12:0): held 0
(pass2:mpr0:0:12:0): mintags 2
(pass2:mpr0:0:12:0): maxtags 255

Also during the 60 second hang:
*# sysctl dev.mpr*
dev.mpr.0.spinup_wait_time: 3
dev.mpr.0.chain_alloc_fail: 0
dev.mpr.0.enable_ssu: 1
dev.mpr.0.max_chains: 2048
dev.mpr.0.chain_free_lowwater: 2022
dev.mpr.0.chain_free: 2048
dev.mpr.0.io_cmds_highwater: 71
dev.mpr.0.io_cmds_active: 4
dev.mpr.0.driver_version: 09.255.01.00-fbsd
dev.mpr.0.firmware_version: 10.00.03.00
dev.mpr.0.disable_msi: 0
dev.mpr.0.disable_msix: 0
dev.mpr.0.debug_level: 895
dev.mpr.0.%parent: pci1
dev.mpr.0.%pnpinfo: vendor=0x1000 device=0x0097
subvendor=0x15d9 subdevice=0x0808 class=0x010700
dev.mpr.0.%location: pci0:1:0:0 handle=\_SB_.PCI0.BR1A.H000
dev.mpr.0.%driver: mpr
dev.mpr.0.%desc: Avago Technologies (LSI) SAS3008
dev.mpr.%parent:

Something else that may be worth considering:
I ran fio & bonnie++ for about an hour of heavy io (with tags still set to 255, drive busy showing 90-100%). No errors. I fire up my application (threaded Java/Postgres application), and within minutes:

*# gstat -do*
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da4
   71      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da6
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da7

*Error:*
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30 65 13 90 00 00 10 00
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
...

*And again 2 minutes later:*

Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21 66 63 58 00 00 10 00
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per sense data)
...

*And again 3 minutes later:*

Jun 6 13:41:29 s18 kernel: (da7:mpr0:0:18:0): WRITE(10). CDB: 2a 00 33 44 b5 b8 00 00 10 00
...

*# camcontrol tags daX -N 1*
(And now, after 15 minutes, zero errors.)
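For anyone wanting to repeat the tags comparison, here is a minimal sketch of applying the setting to every pool member in one pass. The drive list is taken from the gstat output in this message and is only an assumption about pool membership. DRIVES, TAGS, DRY_RUN and set_tags are hypothetical names introduced for illustration; with DRY_RUN=1 (the default here) the script only prints the camcontrol commands it would run.

```sh
#!/bin/sh
# Sketch only: set the CAM tag depth on a list of drives.
# DRIVES is an assumption (da2 da4 da6 da7 appear in the gstat output);
# adjust it to the real members of the zpool.
DRIVES="${DRIVES:-da2 da4 da6 da7}"
TAGS="${TAGS:-1}"          # 1 = one outstanding command per drive, 255 = the value tested before
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the commands instead of running them

set_tags() {
    for d in $DRIVES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "camcontrol tags $d -N $TAGS"
        else
            camcontrol tags "$d" -N "$TAGS"
        fi
    done
}

set_tags
```

Running it with DRY_RUN=0 TAGS=255 would put the drives back at 255 tags before a test run, and DRY_RUN=0 TAGS=1 afterwards would restore the workaround.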
In putting some thought into this, which may or may not be off base (please feel free to correct me btw), I've noticed the following:

1) There doesn't seem to be any indication as to what causes the drive to time out. The command that fails in the error log is one of the following: READ(10), WRITE(10), ATA COMMAND PASS THROUGH(16), or SYNCHRONIZE CACHE(10). As I understand it, that is the command that was being executed, timed out, and was retried, not necessarily what caused the drive lock-up.

2) When my application is run, it hammers postgres pretty hard, and when postgres is running I get the errors. FIO & Bonnie++ don't give me errors; daily use of the system doesn't give me errors. I'm assuming postgresql is sending far more of a certain type of command to the io subsystem than those other applications, and the first one that comes to mind is fsync.

3) I turned fsync off in postgresql.conf (I'm brave for science!!) then ran my application again with tags at 255, 100% cpu load, 70-80% drive busy.

*1.5 hours later at full load - finally, a single timeout:*
Jun 6 16:31:33 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 2d 50 1b 78 00 00 08 00 length 4096 SMID 556 command timeout cm 0xfffffe000134f9c0 ccb 0xfffff83aa5b25000 target 12, handle(0x000c)

I ran it for another 20 minutes with no additional timeouts.

I assume each fsync call turns into a zfs -> cam -> SYNCHRONIZE CACHE command for each device. Postgres issues fsync considerably more often than a typical application (at least with fsync turned on in postgresql.conf), which would explain why the error is rare when fsync is turned off or only a few fsyncs are being sent (ie typical system usage). Yet, when fsync is being sent repeatedly, the errors start happening every few minutes.
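One way to check which commands actually show up in the timeout messages is to tally them from the kernel log. This is a rough sketch, assuming one kernel message per line (as in /var/log/messages) and the mpr "command timeout" wording seen in the logs in this thread; tally_timeouts is a hypothetical helper name, not part of any existing tool.

```sh
#!/bin/sh
# Sketch only: count SCSI command names in mpr "command timeout" lines.
# Reads log text on stdin and prints "<count> <command>", most frequent first.
tally_timeouts() {
    grep 'command timeout' |
        # Keep the text between the "(daN:mpr0:...): " prefix and ". CDB:".
        sed -n 's/.*): \(.*\)\. CDB:.*/\1/p' |
        sort | uniq -c | sort -rn
}
```

It could be run as `tally_timeouts < /var/log/messages`. As noted in point 1, the result only shows which commands were outstanding when the timeout fired, so it cannot by itself say which command triggered the stall.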
The only reason I can think of for why setting tags to 1 eliminates the errors entirely is that the Intel drives are not handling parallel commands from cam when one (or more) of the commands is SYNCHRONIZE CACHE.

Thoughts?

Thanks,

-Kyle

On 6/6/16 3:42 AM, Borja Marcos wrote:
>> On 03 Jun 2016, at 23:49, Steven Hartland wrote:
>>
>> First thing would be to run gstat with -d to see if you're actually stacking up deletes, a symptom of which can be r/w dropping to zero.
>>
>> If you are seeing significant deletes it could be a FW issue on the drives.
> Hmm. I’ve suffered that badly with Intel P3500 NVMe drives, which suffer at least from a driver problem: trims are not coalesced.
> However I didn’t experience command timeouts. Reads and, especially, writes, stalled badly.
>
> A quick test for trim related trouble is setting the sysctl variable vfs.zfs.vdev.bio_delete_disable to 1. It doesn’t require
> a reboot and you can quickly compare results.
>
> In my case, a somewhat similar problem in an IBM server was caused by a faulty LSI3008 card it seems. As I didn’t have spare LSI3008 cards
> at the time I replaced it by a LSI2008 and everything works perfectly. Before anyone chimes in suggesting card incompatibility of some sort,
> I have a twin system with a LSI3008 working like a charm. ;)
>
> Borja.
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Tue Jun 7 06:35:09 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 477A1B6D2DE for ; Tue, 7 Jun 2016 06:35:09 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from cu01176a.smtpx.saremail.com (cu01176a.smtpx.saremail.com [195.16.150.151]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 08F6612E0 for ; Tue, 7 Jun 2016 06:35:08 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from [172.16.8.36] (izaro.sarenet.es [192.148.167.11]) by proxypop03.sare.net (Postfix) with ESMTPSA id 1E7E89DE019; Tue, 7 Jun 2016 08:25:19 +0200 (CEST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts From: Borja Marcos In-Reply-To: <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> Date: Tue, 7 Jun 2016 08:25:19 +0200 Cc: freebsd-scsi@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> To: list-news X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 06:35:09 -0000

> On 07 Jun 2016, at 00:19, list-news wrote:
>
> *# sysctl vfs.zfs.vdev.bio_delete_disable=1*
> (a few minutes later)

So trim is not causing it.

> *Error:*
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30 65 13 90 00 00 10 00
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
> ...
>
> *And again 2 minutes later:*
>
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21 66 63 58 00 00 10 00
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per sense data)

I suffered this particular symptom because, it seems, of a broken LSI3008 card. Finally I replaced it with an LSI2008 (I didn’t have a spare LSI3008 handy)
and the errors vanished. In my case it is an NFS storage server based on ZFS and Samsung SSD disks serving several Xen
hosts. In my case the disks are SATA.

I know that it was a defective card and not a problem with the LSI3008 cards or driver because I have a twin system working
like a charm from day zero.

I would try, if possible, to swap the controller.

Borja.
From owner-freebsd-scsi@freebsd.org Tue Jun 7 17:09:12 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0A487B6EDC2 for ; Tue, 7 Jun 2016 17:09:12 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DE4B11FBC for ; Tue, 7 Jun 2016 17:09:11 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id 38327219A73; Tue, 7 Jun 2016 12:09:10 -0500 (CDT) X-Virus-Scanned: amavisd-new at furymx.com Received: from mail.furymx.com ([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id t1SO_uXu9uP1; Tue, 7 Jun 2016 12:09:08 -0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id 59F36219A69; Tue, 7 Jun 2016 12:09:08 -0500 (CDT) From: list-news Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: Borja Marcos References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> Cc: freebsd-scsi@freebsd.org Message-ID: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> Date: Tue, 7 Jun 2016 12:09:08 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 
In-Reply-To: <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 17:09:12 -0000

The system is a Twin. In the first post I mentioned this but I probably wasn't clear.

The twin unit is this one:
https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm

I've used all components from twin node A and B (cpu / memory / mainboard / controller). I still get the errors. The backplane was the original thought of concern, and that has been RMA'd and replaced - errors continue. I've even swapped out power supplies with another identical unit I have here.

In every case the errors continue, until I do this:
# camcontrol tags daX -N 1
(for each drive in the zpool)

Then the errors stop.

The system errors every few minutes while my application is running. Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it finishes a batch, no errors are reported with -N set to 1. If I set tags with -N 255 for each device, errors start again within 5 minutes, and continue every 2-5 minutes, until the batch is finished.

-Kyle

> I would try, if possible, to swap the controller.
>
> Borja.
> > From owner-freebsd-scsi@freebsd.org Tue Jun 7 19:02:30 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 47F30B6ED10 for ; Tue, 7 Jun 2016 19:02:30 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com [IPv6:2a00:1450:400c:c09::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id D92401D83 for ; Tue, 7 Jun 2016 19:02:29 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x233.google.com with SMTP id k204so81978649wmk.0 for ; Tue, 07 Jun 2016 12:02:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=5hIHKHVFrNrVSoO0S5WyI7gKopINdCIOgNvFVUELoo8=; b=GgR2wayFVgk1GC0X8Yzh/b0BpJfV88RG6dnq1mILcEsMpuHFqiEJ7H9oG+q844eAyW eVzDC88cf6K1U5Ikt+imlehkL0aHyxOR3B0ub/mlDZpvvlcy+R75T9N5xHQbmEfD3r4J WBGY070iWd+hkH5fFyKrv64+sXPGXh96CyjO1dxgz2JjdK56/4Z7Mx10QwaFSFZbO2MV cpgZNKnyl86ao7RlvrWkAfq7OmgkZ7B+silRJrMC70RQ7/OUSLtaoKVsPzuoxFl/A3f5 EBKKfGv4pnwt5UZ6T1z/3Pxrn3CyJQj0fjwXi26aYFros9umqHzzua/LtYN1bc4PSwRe S0zQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=5hIHKHVFrNrVSoO0S5WyI7gKopINdCIOgNvFVUELoo8=; b=b2GUulLhPgqXzuuLsY/W8AjS/eSA400Hs83XbHkvNY6fxD/wiiY6MlTljJF7+zJgxA KEDXew3Y/46mjsrJX29qCSXQqNIna5k6qFl85DJTiwF/xn93306IhQXtEBlIjfUjSoHG CHuc/pCvG/UJdoc4AGHt5UsGMikiAsHxXtC/YuQz04072avOg17Nm4Jy/d54L9u24Nux oScXHePgqsi5HMXVi862BZdcNJmul+M5a2/bjjjMPKL48sek7PAhmhvMRPf15tpGGihO 
pGczbg/SRJcCff6SCeSIrkdv/bcWXEpmW0s2j3FymaVRABnqiHaqtgRB/ofB0qKnf0PG iA8w== X-Gm-Message-State: ALyK8tJoMj1GKIs6yLNDdNR8XNxs35Y+aQXSEEpOfeMlwRMNwku6na1ULw8ldhUziLXMe9vB X-Received: by 10.28.73.198 with SMTP id w189mr1164262wma.32.1465326147557; Tue, 07 Jun 2016 12:02:27 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171]) by smtp.gmail.com with ESMTPSA id d7sm20832323wmd.11.2016.06.07.12.02.26 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 12:02:26 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> From: Steven Hartland Message-ID: <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk> Date: Tue, 7 Jun 2016 20:02:31 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 19:02:30 -0000 Have you tried direct attaching the drives? On 07/06/2016 18:09, list-news wrote: > The system is a Twin. In the first post I mentioned this but I > probably wasn't clear. > > The twin unit is this one: > https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm > > I've used all components from twin node A and B (cpu / memory / > mainboard / controller). I still get the errors. 
The backplane was
> the original thought of concern, and that has been RMA'd and replaced
> - errors continue. I've even swapped out power supplies with another
> identical unit I have here.
>
> In every case the errors continue, until I do this:
> # camcontrol tags daX -N 1
> (for each drive in the zpool)
>
> Then the errors stop.
>
> The system errors every few minutes while my application is running.
> Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu and
> drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it
> finishes a batch, no errors are reported with -N set to 1. If I set
> tags with -N 255 for each device, errors start again within 5 minutes,
> and continue every 2-5 minutes, until the batch is finished.
>
> -Kyle
>
>> I would try, if possible, to swap the controller.
>>
>> Borja.
>>
>
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Tue Jun 7 19:24:40 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 08AB6B6D148 for ; Tue, 7 Jun 2016 19:24:40 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DB11D16F5 for ; Tue, 7 Jun 2016 19:24:39 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id 021FA21B7D5 for ; Tue, 7 Jun 2016 14:24:38 -0500 (CDT) X-Virus-Scanned: amavisd-new at furymx.com Received: from mail.furymx.com
([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id GdMerZY5S3fG for ; Tue, 7 Jun 2016 14:24:36 -0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id 0DC2D21B7CE for ; Tue, 7 Jun 2016 14:24:36 -0500 (CDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> From: list-news Message-ID: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> Date: Tue, 7 Jun 2016 14:24:35 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 19:24:40 -0000 I have additional confirmation that it's not faulty hardware. I moved the 4 disks that carry the postgresql database over to another server (same model - TWIN 2028-DECR). Mounted the zpool and fired up my application. This server is using a much earlier firmware on the SAS controller. Different CPU / Memory / etc. 
Errors happen within the first couple minutes, and continue every few minutes (notice time-stamps for each drive timeout every few minutes): Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 ea dc 60 00 00 08 00 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command timeout Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). 
CDB: 28 00 0e 74 79 e0 00 00 08 00 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI Status Error Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check Condition Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per sense data) Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 Aborting command 0xfffffe0000be0140 Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 10 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). 
CDB: 28 00 35 26 ca f0 00 00 08 00 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee f0 00 00 08 00 Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per sense data) Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 Aborting command 0xfffffe0000bfcca0 Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 10 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). 
CDB: 28 00 1e d6 de f0 00 00 08 00 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: 10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 8d 99 48 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 6b df b8 00 00 10 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). 
CDB: 28 00 0d f7 cd a8 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd b0 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e d6 de f0 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 Aborting command 0xfffffe0000c1e960 Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 11 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). 
CDB: 28 00 19 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 10 00 Jun 7 13:15:12 s17 kernel: 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command timeout Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 state c xfer (da3:0 Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 terminated ioc 804b scsi 0 sta11:te c xfer 0 Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status Error Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check Condition Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per sense data)

gstat output:
(I'm guessing I caught this during the da2 error)

#gstat -do
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d    o/s   ms/o   %busy Name
   70      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da2
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da3
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da10
    0      0      0      0    0.0      0      0    0.0      0      0    0.0      0    0.0    0.0| da11

I then set the tags down to 1 for each device:

#camcontrol tags da2 -N 1
#camcontrol tags da3 -N 1
#camcontrol tags da10 -N 1
#camcontrol tags da11 -N 1

No errors for the last hour, with the system still running at full load.

Everything points to an NCQ firmware issue. Intel says the S3610 supports NCQ with 32 tags, but I've hit the errors with tags set to 8 plenty of times.

(See NCQ line below.)
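A side note for anyone reading the log above: every burst of terminated commands includes an ATA COMMAND PASS THROUGH(16) entry with the same CDB, and that CDB can be decoded by hand. The following is only a rough sketch, assuming the field layout of the ATA PASS-THROUGH(16) command from the SAT (SCSI/ATA Translation) standard. The helper name decode_ata_pt16 is hypothetical and is not part of camcontrol or CAM.

```python
# Sketch: decode an ATA PASS-THROUGH(16) CDB as printed in the kernel log.
# Field offsets are assumed from the SAT standard; decode_ata_pt16 is a
# hypothetical helper written for illustration only.

def decode_ata_pt16(cdb_hex: str) -> dict:
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    return {
        "protocol": (cdb[1] >> 1) & 0x0F,        # 6 means DMA
        "extend": cdb[1] & 0x01,                 # 1 means 48-bit command
        "features": (cdb[3] << 8) | cdb[4],      # ATA features register
        "sector_count": (cdb[5] << 8) | cdb[6],  # ATA count register
        "device": cdb[13],
        "command": cdb[14],                      # ATA command opcode
    }


# CDB copied from the log entries above.
fields = decode_ata_pt16("85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00")

# ATA command 0x06 is DATA SET MANAGEMENT; bit 0 of the features
# field selects TRIM.
is_trim = fields["command"] == 0x06 and bool(fields["features"] & 0x01)
print(fields, "TRIM" if is_trim else "not TRIM")
```

Under those assumptions, the pass-through entries in the log are TRIM requests carrying one 512-byte block of LBA ranges, which matches the "length 512" the driver prints for them. That only identifies the command; it does not by itself say whether TRIM is the cause of the timeouts.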
# camcontrol identify da2
pass2: ACS-2 ATA SATA 3.x device
pass2: 1200.000MB/s transfers, Command Queueing Enabled

protocol              ATA/ATAPI-9 SATA 3.x
device model          INTEL SSDSC2BX480G4
firmware revision     G2010150
serial number         [redacted]
WWN                   [redacted]
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 4096, offset 0
LBA supported         268435455 sectors
LBA48 supported       937703088 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             non-rotating

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
NCQ Queue Management           no
NCQ Streaming                  no
Receive & Send FPDMA Queued    no
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      no       no
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            no       no
write-read-verify              no       no
unload                         yes      yes
general purpose logging        yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks       yes              4
DSM - deterministic read       yes              zeroed
Host Protected Area (HPA)      yes      no      937703088/937703088
HPA - Security                 no

It doesn't appear I have any way to deactivate NCQ in firmware, which would be a nice test. I did attempt this, with no luck:

# camcontrol negotiate da2 -T disable
(pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
(pass2:mpr0:0:10:0): tagged queueing: enabled
camcontrol: XPT_SET_TRANS_SETTINGS CCB failed

-Kyle

On 6/7/16 12:09 PM, list-news wrote: > The system is a Twin. In the first post I mentioned this but I > probably wasn't clear. > > The twin unit is this one: > https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm > > I've used all components from twin node A and B (cpu / memory / > mainboard / controller). I still get the errors. The backplane was > the original thought of concern, and that has been RMA'd and replaced > - errors continue. 
I've even swapped out power supplies with another > identical unit I have here. > > In every case the errors continue, until I do this: > #camcontrol daX -N 1 > (for each drive in the zpool) > > Then the errors stop. > > The system errors every few minutes while my application is running. > Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu and > drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it > finishes a batch, no errors are reported with -N set to 1. If I set > tags with -N 255 for each device, errors start again within 5 minutes, > and continue every 2-5 minutes, until the batch is finished. > > -Kyle > >> I would try, if possible, to swap the controller. >> >> >> >> >> >> >> Borja. >> >> > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" From owner-freebsd-scsi@freebsd.org Tue Jun 7 19:53:08 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8DFFEB6D843 for ; Tue, 7 Jun 2016 19:53:08 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x232.google.com (mail-wm0-x232.google.com [IPv6:2a00:1450:400c:c09::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 087DB1528 for ; Tue, 7 Jun 2016 19:53:08 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x232.google.com with SMTP id k204so83570160wmk.0 for ; Tue, 07 Jun 2016 12:53:07 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version 
:in-reply-to; bh=OyWE/LDvMML2a8/vBR7DS72Be3W4ajYMeIMfKmubKFc=; b=qaoRJQJk4pOrwFH6RMlcq465W+/CNyRbatsuAaD2N6mN43z/pcZhXJQ0GtIm5FC95A l2arOrHD9T8JlN6MI2oB+nFOo94W3EbxP2ZjhZpaufw1LewsvRFq6H3OPC3MOUiA9ha+ FVt7552OWKtfF7TYvMzFAJnDnBZnzZQoaILFR8WmQzf1i8FGWRbQ1+y7WerAh/msB1G+ ZwFkT0PqiN9Z4ZQRNjDdlzwn8AmitDYxSjv2+5YaoA4fol7PpnBTn3gesRCTjublXD78 5AGSQjJWt+v5vQNaKeodmPWHuwc7jJN51cOnNXtvmQqRYedKOT7jzNIklhRrCoJYZ6oS qBNQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=OyWE/LDvMML2a8/vBR7DS72Be3W4ajYMeIMfKmubKFc=; b=kgoRWam3kz2/jpeAhpbCccVOnbwtUh6sEwPnkI9uEVe7GqSvnTebL/CuBf50OBmREP iQbQsS3xI8AP/qDHua/BKL5sz1DJ7ybEIva8SIGHO/jGv/1+EHnvudR0IB3LzGZxgFrX LMdaxd8s+zYcKR7Mb7iT3onBO1d0J6vIAQRjDLQHLI9QWp64JMdHewbNVORu3Ue2sWVT Kc9MsLCOf7VgvropWsi9EcEIBLuPYbsBmYVZtsecBieOxFPp35yqRz0W1bJdybZ8HW31 TUCCBeceozJO7iw7OylJTcm1lhiPHkKCy3FMO6Ugtdu+6ovWA+mp5CoKD7dQqiZM/CRv 7Q3Q== X-Gm-Message-State: ALyK8tJJlke8H9vvETKopxIL3Dl+FjGE+B4tJFUeFza43ENMIGTio0ezcU6oJimIUkyNFX6P X-Received: by 10.28.26.138 with SMTP id a132mr4425191wma.82.1465329186240; Tue, 07 Jun 2016 12:53:06 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. 
[82.69.141.171]) by smtp.gmail.com with ESMTPSA id c62sm20884456wmd.1.2016.06.07.12.53.04 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 12:53:05 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> From: Steven Hartland Message-ID: <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> Date: Tue, 7 Jun 2016 20:53:10 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 19:53:08 -0000

CDB: 85 is a TRIM command IIRC. I know you tried this before using BIO delete, but assuming you're running ZFS, can you set the following in loader.conf and see how you get on?

vfs.zfs.trim.enabled=0

Regards
Steve

On 07/06/2016 20:24, list-news wrote: > I have additional confirmation that it's not faulty hardware. > > I moved the 4 disks that carry the postgresql database over to another > server (same model - TWIN 2028-DECR). Mounted the zpool and fired up > my application. > > This server is using a much earlier firmware on the SAS controller. > Different CPU / Memory / etc. 
> > Errors happen within the first couple minutes, and continue every few > minutes (notice time-stamps for each drive timeout every few minutes): > > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 > 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi > 0 state c xfer 0 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 > 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi > 0 state c xfer 0 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS > THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 > length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 > 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi > 0 state c xfer 0 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 > 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi > 0 state c xfer 0 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). > CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc > 804b scsi 0 state c xfer 0 > Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 > 36 ea dc 60 00 00 08 00 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command > timeout > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). 
CDB: 28 00 > 0e 74 79 e0 00 00 08 00 > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI > Status Error > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check > Condition > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT > ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) > Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per > sense data) > Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 > Aborting command 0xfffffe0000be0140 > Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort > for target ID 10 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS > THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 > length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 > be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 > 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 > 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 > 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc > 804b scsi 0 state c xfer 0 > Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35 > 26 ca f0 00 00 08 00 > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command > timeout > Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f6 ee f0 00 00 08 00 > Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status > Error > Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check > Condition > Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT > ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) > Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per > sense data) > Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 > Aborting command 0xfffffe0000bfcca0 > Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort > for target ID 10 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS > THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 > length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b > 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 > 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc > 804b scsi 0 state c xfer 0 > Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e > d6 de f0 00 00 08 00 > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command > timeout > Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): > originator(PL), code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS > THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 > Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), > code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request > completed with an error > Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): > originator(PL), code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): > originator(PL), code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command > Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), > code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). > CDB: 35 00 00 00 00 00 00 00 00 00 > Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB > request completed with an error > Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), > code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): > originator(PL), code(0x12), sub_code(0x0440) > Jun 7 13:13:05 s17 kernel: 10:0): Retrying command > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). 
CDB: 28 00 1b > 8d 99 48 00 00 08 00 > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request > completed with an error > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 > 6b df b8 00 00 10 00 > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request > completed with an error > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f7 cd a8 00 00 08 00 > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request > completed with an error > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d > f7 cd b0 00 00 08 00 > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request > completed with an error > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e > d6 de f0 00 00 08 00 > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request > completed with an error > Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command > Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 > Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status > Error > Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check > Condition > Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT > ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) > Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted > Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack > Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 > Aborting command 0xfffffe0000c1e960 > Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort > for target ID 11 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS > THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 > length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 > 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 > bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 > 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 > state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 > 10 00 > Jun 7 13:15:12 s17 kernel: 0 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command > timeout > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 > 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 > state c xfer (da3:0 > Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE > CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 > terminated ioc 804b scsi 0 sta11:te c xfer 0 > Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command > Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). > CDB: 35 00 00 00 00 00 00 00 00 00 > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status > Error > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check > Condition > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT > ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) > Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per > sense data) > > gstat output: > (I'm guessing I caught this during the da2 error) > > #gstat -do > L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps > ms/d o/s ms/o %busy Name > 70 0 0 0 0.0 0 0 0.0 0 0 > 0.0 0 0.0 0.0| da2 > 0 0 0 0 0.0 0 0 0.0 0 0 > 0.0 0 0.0 0.0| da3 > 0 0 0 0 0.0 0 0 0.0 0 0 > 0.0 0 0.0 0.0| da10 > 0 0 0 0 0.0 0 0 0.0 0 0 > 0.0 0 0.0 0.0| da11 > > > I then set the tags down to 1 for each device: > > #camcontrol tags da2 -N 1 > #camcontrol tags da3 -N 1 > #camcontrol tags da10 -N 1 > #camcontrol tags da11 -N 1 > > And, no errors for the last hour, system still running at full load. > > Everything is feeling like an NCQ firmware issue. Intel s3610 says it > supports NCQ in it's SSDs with 32 tags. But I've pulled the errors > with tags set to 8 plenty of times. > > (See NCQ line below.) 
> > # camcontrol identify da2 > > pass2: ACS-2 ATA SATA 3.x device > pass2: 1200.000MB/s transfers, Command Queueing Enabled > protocol ATA/ATAPI-9 SATA 3.x > device model INTEL SSDSC2BX480G4 > firmware revision G2010150 > serial number [redacted] > WWN [redacted] > cylinders 16383 > heads 16 > sectors/track 63 > sector size logical 512, physical 4096, offset 0 > LBA supported 268435455 sectors > LBA48 supported 937703088 sectors > PIO supported PIO4 > DMA supported WDMA2 UDMA6 > media RPM non-rotating > > Feature Support Enabled Value Vendor > read ahead yes yes > write cache yes yes > flush cache yes yes > overlap no > Tagged Command Queuing (TCQ) no no > Native Command Queuing (NCQ) yes 32 tags > NCQ Queue Management no > NCQ Streaming no > Receive & Send FPDMA Queued no > SMART yes yes > microcode download yes yes > security yes no > power management yes yes > advanced power management no no > automatic acoustic management no no > media status notification no no > power-up in Standby no no > write-read-verify no no > unload yes yes > general purpose logging yes yes > free-fall no no > Data Set Management (DSM/TRIM) yes > DSM - max 512byte blocks yes 4 > DSM - deterministic read yes zeroed > Host Protected Area (HPA) yes no 937703088/937703088 > HPA - Security no > > And it doesn't appear I have any way to deactivate it in firmware. > Which would be a nice test. I did attempt this with no luck: > # camcontrol negotiate da2 -T disable > (pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s > (pass2:mpr0:0:10:0): tagged queueing: enabled > camcontrol: XPT_SET_TRANS_SETTINGS CCB failed > > -Kyle > > > On 6/7/16 12:09 PM, list-news wrote: >> The system is a Twin. In the first post I mentioned this but I >> probably wasn't clear. >> >> The twin unit is this one: >> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm >> >> I've used all components from twin node A and B (cpu / memory / >> mainboard / controller). I still get the errors. 
The backplane was >> the original thought of concern, and that has been RMA'd and replaced >> - errors continue. I've even swapped out power supplies with another >> identical unit I have here. >> >> In every case the errors continue, until I do this: >> #camcontrol daX -N 1 >> (for each drive in the zpool) >> >> Then the errors stop. >> >> The system errors every few minutes while my application is running. >> Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu >> and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it >> finishes a batch, no errors are reported with -N set to 1. If I set >> tags with -N 255 for each device, errors start again within 5 >> minutes, and continue every 2-5 minutes, until the batch is finished. >> >> -Kyle >> >>> I would try, if possible, to swap the controller. >>> >>> >>> >>> >>> >>> >>> Borja. >>> >>> >> >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" From owner-freebsd-scsi@freebsd.org Tue Jun 7 19:53:27 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4468FB6D89A for ; Tue, 7 Jun 2016 19:53:27 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0DC4B167E for ; Tue, 7 Jun 2016 19:53:26 +0000 (UTC) (envelope-from 
list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id 8E6561ED4C5 for ; Tue, 7 Jun 2016 14:53:25 -0500 (CDT) X-Virus-Scanned: amavisd-new at furymx.com Received: from mail.furymx.com ([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id dQxViKVTZADo for ; Tue, 7 Jun 2016 14:53:24 -0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id 56DDA1ED4BD for ; Tue, 7 Jun 2016 14:53:24 -0500 (CDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk> From: list-news Message-ID: Date: Tue, 7 Jun 2016 14:53:23 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 19:53:27 -0000

I don't believe the mainboard has any SATA ports. It does have a PCIe slot IIRC though, and I may be able to rig something up with another LSI adapter I have lying around.
That is, if I can get it to fit and find a way to power the drives. Although this seems unlikely, unless you are seeing something I'm not?

With that last test:

If it's the SAS controller, then 3 different ones running two different firmware versions are all causing the issue.

If it's the backplane, I have now tested 3 of them as well, two of which I can confirm have different revision numbers.

Errors never appear with tags set to 1 for each drive (effectively eliminating NCQ, as I understand it).

My brief understanding is that a higher tag count allows the SAS adapter to send more commands to the drive in parallel, letting the drive make the decisions about command ordering. If that is accurate and the controller firmware were bad, I assume this would be a far more common bug that would have been fixed already.

On the other hand, if it only happens under heavy parallel SYNCHRONIZE CACHE commands on certain Intel SSDs, and only on controllers (maybe 12 Gbps?) that can outrun the drive firmware or cause a race condition (my suspicion here), it seems far more likely this would have gone unnoticed by Intel.

-Kyle

On 6/7/16 2:02 PM, Steven Hartland wrote: > Have you tried direct attaching the drives? > > On 07/06/2016 18:09, list-news wrote: >> The system is a Twin. In the first post I mentioned this but I >> probably wasn't clear. >> >> The twin unit is this one: >> https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm >> >> I've used all components from twin node A and B (cpu / memory / >> mainboard / controller). I still get the errors. The backplane was >> the original thought of concern, and that has been RMA'd and replaced >> - errors continue. I've even swapped out power supplies with another >> identical unit I have here. >> >> In every case the errors continue, until I do this: >> #camcontrol daX -N 1 >> (for each drive in the zpool) >> >> Then the errors stop. >> >> The system errors every few minutes while my application is running. 
>> Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu >> and drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it >> finishes a batch, no errors are reported with -N set to 1. If I set >> tags with -N 255 for each device, errors start again within 5 >> minutes, and continue every 2-5 minutes, until the batch is finished. >> >> -Kyle >> >>> I would try, if possible, to swap the controller. >>> >>> >>> >>> >>> >>> >>> Borja. >>> >>> >> >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" From owner-freebsd-scsi@freebsd.org Tue Jun 7 20:19:30 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 234AAB6E246 for ; Tue, 7 Jun 2016 20:19:30 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id F400B14EA for ; Tue, 7 Jun 2016 20:19:29 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id 9578221A1C0 for ; Tue, 7 Jun 2016 15:19:27 -0500 (CDT) X-Virus-Scanned: amavisd-new at furymx.com Received: from mail.furymx.com ([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id QWKEPzpXs5o9 for ; Tue, 7 Jun 2016 15:19:26 
-0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id 4C54D21A1B9 for ; Tue, 7 Jun 2016 15:19:26 -0500 (CDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> From: list-news Message-ID: Date: Tue, 7 Jun 2016 15:19:25 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 20:19:30 -0000 Sure Steve: # cat /boot/loader.conf | grep trim vfs.zfs.trim.enabled=0 # sysctl vfs.zfs.trim.enabled vfs.zfs.trim.enabled: 0 # uptime 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 # tail -f /var/log/messages: Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). 
CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( ) Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 allocated tm 0xfffffe0001322150 Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001375580 Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010) Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( ) Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 Jun 7 15:13:50 s18 kernel: mpr0: EventReply : Jun 7 15:13:50 s18 kernel: EventDataLength: 2 Jun 7 15:13:50 s18 kernel: AckRequired: 0 Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) Jun 7 15:13:50 s18 kernel: EventContext: 0x0 Jun 7 15:13:50 s18 kernel: Flags: 1 Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi 0 state c xfer 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi 0 state c xfer 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). 
CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi 0 state c xfer 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b scsi 0 sta(da6:te c xfer 0 Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort TaskMID 1016 status 0x0 code 0x0 count 5 Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 finished recovery after aborting TaskMID 1016 Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16 Jun 7 15:13:50 s18 kernel: mpr0: EventReply : Jun 7 15:13:50 s18 kernel: EventDataLength: 4 Jun 7 15:13:50 s18 kernel: AckRequired: 0 Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c) Jun 7 15:13:50 s18 kernel: EventContext: 0x0 Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2 Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9 Jun 7 15:13:50 s18 kernel: NumPhys: 31 Jun 7 15:13:50 s18 kernel: NumEntries: 1 Jun 7 15:13:50 s18 kernel: StartPhyNum: 8 Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3) Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010 Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb) Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on Event: [16] Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16] Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on Event: [1c] Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c] Jun 7 15:13:50 s18 kernel: mpr0: EventReply : Jun 7 15:13:50 s18 kernel: EventDataLength: 2 Jun 7 15:13:50 s18 kernel: AckRequired: 0 Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) Jun 7 15:13:50 s18 kernel: EventContext: 0x0 Jun 7 15:13:50 s18 kernel: Flags: 0 Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on Event: [16] Jun 7 15:13:50 s18 kernel: mpr0: 
(3)->(mprsas_fw_work) Event Free: [16] Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data) -Kyle On 6/7/16 2:53 PM, Steven Hartland wrote: > CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO > delete but assuming your running ZFS can you set the following in > loader.conf and see how you get on. > vfs.zfs.trim.enabled=0 > > Regards > Steve From owner-freebsd-scsi@freebsd.org Tue Jun 7 20:44:16 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 3902EB6E9DA for ; Tue, 7 Jun 2016 20:44:16 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x22f.google.com (mail-wm0-x22f.google.com [IPv6:2a00:1450:400c:c09::22f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id C98BD1662 for ; Tue, 7 Jun 2016 20:44:15 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x22f.google.com with SMTP id k204so85149655wmk.0 for ; Tue, 07 Jun 2016 13:44:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=VxnzCPJU+EWUN9c/LC81z/fsnyEdDuKlZlKEV9H87Zg=; b=KZz621TVswdkOVWViE/i1qs/Bb3CHyCUHhW6VCTBzKpfsS5tT9QUX1eafVthOI4a71 
64zp7IX58Io0I3s4ZrJC3C8kYXejTTzCB7eWlRa/CQZRxQdMOAaEyndohOYdaCcsIjm+ KqrTeZUB7dZMxtY2VKKSQA95kHVD7lLGN2W1p/M/lL/GMX+XkdTf86nun77rqK97AxmX RAlUqeho91AjaX3x907igDP8uXy8zbY3YWfMa86NBa+x8OgjfyUlmfENk28u72x1V45w XxI0pkXL+QMXemmVx0vPLH8yzpNZCl1EJpu9I/CwObPEfoYZybFcMM07l2Jo2/kobvE5 GfZQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=VxnzCPJU+EWUN9c/LC81z/fsnyEdDuKlZlKEV9H87Zg=; b=Kuyw0QYuD8M9tFb2lLSLih342MQuLtMroHjMZrn9kPE/xVR4hD+Ex7X0WK9VkPlD8C hZJAMwY8nFa4gfUNovRST/zp1xzyhPudqbOzrqK0TbwfZcs+RqdbIHHSIXnFhmm64x+Z p2dXTIXwvvtgaVTV/lGdfkD8o0UYfAXVk1fe5MyZYA9zL/1o9SEJ1wtpCP3NHFE0n+N7 y+FOf0X2MYaE3vV56JfKxrfAUTvWytcssO+yHYU3LCrBPIJe3ejxYUkZ6La8OafSWo+L JiEfApJLGaRXNJp5UvB/Z7IvjXquwAKZgX6R0QFlHIjr36R/wcGghk84vaX3q4Y1WtVl weuA== X-Gm-Message-State: ALyK8tKJlhdBYotY5jRiKon90vL385K+9oanXqSlAkFiQAhyL9jnK+oSpnxc027KrIhcVCG7 X-Received: by 10.195.9.97 with SMTP id dr1mr1148827wjd.69.1465332254196; Tue, 07 Jun 2016 13:44:14 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. 
[82.69.141.171]) by smtp.gmail.com with ESMTPSA id kd7sm27119494wjc.33.2016.06.07.13.44.11 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 13:44:12 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <6f861c77-d9c9-9710-7be6-5b08f1047fe5@multiplay.co.uk> From: Steven Hartland Message-ID: <782184e7-0e99-63a3-8f40-8d2452d344ac@multiplay.co.uk> Date: Tue, 7 Jun 2016 21:44:17 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 20:44:16 -0000 On 07/06/2016 20:53, list-news wrote: > I don't believe the mainboard has any SATA ports. It does have a PCIe > slot IIRC though, and I may be able to rig something up with another > LSI adapter I have laying around. If I can get it to fit and find a > way to power the drives. > > Although, this seems unlikely unless you are seeing something I'm not? Nope, but you're assuming that the backplane doesn't have a design issue, and unfortunately that's more common than most people know, so my process is to always fall back to the lowest common denominator and attach the disks directly to the controller. > > With that last test: If it's the SAS controller, 3 different ones > running two different firmware versions are all causing the issue.
If > it's the backplane, I have now tested 3 of them as well, two of which > I can confirm have different revision numbers. > > Errors never appear with tags set to 1 for each drive (effectively > eliminating NCQ as I understand it). My brief understanding is that a > higher tag count allows the SAS adapter to send more commands to the > drive in parallel, allowing the drive to make the decisions about > command ordering. If that is accurate, and the controller firmware > was bad, I assume this would be a far more common bug that would have > been fixed already. > > On the other hand, if it only happens during heavy SYNCHRONIZE CACHE > commands in parallel on certain Intel SSD's and only on controllers > (maybe 12gbps?) that can outrun the drive firmware or cause a race > condition (my suspicions here). It seems far more likely this would > have gone unnoticed by Intel. All possible, but rule out the easy things first. If you have access to a 2008-based controller, try that; they have always been solid here. I haven't used the 3008 yet.
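For anyone repeating the tag-depth comparison discussed in this message, the per-drive commands can be generated rather than typed by hand. This is a minimal sketch, assuming the `camcontrol tags <device> -N <count>` form described in camcontrol(8). The device names and the helper name tag_commands are placeholders, not taken from the thread, and the script only prints the commands so they can be reviewed before being run as root.

```python
import shlex


def tag_commands(devices, depth):
    """Return one `camcontrol tags` argv list per device.

    `devices` are CAM device names such as "da6"; `depth` is the number
    of tagged openings to allow (1 effectively disables command queueing).
    """
    if depth < 1:
        raise ValueError("tag depth must be at least 1")
    return [["camcontrol", "tags", dev, "-N", str(depth)] for dev in devices]


if __name__ == "__main__":
    # Hypothetical device names; substitute the members of the pool under test.
    for argv in tag_commands(["da4", "da5", "da6", "da7"], 1):
        print(shlex.join(argv))
```

Running the same script with a depth of 255 gives the commands for the other half of the comparison, so the two runs differ only in the tag count.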
From owner-freebsd-scsi@freebsd.org Tue Jun 7 21:22:42 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4F8A8B6E2BE for ; Tue, 7 Jun 2016 21:22:42 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x234.google.com (mail-wm0-x234.google.com [IPv6:2a00:1450:400c:c09::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 1C95017BE for ; Tue, 7 Jun 2016 21:22:41 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x234.google.com with SMTP id n184so154899384wmn.1 for ; Tue, 07 Jun 2016 14:22:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=teMwACyi1sgW4KGvdaiLfbxdDrX5H/SsmN1iAphjsm4=; b=LJNiCRfsFHt9J9wg9bjIYML9Ydb4tgFeieyQHD/YWCOQHjKvbjODj7TjnZKga+GIjo 5SBd27Sg49uGXYYxRo9R5Vcn7g5BA+TvrZhYmzXb/w4phGEXWsG0RbfWNFcNMwQ+OgNk 7MnqOosy4H8lj9gmePO/XslE2GKw7TRazhHs24isbkgBquTYERmxx7/SDUxZE/rjtpDv pLBLWvpoUStKfOZJ33LsOBLlVk/js2i3Z55tNUa5IRAw0SqiDldG7wpvQp5T++8KxNsv kcFZ3PnSNdI5acLAiMptz0Nz+K5ttIGh1KN53NvUlwV9T2MxNXSyTUeU4jnHlBfOb0tU v9OA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=teMwACyi1sgW4KGvdaiLfbxdDrX5H/SsmN1iAphjsm4=; b=YFZg8j1wqlu01uVHE4FuCsKDhHFuQCItBXz9h6Uw3FzJycDz1FHsj9//ZkFf1Ng8CY cXTmMRrxJh9HYa+2AIGNd7X22Dsz6lmDWyUPb8wbTbZCR6z+FcB/75B6Vd5n94mYg6/a LTKbZPjjP6/yO8Z40ufT0gIZ6FbaDuHRH0oIJTd6KpYNt8JuEtbqJ9U9N7gw3nExiClU C8ciVFYjHTsl6SnKprhOv6CgikXwxDlPXWM6Ba0qL6c1KBGPlNJRduZ8OLn03BTeJVlf 
Eo2C0iBPxsdD1wbGcGAGCokIHmSUwnzRakbL2BNhMCoJ/XN4Lcr5t4tVAS8nBCSCzXhX q3BQ== X-Gm-Message-State: ALyK8tKOQDmSWikzyqk54oyJBbygwwmTPt7wZdZftHdjd238RF1smiQtw8wgTplk35GoQNni X-Received: by 10.195.9.97 with SMTP id dr1mr1255631wjd.69.1465334553598; Tue, 07 Jun 2016 14:22:33 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171]) by smtp.gmail.com with ESMTPSA id o4sm27258721wjx.45.2016.06.07.14.22.31 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 14:22:31 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> From: Steven Hartland Message-ID: <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> Date: Tue, 7 Jun 2016 22:22:37 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 21:22:42 -0000 Always da6? On 07/06/2016 21:19, list-news wrote: > Sure Steve: > > # cat /boot/loader.conf | grep trim > vfs.zfs.trim.enabled=0 > > # sysctl vfs.zfs.trim.enabled > vfs.zfs.trim.enabled: 0 > > # uptime > 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 > > # tail -f /var/log/messages: > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). 
CDB: 2a 00 > 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm > 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) > Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, > connector name ( ) > Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 > allocated tm 0xfffffe0001322150 > Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 > Aborting command 0xfffffe0001375580 > Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort > for target ID 16 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). > CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout > cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010) > Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, > connector name ( ) > Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm > 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 > Jun 7 15:13:50 s18 kernel: mpr0: EventReply : > Jun 7 15:13:50 s18 kernel: EventDataLength: 2 > Jun 7 15:13:50 s18 kernel: AckRequired: 0 > Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) > Jun 7 15:13:50 s18 kernel: EventContext: 0x0 > Jun 7 15:13:50 s18 kernel: Flags: 1 > Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started > Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 > Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b > 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm > 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b > scsi 0 state c xfer 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b > 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). 
CDB: 28 00 0b > 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm > 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b > scsi 0 state c xfer 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b > 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a > 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm > 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b > scsi 0 state c xfer 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a > 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 > state c xfer 0 > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 > 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm > 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 > scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 > 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm > 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 > Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during > recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout > Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc > 804b scsi 0 sta(da6:te c xfer 0 > Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort > TaskMID 1016 status 0x0 code 0x0 count 5 > Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 finished > recovery after aborting TaskMID 1016 > Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command > Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16 > Jun 7 15:13:50 s18 kernel: mpr0: EventReply : > Jun 7 15:13:50 s18 kernel: EventDataLength: 4 > Jun 7 15:13:50 s18 kernel: AckRequired: 0 > Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c) > Jun 7 15:13:50 s18 kernel: EventContext: 0x0 > Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2 > Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9 > Jun 7 15:13:50 s18 kernel: NumPhys: 31 > Jun 7 15:13:50 s18 kernel: NumEntries: 1 > Jun 7 15:13:50 s18 kernel: StartPhyNum: 8 > Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3) > Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 > Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010 > Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb) > Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange > Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on > Event: [16] > Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16] > Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on > Event: [1c] > Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c] > Jun 7 15:13:50 s18 kernel: mpr0: EventReply : > Jun 7 15:13:50 s18 kernel: EventDataLength: 2 > Jun 7 15:13:50 s18 kernel: AckRequired: 0 > Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) > Jun 7 15:13:50 s18 kernel: EventContext: 0x0 > Jun 7 15:13:50 s18 kernel: Flags: 0 > Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete > Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 > Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 > Jun 7 15:13:50 s18 kernel: mpr0: 
(2)->(mprsas_fw_work) Working on > Event: [16] > Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16] > Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). > CDB: 35 00 00 00 00 00 00 00 00 00 > Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status > Error > Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check > Condition > Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT > ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) > Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per > sense data) > > -Kyle > > On 6/7/16 2:53 PM, Steven Hartland wrote: >> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO >> delete but assuming your running ZFS can you set the following in >> loader.conf and see how you get on. >> vfs.zfs.trim.enabled=0 >> >> Regards >> Steve > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" From owner-freebsd-scsi@freebsd.org Tue Jun 7 22:43:21 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 27494B6D741 for ; Tue, 7 Jun 2016 22:43:21 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E938C1BB9 for ; Tue, 7 Jun 2016 22:43:19 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id 014061ED6B9 for ; Tue, 7 Jun 2016 17:43:13 -0500 (CDT) X-Virus-Scanned: 
amavisd-new at furymx.com Received: from mail.furymx.com ([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id 6iyAs-CUw2iI for ; Tue, 7 Jun 2016 17:43:11 -0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id 2C2091ED6AE for ; Tue, 7 Jun 2016 17:43:11 -0500 (CDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> From: list-news Message-ID: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> Date: Tue, 7 Jun 2016 17:43:10 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 22:43:21 -0000 No, it threw errors on both da6 and da7 and then I stopped it. Your last e-mail gave me thoughts though. I have a server with 2008 controllers (entirely different backplane design, cpu, memory, etc). I've moved the 4 drives to that and I'm running the test now. 
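On the question of whether the timeouts are confined to one drive, a small script can tally them from the log instead of reading it by eye. The following is only a sketch: it assumes the "command timeout" line format seen in the /var/log/messages excerpt earlier in this thread, and the names TIMEOUT_RE and count_timeouts are made up for illustration.

```python
import re
from collections import Counter

# Matches driver lines of the form:
#   (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 ... SMID 1016 command timeout cm 0x...
# Group 1 is the device name, group 2 the SCSI command that timed out.
TIMEOUT_RE = re.compile(
    r"\((da\d+):mp[rs]\d+:[^)]*\): (.+?)\. CDB: .*? command timeout "
)


def count_timeouts(lines):
    """Count 'command timeout' events per (device, SCSI command) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Assumes the default FreeBSD syslog location; adjust the path as needed.
    with open("/var/log/messages", errors="replace") as log:
        for (device, command), n in sorted(count_timeouts(log).items()):
            print(f"{device}\t{command}\t{n}")
```

Grouping by command as well as by device makes it easier to see whether SYNCHRONIZE CACHE is over-represented among the timeouts, which is the suspicion raised earlier in the thread.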
# uname = FreeBSD 10.2-RELEASE-p12 #1 r296215 # sysctl dev.mps.0 dev.mps.0.spinup_wait_time: 3 dev.mps.0.chain_alloc_fail: 0 dev.mps.0.enable_ssu: 1 dev.mps.0.max_chains: 2048 dev.mps.0.chain_free_lowwater: 1176 dev.mps.0.chain_free: 2048 dev.mps.0.io_cmds_highwater: 510 dev.mps.0.io_cmds_active: 0 dev.mps.0.driver_version: 20.00.00.00-fbsd dev.mps.0.firmware_version: 17.00.01.00 dev.mps.0.disable_msi: 0 dev.mps.0.disable_msix: 0 dev.mps.0.debug_level: 3 dev.mps.0.%parent: pci5 dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 subdevice=0x3020 class=0x010700 dev.mps.0.%location: slot=0 function=0 dev.mps.0.%driver: mps dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 About 1.5 hours have passed at full load with no errors. The gstat drive busy% seems to hover around 30-40 instead of ~60-70, and overall throughput seems to be 20-30% lower in my rough benchmarks. I'm not sure whether this gets us closer to the answer, but if this doesn't time out on the 2008 controller, it looks like one of these: 1) The Intel drive firmware is somehow being overloaded when connected to the 3008, or 2) The 3008 firmware or driver has an issue reading drive responses, sporadically deciding that a command timed out (when maybe it really didn't). Puzzle pieces: A) Why does setting tags to 1 on drives connected to the 3008 fix the problem? B) With tags at 255, why does postgres (presumably issuing a large number of fsyncs) cause the problem within minutes, while other heavy I/O workloads (zpool scrub, bonnie++, fio), all of which show similar or higher IOPS, take hours to cause the problem (if ever)? I'll let this continue to run as a further test. Thanks again for all the help. -Kyle On 6/7/16 4:22 PM, Steven Hartland wrote: > Always da6?
> > On 07/06/2016 21:19, list-news wrote: >> Sure Steve: >> >> # cat /boot/loader.conf | grep trim >> vfs.zfs.trim.enabled=0 >> >> # sysctl vfs.zfs.trim.enabled >> vfs.zfs.trim.enabled: 0 >> >> # uptime >> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 >> >> # tail -f /var/log/messages: >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 >> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm >> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) >> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >> connector name ( ) >> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 >> allocated tm 0xfffffe0001322150 >> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 >> Aborting command 0xfffffe0001375580 >> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from >> mprsas_send_abort for target ID 16 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). >> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout >> cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010) >> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >> connector name ( ) >> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm >> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 >> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >> Jun 7 15:13:50 s18 kernel: Flags: 1 >> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started >> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). 
CDB: 28 00 >> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm >> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b >> scsi 0 state c xfer 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi >> 0 state c xfer 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm >> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b >> scsi 0 state c xfer 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi >> 0 state c xfer 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm >> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b >> scsi 0 state c xfer 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi >> 0 state c xfer 0 >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 >> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm >> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 >> scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 >> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm >> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 >> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during >> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout >> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). 
>> CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc >> 804b scsi 0 sta(da6:te c xfer 0 >> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort >> TaskMID 1016 status 0x0 code 0x0 count 5 >> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 >> finished recovery after aborting TaskMID 1016 >> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command >> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16 >> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >> Jun 7 15:13:50 s18 kernel: EventDataLength: 4 >> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c) >> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2 >> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9 >> Jun 7 15:13:50 s18 kernel: NumPhys: 31 >> Jun 7 15:13:50 s18 kernel: NumEntries: 1 >> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8 >> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3) >> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010 >> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb) >> Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange >> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on >> Event: [16] >> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16] >> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on >> Event: [1c] >> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c] >> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >> Jun 7 15:13:50 s18 kernel: Flags: 0 >> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete >> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 
0 >> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on >> Event: [16] >> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16] >> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). >> CDB: 35 00 00 00 00 00 00 00 00 00 >> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI >> Status Error >> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check >> Condition >> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT >> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) >> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per >> sense data) >> >> -Kyle >> >> On 6/7/16 2:53 PM, Steven Hartland wrote: >>> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO >>> delete but assuming your running ZFS can you set the following in >>> loader.conf and see how you get on. >>> vfs.zfs.trim.enabled=0 >>> >>> Regards >>> Steve >> >> >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" From owner-freebsd-scsi@freebsd.org Tue Jun 7 23:28:37 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0E6B2B6DEE3 for ; Tue, 7 Jun 2016 23:28:37 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x231.google.com (mail-wm0-x231.google.com [IPv6:2a00:1450:400c:c09::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" 
(verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 9AEDF1C3F for ; Tue, 7 Jun 2016 23:28:36 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x231.google.com with SMTP id k204so89100473wmk.0 for ; Tue, 07 Jun 2016 16:28:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=tnFN9BhE4FbQKpErinXYONtIG6CCIoCOSzMUFwLt9Ws=; b=YWoqU0qk7MNPVPY1UvaNEC8GBhkfgcOLWxp4OHHqyGspro+A8cvJAeEIGo4aTJkyjR OdX2Q4/FwoUzn5cycizRKNQZn6qQ/+pltWK3nrXXjxWjWgmwrLNs/sfyrryxkWkrjU8A RnOoGY78B3mVOaa2/a5VhFe1qz5UHq2AdSxSHqQFH18yHNsWisKoLQBdPwyNm7tAlpjI w6Nm2MXaNukfUPZFdKKbzz7z6lRLRHZvrwBOJlllSM24nkfhx9ncgU2ZsRbgaewe0W3g 4AOQTfoYU38rMlayNt81xq8tcGbuS1lXkzSUDpxbWnQu70UBLm61cc4ErWRdPAgVGyk2 KPBg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=tnFN9BhE4FbQKpErinXYONtIG6CCIoCOSzMUFwLt9Ws=; b=V1Ot6IpBzDFJsqo8mze4QzwiGb3i7JEVFYtOWORPXRafWBSYZ8IzqFVyyy7nhne35E w6pKzmUG0YY+DmuZu8uYPYrTjHdB9Hs/krI1X4ivqAcpitNB7//DWoAT2a7/m6ghvm31 F7cG0HdCBJJwr5Q18sNUjOp0MKdHHS8grvgaboFS1FKsp+EIWaaQcnl+rj9jv0CO2wV4 ZO44BQ+PiIEI09qP43hRBYdlYvF0OLJNeDU5QbuQMTtDmI+NXqo03yzVICVy1mYdccjP RfRhX/1EycrRifmo4rMh2zmzUv8MOMsRHnCPrjlEYSh5vcbXkh0OJgGtpRAcSlmqOSep MiYQ== X-Gm-Message-State: ALyK8tKo9+yR8ga10ypevPmevWO2irjWT0B2YeulMdLPxQCONIV7gJ17K63FY4Jmz8dJggjv X-Received: by 10.28.132.144 with SMTP id g138mr4836615wmd.47.1465342113847; Tue, 07 Jun 2016 16:28:33 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. 
[82.69.141.171]) by smtp.gmail.com with ESMTPSA id d195sm21730589wmd.12.2016.06.07.16.28.32 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 16:28:32 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> From: Steven Hartland Message-ID: <73dd23bd-7989-6dde-f3ff-e6e51610390a@multiplay.co.uk> Date: Wed, 8 Jun 2016 00:28:38 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 23:28:37 -0000 If that works, I'd swap the 3008 into the machine that currently has the 2008 and retest. That will help confirm whether or not the 3008 card and driver are a potential issue. On 07/06/2016 23:43, list-news wrote: > No, it threw errors on both da6 and da7 and then I stopped it. > > Your last e-mail gave me thoughts though. I have a server with 2008 > controllers (entirely different backplane design, cpu, memory, etc). > I've moved the 4 drives to that and I'm running the test now.
> > # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215 > # sysctl dev.mps.0 > dev.mps.0.spinup_wait_time: 3 > dev.mps.0.chain_alloc_fail: 0 > dev.mps.0.enable_ssu: 1 > dev.mps.0.max_chains: 2048 > dev.mps.0.chain_free_lowwater: 1176 > dev.mps.0.chain_free: 2048 > dev.mps.0.io_cmds_highwater: 510 > dev.mps.0.io_cmds_active: 0 > dev.mps.0.driver_version: 20.00.00.00-fbsd > dev.mps.0.firmware_version: 17.00.01.00 > dev.mps.0.disable_msi: 0 > dev.mps.0.disable_msix: 0 > dev.mps.0.debug_level: 3 > dev.mps.0.%parent: pci5 > dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 > subdevice=0x3020 class=0x010700 > dev.mps.0.%location: slot=0 function=0 > dev.mps.0.%driver: mps > dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 > > About 1.5 hours has passed at full load, no errors. > > gstat drive busy% seems to hang out around 30-40 instead of ~60-70. > Overall throughput seems to be 20-30% less with my rough benchmarks. > > I'm not sure if this gets us closer to the answer, if this doesn't > time-out on the 2008 controller, it looks like one of these: > 1) The Intel drive firmware is being overloaded somehow when connected > to the 3008. > or > 2) The 3008 firmware or driver has an issue reading drive responses, > sporadically thinking the command timed-out (when maybe it really > didn't). > > Puzzle pieces: > A) Why does setting tags of 1 on drives connected to the 3008 fix the > problem? > B) With tags of 255. Why does postgres (and assuming a large fsync > count), seem to cause the problem within minutes? While running other > heavy i/o commands (zpool scrub, bonnie++, fio), all of which show > similarly high or higher iops take hours to cause the problem (if ever). > > I'll let this continue to run to further test. > > Thanks again for all the help. 
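For anyone reading the mpr0 log excerpts quoted in this thread: the "CDB:" bytes on the READ(10), WRITE(10) and SYNCHRONIZE CACHE(10) lines can be decoded by hand to see which LBA and how many blocks each timed-out command covered. A minimal sketch follows. The function name decode_cdb10 and the 512-byte logical block size are assumptions made for illustration, not something stated in the thread, although 512 bytes does agree with the "length" values the driver logs.

```python
# Sketch only: decode the 10-byte CDBs printed in the mpr0 log lines.
# Assumption: 512-byte logical blocks (not stated in the thread; it matches
# the "length" field the driver prints next to each CDB).

OPCODES = {
    0x28: "READ(10)",
    0x2A: "WRITE(10)",
    0x35: "SYNCHRONIZE CACHE(10)",
}


def decode_cdb10(hex_bytes, block_size=512):
    """Decode a 10-byte CDB given as space-separated hex, as in the log."""
    cdb = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is skipped
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: number of blocks
    return {
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


if __name__ == "__main__":
    # The WRITE(10) that timed out on da6 in the quoted log.
    print(decode_cdb10("2a 00 2b d8 86 50 00 00 b0 00"))
```

With the WRITE(10) from the log, the block count field is 0xb0 (176 blocks), which at 512 bytes per block is the 90112 bytes the driver reports as "length", so the assumed block size is at least consistent with the log.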
From owner-freebsd-scsi@freebsd.org Tue Jun 7 23:30:17 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DEAFFB6DF5E for ; Tue, 7 Jun 2016 23:30:17 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x22c.google.com (mail-wm0-x22c.google.com [IPv6:2a00:1450:400c:c09::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 60CC11CA2 for ; Tue, 7 Jun 2016 23:30:17 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x22c.google.com with SMTP id v199so40150582wmv.0 for ; Tue, 07 Jun 2016 16:30:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=1TfW7jFNyOrL2O7rvTiGr+aqoGaypK6Tz4q1/+c1r2w=; b=mwFWGq6LF1yQl3jaiYSEVO3msN5OeZAZTkhrRiDY9ufKoJY0Y2GgSUXg6xliw5xg0N 5413v2GmCjt8+wUMgBWyy3+WhMIpH/05zVo80z8OmpzDiclm16IpO9sYVspvj5JrActo K/HdTE6uwtfX+FMOekoUJXXe8QMfP1m8vLylbtClaaIyshtUKBmPv5kKLi7Z78Uzqmti orHdB030+tkQG0sruvd3gjcBFX5g62LOenMBtpyRDpqM8HilHHAHkBo9c8i3G6uv/Iqu jqCiX/cQfyWbQxI2ECj7bDaQ2xRQjdXv+1VFyE02OuNkCmd9C2hOhzVpGRkFQ1ABLDoo cFaw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=1TfW7jFNyOrL2O7rvTiGr+aqoGaypK6Tz4q1/+c1r2w=; b=HZtbFtC0yYzgCvD/Heu8TylN0EV8fpRG+NPBDpPx7J1Gluq+C6XRpvswpCxIRYNIiD X9AwvhmjogVAWURUlj4xEWTRaYWBN4vhQcEmG0jekPRExTfHMJ5XFNHAcCYunNhqm6Fd 4dXlNwI8sQqWUQBPQKpzB5VBJB4/yvsav2FPpnbcm/fVULuerkWFbNghwChlLkLR03aE v+ax9ornPICEFNH0PQ9S0XtZtD+7DWo/tGmgUx6kifbYw5DPexe/eOYORygbjwWvF4qh 
H5Bj4Tk4g5P2XlTm1MHcuTR6Xm1CdsKq6tSaQqkqYHZ/YyMJg3YsGY4fuo+thHzo7YJ9 isCg== X-Gm-Message-State: ALyK8tLAsBGlvMGzqvBbrEXdUjCU5S+Pfe+V9HaWQlS6PzbXcQ1o3c4oWAMXOhjcLHCo2xig X-Received: by 10.194.123.9 with SMTP id lw9mr1713992wjb.53.1465342215759; Tue, 07 Jun 2016 16:30:15 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171]) by smtp.gmail.com with ESMTPSA id o76sm21768597wme.0.2016.06.07.16.30.14 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 07 Jun 2016 16:30:14 -0700 (PDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> From: Steven Hartland Message-ID: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk> Date: Wed, 8 Jun 2016 00:30:19 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 23:30:18 -0000 Oh another thing to test is iirc 3008 is supported by mrsas so you might want to try adding the following into loader.conf to switch drivers: hw.mfi.mrsas_enable="1" On 07/06/2016 23:43, list-news wrote: > No, it 
threw errors on both da6 and da7 and then I stopped it. > > Your last e-mail gave me thoughts though. I have a server with 2008 > controllers (entirely different backplane design, cpu, memory, etc). > I've moved the 4 drives to that and I'm running the test now. > > # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215 > # sysctl dev.mps.0 > dev.mps.0.spinup_wait_time: 3 > dev.mps.0.chain_alloc_fail: 0 > dev.mps.0.enable_ssu: 1 > dev.mps.0.max_chains: 2048 > dev.mps.0.chain_free_lowwater: 1176 > dev.mps.0.chain_free: 2048 > dev.mps.0.io_cmds_highwater: 510 > dev.mps.0.io_cmds_active: 0 > dev.mps.0.driver_version: 20.00.00.00-fbsd > dev.mps.0.firmware_version: 17.00.01.00 > dev.mps.0.disable_msi: 0 > dev.mps.0.disable_msix: 0 > dev.mps.0.debug_level: 3 > dev.mps.0.%parent: pci5 > dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 > subdevice=0x3020 class=0x010700 > dev.mps.0.%location: slot=0 function=0 > dev.mps.0.%driver: mps > dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 > > About 1.5 hours has passed at full load, no errors. > > gstat drive busy% seems to hang out around 30-40 instead of ~60-70. > Overall throughput seems to be 20-30% less with my rough benchmarks. > > I'm not sure if this gets us closer to the answer, if this doesn't > time-out on the 2008 controller, it looks like one of these: > 1) The Intel drive firmware is being overloaded somehow when connected > to the 3008. > or > 2) The 3008 firmware or driver has an issue reading drive responses, > sporadically thinking the command timed-out (when maybe it really > didn't). > > Puzzle pieces: > A) Why does setting tags of 1 on drives connected to the 3008 fix the > problem? > B) With tags of 255. Why does postgres (and assuming a large fsync > count), seem to cause the problem within minutes? While running other > heavy i/o commands (zpool scrub, bonnie++, fio), all of which show > similarly high or higher iops take hours to cause the problem (if ever). 
> > I'll let this continue to run to further test. > > Thanks again for all the help. > > -Kyle > > On 6/7/16 4:22 PM, Steven Hartland wrote: >> Always da6? >> >> On 07/06/2016 21:19, list-news wrote: >>> Sure Steve: >>> >>> # cat /boot/loader.conf | grep trim >>> vfs.zfs.trim.enabled=0 >>> >>> # sysctl vfs.zfs.trim.enabled >>> vfs.zfs.trim.enabled: 0 >>> >>> # uptime >>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 >>> >>> # tail -f /var/log/messages: >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 >>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm >>> 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) >>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >>> connector name ( ) >>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 >>> allocated tm 0xfffffe0001322150 >>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 >>> Aborting command 0xfffffe0001375580 >>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from >>> mprsas_send_abort for target ID 16 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>> CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 >>> command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target >>> 16, handle(0x0010) >>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >>> connector name ( ) >>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm >>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 >>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>> Jun 7 15:13:50 s18 kernel: Flags: 1 >>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started >>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm >>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm >>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 >>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm >>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). 
CDB: 28 00 >>> 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b >>> scsi 0 state c xfer 0 >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 >>> 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm >>> 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 >>> scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 >>> 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm >>> 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 >>> b0 00 >>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during >>> recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout >>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 >>> terminated ioc 804b scsi 0 sta(da6:te c xfer 0 >>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort >>> TaskMID 1016 status 0x0 code 0x0 count 5 >>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 >>> finished recovery after aborting TaskMID 1016 >>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command >>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16 >>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4 >>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c) >>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2 >>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9 >>> Jun 7 15:13:50 s18 kernel: NumPhys: 31 >>> Jun 7 15:13:50 s18 kernel: NumEntries: 1 >>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8 >>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3) >>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010 >>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb) >>> Jun 7 15:13:50 s18 kernel: 
PHY[8].PhyStatus: PHYLinkStatusChange >>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on >>> Event: [16] >>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: >>> [16] >>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on >>> Event: [1c] >>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: >>> [1c] >>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>> Jun 7 15:13:50 s18 kernel: Flags: 0 >>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete >>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on >>> Event: [16] >>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: >>> [16] >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI >>> Status Error >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check >>> Condition >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT >>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) >>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per >>> sense data) >>> >>> -Kyle >>> >>> On 6/7/16 2:53 PM, Steven Hartland wrote: >>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using >>>> BIO delete but assuming your running ZFS can you set the following >>>> in loader.conf and see how you get on. 
>>>> vfs.zfs.trim.enabled=0 >>>> >>>> Regards >>>> Steve >>> >>> >>> _______________________________________________ >>> freebsd-scsi@freebsd.org mailing list >>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" >> >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" > > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org" From owner-freebsd-scsi@freebsd.org Tue Jun 7 23:39:45 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CB703B6E239 for ; Tue, 7 Jun 2016 23:39:45 +0000 (UTC) (envelope-from david@gwynne.id.au) Received: from mail-pf0-x22f.google.com (mail-pf0-x22f.google.com [IPv6:2607:f8b0:400e:c00::22f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id A102C11E5 for ; Tue, 7 Jun 2016 23:39:45 +0000 (UTC) (envelope-from david@gwynne.id.au) Received: by mail-pf0-x22f.google.com with SMTP id 62so82376782pfd.1 for ; Tue, 07 Jun 2016 16:39:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gwynne-id-au.20150623.gappssmtp.com; s=20150623; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=rk2IlZGiblIVZdL+OB/GGS15X2/fOyMyQ88p45n9X1g=; b=BjDuoJ+r+OGah2hKqNbwxOCS0FKsHmfHKrNFC7E+fJpxPkLljZmNZul2ADXGGPdazK tPGrnDQ6oP4DZmx0so3Sm0F5u3FVMganJX22vV95lkORgn25SFmaRzLdQyrAOoFhRL4o 
AEZSEahq2GdzmatQkowHa1DV6uWwLIoXLvH4lj4eaLMnZRWm30Y1D/Jos2r00k8VN6XV mLHV+vHgx4xKNy1mSpolCkECzxdLU35UxwFJCHR3YgD1XK68bBBd0VXb8juWt71xAcxh kG5SqCAK81GSoOLrlQzEMQOBhL7Ki4Ad6vxv0kBdfQrJwOVeTiScrJ752zw8hbkO0cOe CLBw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=rk2IlZGiblIVZdL+OB/GGS15X2/fOyMyQ88p45n9X1g=; b=fUfxmwZ05Aple/EgmHkxDAJgC0Sx3GZDkgGhimHmKFQHD9Bqbxf4NRHfexMulRt8Fl OZuDx/+DPKqBiQcYxiPpHm64zzvkbGJKeUt787zxPuuYcCv83fz5jnW0BSyAcg6s/e06 h085kK7sXqkXIDD4HY/YlWLshhqtPOmrvoMCmIiviOEbJEqLckvJxuj5YncN2DZF2rsT R1ypIYupii7QT8cnINL3L7wT9YKk+3icVsOCEfTpkOlQJe141PDE/CV7DjCmGtPWK5bG /fgsj7dZIHKFh5vej1A0LNBQl9vkKPo7jQFnLBklh44KnkYYMLHx51r6SYf4I+nRcfs7 v+lA== X-Gm-Message-State: ALyK8tLJDWZzlkEA/pxwzsdASEJC+Uc4Eavflo46O8HnS2ZdeJgXf/A63dAdleX5bcnwuw== X-Received: by 10.98.58.77 with SMTP id h74mr2154346pfa.156.1465342784506; Tue, 07 Jun 2016 16:39:44 -0700 (PDT) Received: from opiate.eait.uq.edu.au (a82-177.nat.uq.edu.au. 
[130.102.82.177]) by smtp.gmail.com with ESMTPSA id 129sm11832387pfe.3.2016.06.07.16.39.41 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 07 Jun 2016 16:39:43 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts From: David Gwynne In-Reply-To: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk> Date: Wed, 8 Jun 2016 09:39:38 +1000 Cc: freebsd-scsi@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <4C234AB2-80E5-49A3-B5BB-24F425AFF067@gwynne.id.au> References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk> To: Steven Hartland X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 23:39:45 -0000 > On 8 Jun 2016, at 09:30, Steven Hartland wrote: > > Oh another thing to test is iirc 3008 is supported by mrsas so you might want to try adding the following into loader.conf to switch drivers: > hw.mfi.mrsas_enable="1" i believe the 3008s can run two different firmwares, one that provides the mpt2 interface and the other that provides the megaraid sas fusion interface. you have to flash them to switch though; you can't just point a driver at it and hope for the best. each fw presents different pci ids.
eg, in http://pciids.sourceforge.net/v2.2/pci.ids you can see: 005f MegaRAID SAS-3 3008 [Fury] 0097 SAS3008 PCI-Express Fusion-MPT SAS-3 dlg > > On 07/06/2016 23:43, list-news wrote: >> No, it threw errors on both da6 and da7 and then I stopped it. >> >> Your last e-mail gave me thoughts though. I have a server with 2008 controllers (entirely different backplane design, cpu, memory, etc). I've moved the 4 drives to that and I'm running the test now. >> >> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215 >> # sysctl dev.mps.0 >> dev.mps.0.spinup_wait_time: 3 >> dev.mps.0.chain_alloc_fail: 0 >> dev.mps.0.enable_ssu: 1 >> dev.mps.0.max_chains: 2048 >> dev.mps.0.chain_free_lowwater: 1176 >> dev.mps.0.chain_free: 2048 >> dev.mps.0.io_cmds_highwater: 510 >> dev.mps.0.io_cmds_active: 0 >> dev.mps.0.driver_version: 20.00.00.00-fbsd >> dev.mps.0.firmware_version: 17.00.01.00 >> dev.mps.0.disable_msi: 0 >> dev.mps.0.disable_msix: 0 >> dev.mps.0.debug_level: 3 >> dev.mps.0.%parent: pci5 >> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 subdevice=0x3020 class=0x010700 >> dev.mps.0.%location: slot=0 function=0 >> dev.mps.0.%driver: mps >> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 >> >> About 1.5 hours has passed at full load, no errors. >> >> gstat drive busy% seems to hang out around 30-40 instead of ~60-70. Overall throughput seems to be 20-30% less with my rough benchmarks. >> >> I'm not sure if this gets us closer to the answer, if this doesn't time-out on the 2008 controller, it looks like one of these: >> 1) The Intel drive firmware is being overloaded somehow when connected to the 3008. >> or >> 2) The 3008 firmware or driver has an issue reading drive responses, sporadically thinking the command timed-out (when maybe it really didn't). >> >> Puzzle pieces: >> A) Why does setting tags of 1 on drives connected to the 3008 fix the problem? >> B) With tags of 255.
Why does postgres (and assuming a large fsync = count), seem to cause the problem within minutes? While running other = heavy i/o commands (zpool scrub, bonnie++, fio), all of which show = similarly high or higher iops take hours to cause the problem (if ever). >>=20 >> I'll let this continue to run to further test. >>=20 >> Thanks again for all the help. >>=20 >> -Kyle >>=20 >> On 6/7/16 4:22 PM, Steven Hartland wrote: >>> Always da6? >>>=20 >>> On 07/06/2016 21:19, list-news wrote: >>>> Sure Steve: >>>>=20 >>>> # cat /boot/loader.conf | grep trim >>>> vfs.zfs.trim.enabled=3D0 >>>>=20 >>>> # sysctl vfs.zfs.trim.enabled >>>> vfs.zfs.trim.enabled: 0 >>>>=20 >>>> # uptime >>>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 >>>>=20 >>>> # tail -f /var/log/messages: >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a = 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm = 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) >>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, = connector name ( ) >>>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 = allocated tm 0xfffffe0001322150 >>>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 = Aborting command 0xfffffe0001375580 >>>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from = mprsas_send_abort for target ID 16 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE = CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command = timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, = handle(0x0010) >>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, = connector name ( ) >>>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm = 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 >>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>>> Jun 7 15:13:50 s18 kernel: Flags: 1 >>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started >>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm = 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi = 0 state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 = state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm = 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi = 0 state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 = state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm = 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi = 0 state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). 
CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 state c xfer 0
>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 finished recovery after aborting TaskMID 1016
>>>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4
>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>> Jun 7 15:13:50 s18 kernel: NumPhys: 31
>>>> Jun 7 15:13:50 s18 kernel: NumEntries: 1
>>>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8
>>>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>> Jun 7 15:13:50 s18 kernel:
PHY[8].PhyStatus: PHYLinkStatusChange
>>>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on Event: [16]
>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on Event: [1c]
>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun 7 15:13:50 s18 kernel: Flags: 0
>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on Event: [16]
>>>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
>>>>
>>>> -Kyle
>>>>
>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>> CDB: 85 is a TRIM command IIRC. I know you tried it before using BIO delete, but assuming you're running ZFS, can you set the following in loader.conf and see how you get on?
>>>>> vfs.zfs.trim.enabled=0
>>>>>
>>>>> Regards
>>>>> Steve
>>>>
>>>>
>>>> _______________________________________________
>>>> freebsd-scsi@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Fri Jun 10 09:33:24 2016
From: bugzilla-noreply@freebsd.org
To: freebsd-scsi@FreeBSD.org
Subject:
[Bug 202625] [cam][libcam][patch] PERSISTENT RESERVE OUT needs scsi_cmd->length to be populated
Date: Fri, 10 Jun 2016 09:33:23 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.2-RELEASE
X-Bugzilla-Keywords: patch
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: andrew.hotlab@hotmail.com
X-Bugzilla-Status: New
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-bugs@FreeBSD.org
X-Bugzilla-Changed-Fields: cc
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202625

Andrew changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |freebsd-scsi@FreeBSD.org

--- Comment #2 from Andrew ---
Adding the freebsd-scsi list to the discussion, hoping that a committer could
notice it and commit this patch. Thanks!
--Andrew

--
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-scsi@freebsd.org Fri Jun 10 16:36:34 2016
From: bugzilla-noreply@freebsd.org
To: freebsd-scsi@FreeBSD.org
Subject: [Bug 202625] [cam][libcam][patch] PERSISTENT RESERVE OUT needs scsi_cmd->length to be populated
Date: Fri, 10 Jun 2016 16:36:34 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.2-RELEASE
X-Bugzilla-Keywords: patch
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: asomers@FreeBSD.org
X-Bugzilla-Status: New
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: ken@FreeBSD.org
X-Bugzilla-Changed-Fields: cc assigned_to
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202625

Alan Somers changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |asomers@FreeBSD.org
           Assignee|freebsd-bugs@FreeBSD.org    |ken@FreeBSD.org

--
You are receiving this mail because:
You are on the CC list for the bug.