From owner-freebsd-scsi@freebsd.org Tue Jun 7 19:24:40 2016 Return-Path: Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 08AB6B6D148 for ; Tue, 7 Jun 2016 19:24:40 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mail.furymx.com (mindpack.mx1.furymx.net [64.141.130.10]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DB11D16F5 for ; Tue, 7 Jun 2016 19:24:39 +0000 (UTC) (envelope-from list-news@mindpackstudios.com) Received: from mindpack.furymx.net (mindpack.mx1.furymx.net [10.10.1.10]) by mail.furymx.com (Postfix) with ESMTP id 021FA21B7D5 for ; Tue, 7 Jun 2016 14:24:38 -0500 (CDT) X-Virus-Scanned: amavisd-new at furymx.com Received: from mail.furymx.com ([10.10.1.10]) by mindpack.furymx.net (mail.furymx.com [10.10.1.10]) (amavisd-new, port 10024) with ESMTP id GdMerZY5S3fG for ; Tue, 7 Jun 2016 14:24:36 -0500 (CDT) Received: from vortex.local (c-98-215-180-176.hsd1.in.comcast.net [98.215.180.176]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: kyle@mindpackstudios.com) by mail.furymx.com (Postfix) with ESMTPSA id 0DC2D21B7CE for ; Tue, 7 Jun 2016 14:24:36 -0500 (CDT) Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts To: freebsd-scsi@freebsd.org References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> From: list-news Message-ID: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> Date: Tue, 7 Jun 2016 14:24:35 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.1 MIME-Version: 1.0 In-Reply-To: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-scsi@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: SCSI subsystem List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 07 Jun 2016 19:24:40 -0000 I have additional confirmation that it's not faulty hardware. I moved the 4 disks that carry the postgresql database over to another server (same model - TWIN 2028-DECR). Mounted the zpool and fired up my application. This server is using a much earlier firmware on the SAS controller. Different CPU / Memory / etc. Errors happen within the first couple minutes, and continue every few minutes (notice time-stamps for each drive timeout every few minutes): Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 ea dc 60 00 00 08 00 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command timeout Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e0 00 00 08 00 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI Status Error Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check Condition Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per sense data) Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 Aborting command 0xfffffe0000be0140 Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 10 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35 26 ca f0 00 00 08 00 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee f0 00 00 08 00 Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per sense data) Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 Aborting command 0xfffffe0000bfcca0 Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 10 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e d6 de f0 00 00 08 00 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: 10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 8d 99 48 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 6b df b8 00 00 10 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd a8 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd b0 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e d6 de f0 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 Aborting command 0xfffffe0000c1e960 Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 11 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 10 00 Jun 7 13:15:12 s17 kernel: 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command timeout Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 state c xfer (da3:0 Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 terminated ioc 804b scsi 0 sta11:te c xfer 0 Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status Error Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check Condition Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per sense data) gstat output: (I'm guessing I caught this during the da2 error) #gstat -do L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d o/s ms/o %busy Name 70 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da10 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da11 I then set the tags down to 1 for each device: #camcontrol tags da2 -N 1 #camcontrol tags da3 -N 1 #camcontrol tags da10 -N 1 #camcontrol tags da11 -N 1 And, no errors for the last hour, system still running at full load. Everything is feeling like an NCQ firmware issue. Intel s3610 says it supports NCQ in it's SSDs with 32 tags. But I've pulled the errors with tags set to 8 plenty of times. (See NCQ line below.) # camcontrol identify da2 pass2: ACS-2 ATA SATA 3.x device pass2: 1200.000MB/s transfers, Command Queueing Enabled protocol ATA/ATAPI-9 SATA 3.x device model INTEL SSDSC2BX480G4 firmware revision G2010150 serial number [redacted] WWN [redacted] cylinders 16383 heads 16 sectors/track 63 sector size logical 512, physical 4096, offset 0 LBA supported 268435455 sectors LBA48 supported 937703088 sectors PIO supported PIO4 DMA supported WDMA2 UDMA6 media RPM non-rotating Feature Support Enabled Value Vendor read ahead yes yes write cache yes yes flush cache yes yes overlap no Tagged Command Queuing (TCQ) no no Native Command Queuing (NCQ) yes 32 tags NCQ Queue Management no NCQ Streaming no Receive & Send FPDMA Queued no SMART yes yes microcode download yes yes security yes no power management yes yes advanced power management no no automatic acoustic management no no media status notification no no power-up in Standby no no write-read-verify no no unload yes yes general purpose logging yes yes free-fall no no Data Set Management (DSM/TRIM) yes DSM - max 512byte blocks yes 4 DSM - deterministic read yes zeroed Host Protected Area (HPA) yes no 937703088/937703088 HPA - Security no And it doesn't appear I have any way to deactivate it in firmware. Which would be a nice test. I did attempt this with no luck: # camcontrol negotiate da2 -T disable (pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s (pass2:mpr0:0:10:0): tagged queueing: enabled camcontrol: XPT_SET_TRANS_SETTINGS CCB failed -Kyle On 6/7/16 12:09 PM, list-news wrote: > The system is a Twin. In the first post I mentioned this but I > probably wasn't clear. > > The twin unit is this one: > https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm > > I've used all components from twin node A and B (cpu / memory / > mainboard / controller). I still get the errors. The backplane was > the original thought of concern, and that has been RMA'd and replaced > - errors continue. I've even swapped out power supplies with another > identical unit I have here. > > In every case the errors continue, until I do this: > #camcontrol daX -N 1 > (for each drive in the zpool) > > Then the errors stop. > > The system errors every few minutes while my application is running. > Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu and > drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it > finishes a batch, no errors are reported with -N set to 1. If I set > tags with -N 255 for each device, errors start again within 5 minutes, > and continue every 2-5 minutes, until the batch is finished. > > -Kyle > >> I would try, if possible, to swap the controller. >> >> >> >> >> >> >> Borja. >> >> > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"