Date: Tue, 7 Jun 2016 14:24:35 -0500 From: list-news <list-news@mindpackstudios.com> To: freebsd-scsi@freebsd.org Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts Message-ID: <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> In-Reply-To: <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com>
next in thread | previous in thread | raw e-mail | index | archive | help
I have additional confirmation that it's not faulty hardware. I moved the 4 disks that carry the postgresql database over to another server (same model - TWIN 2028-DECR). Mounted the zpool and fired up my application. This server is using a much earlier firmware on the SAS controller. Different CPU / Memory / etc. Errors happen within the first couple minutes, and continue every few minutes (notice time-stamps for each drive timeout every few minutes): Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 33 91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36 ea dc 60 00 00 08 00 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command timeout Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e 74 79 e0 00 00 08 00 Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI Status Error Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check Condition Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per sense data) Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4 Aborting command 0xfffffe0000be0140 Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 10 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19 be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21 3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35 26 ca f0 00 00 08 00 Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f6 ee f0 00 00 08 00 Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per sense data) Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5 Aborting command 0xfffffe0000bfcca0 Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 10 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e d6 de f0 00 00 08 00 Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440): originator(PL), code(0x12), sub_code(0x0440) Jun 7 13:13:05 s17 kernel: 10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b 8d 99 48 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13 6b df b8 00 00 10 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd a8 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d f7 cd b0 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e d6 de f0 00 00 08 00 Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request completed with an error Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6 Aborting command 0xfffffe0000c1e960 Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 11 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23 7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31 bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0 state c xfer 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0 state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 10 00 Jun 7 13:15:12 s17 kernel: 0 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command timeout Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19 7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0 state c xfer (da3:0 Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441 terminated ioc 804b scsi 0 sta11:te c xfer 0 Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status Error Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check Condition Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per sense data) gstat output: (I'm guessing I caught this during the da2 error) #gstat -do L(q) ops/s r/s kBps ms/r w/s kBps ms/w d/s kBps ms/d o/s ms/o %busy Name 70 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da2 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da3 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da10 0 0 0 0 0.0 0 0 0.0 0 0 0.0 0 0.0 0.0| da11 I then set the tags down to 1 for each device: #camcontrol tags da2 -N 1 #camcontrol tags da3 -N 1 #camcontrol tags da10 -N 1 #camcontrol tags da11 -N 1 And, no errors for the last hour, system still running at full load. Everything is feeling like an NCQ firmware issue. Intel s3610 says it supports NCQ in it's SSDs with 32 tags. But I've pulled the errors with tags set to 8 plenty of times. (See NCQ line below.) # camcontrol identify da2 pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device pass2: 1200.000MB/s transfers, Command Queueing Enabled protocol ATA/ATAPI-9 SATA 3.x device model INTEL SSDSC2BX480G4 firmware revision G2010150 serial number [redacted] WWN [redacted] cylinders 16383 heads 16 sectors/track 63 sector size logical 512, physical 4096, offset 0 LBA supported 268435455 sectors LBA48 supported 937703088 sectors PIO supported PIO4 DMA supported WDMA2 UDMA6 media RPM non-rotating Feature Support Enabled Value Vendor read ahead yes yes write cache yes yes flush cache yes yes overlap no Tagged Command Queuing (TCQ) no no Native Command Queuing (NCQ) yes 32 tags NCQ Queue Management no NCQ Streaming no Receive & Send FPDMA Queued no SMART yes yes microcode download yes yes security yes no power management yes yes advanced power management no no automatic acoustic management no no media status notification no no power-up in Standby no no write-read-verify no no unload yes yes general purpose logging yes yes free-fall no no Data Set Management (DSM/TRIM) yes DSM - max 512byte blocks yes 4 DSM - deterministic read yes zeroed Host Protected Area (HPA) yes no 937703088/937703088 HPA - Security no And it doesn't appear I have any way to deactivate it in firmware. Which would be a nice test. I did attempt this with no luck: # camcontrol negotiate da2 -T disable (pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s (pass2:mpr0:0:10:0): tagged queueing: enabled camcontrol: XPT_SET_TRANS_SETTINGS CCB failed -Kyle On 6/7/16 12:09 PM, list-news wrote: > The system is a Twin. In the first post I mentioned this but I > probably wasn't clear. > > The twin unit is this one: > https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm > > I've used all components from twin node A and B (cpu / memory / > mainboard / controller). I still get the errors. The backplane was > the original thought of concern, and that has been RMA'd and replaced > - errors continue. I've even swapped out power supplies with another > identical unit I have here. > > In every case the errors continue, until I do this: > #camcontrol daX -N 1 > (for each drive in the zpool) > > Then the errors stop. > > The system errors every few minutes while my application is running. > Set tags to -N 1, and everything goes quiet. 16 cores at 100% cpu and > drives 80% busy @ ~15k IO p/s, for about 5 hours solid before it > finishes a batch, no errors are reported with -N set to 1. If I set > tags with -N 255 for each device, errors start again within 5 minutes, > and continue every 2-5 minutes, until the batch is finished. > > -Kyle > >> I would try, if possible, to swap the controller. >> >> >> >> >> >> >> Borja. >> >> > > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?583dddc6-4614-9900-88f7-27347866d7aa>