Date: Mon, 13 Jul 2015 10:13:32 +0100
From: Steven Hartland <killing@multiplay.co.uk>
To: freebsd-scsi@freebsd.org
Subject: Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-ID: <55A3813C.7010002@multiplay.co.uk>
In-Reply-To: <20150713110148.1a27b973881b64ce2f9e98e0@yamagi.org>
References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org> <20150713110148.1a27b973881b64ce2f9e98e0@yamagi.org>

That would indicate that TRIM on your disks is causing a problem,
possibly a firmware bug causing TRIM requests to take an excessively
long time to complete.

What do you see from:
sysctl -a | grep -E '(delete|trim)'

Also, while you're seeing timeouts, what does the output from
gstat -d -p look like?

Regards
Steve

On 13/07/2015 10:01, Yamagi Burmeister wrote:
> Hello,
> after some fiddling and testing I managed to track this down. TRIM is
> the culprit:
>
> - With vfs.zfs.trim.enabled set to 1 timeouts occur, regardless of
>   cabling and of whether a backplane or a direct connection is used.
>   It doesn't matter if Intel DC S3500 or S3700 SSDs are connected, but
>   on the other hand both share the same controller. I don't have enough
>   onboard S-ATA ports to test the whole setup without the 9300-8i HBA,
>   but a short (maybe too short and without enough load) test with 6
>   SSDs didn't show any timeouts.
>
> - With vfs.zfs.trim.enabled set to 0 I haven't seen a single timeout
>   for ~56 hours.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 13:24:16 +0200
> Yamagi Burmeister <lists@yamagi.org> wrote:
>
>> Hello,
>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
>> adapter serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE
>> as of r283938 on 2 servers and r285196 on the last one.
>>
>> The controllers identify themselves as:
>>
>> ----
>>
>> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem
>> 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on
>> pci2 mpr0: IOCFacts : MsgVersion: 0x205
>> HeaderVersion: 0x2300
>> IOCNumber: 0
>> IOCExceptions: 0x0
>> MaxChainDepth: 128
>> NumberOfPorts: 1
>> RequestCredit: 10240
>> ProductID: 0x2221
>> IOCRequestFrameSize: 32
>> MaxInitiators: 32
>> MaxTargets: 1024
>> MaxSasExpanders: 42
>> MaxEnclosures: 43
>> HighPriorityCredit: 128
>> MaxReplyDescriptorPostQueueDepth: 65504
>> ReplyFrameSize: 32
>> MaxVolumes: 0
>> MaxDevHandle: 1106
>> MaxPersistentEntries: 128
>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
>> mpr0: IOCCapabilities:
>> 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>
>> ----
>>
>> 08.00.00.00 is the latest available firmware.
>>
>>
>> Since day one 'dmesg' is cluttered with CAM errors:
>>
>> ----
>>
>> mpr1: Sending reset from mprsas_send_abort for target ID 5
>> (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08
>> 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00
>> 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0):
>> READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0
>> state c xfer 0 (da11:mpr1:0:5:0): CAM status: Command timeout mpr1:
>> (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error (da11:mpr1:0:5:0):
>> SCSI status: Check Condition (da11:mpr1:0:5:0): SCSI sense: UNIT
>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da11:mpr1:0:5:0): Retrying command (per sense data) (da11:mpr1:0:5:0):
>> READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00 (da11:mpr1:0:5:0): CAM
>> status: SCSI Status Error (da11:mpr1:0:5:0): SCSI status: Check
>> Condition (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power
>> on, reset, or bus device reset occurred) (da11:mpr1:0:5:0): Retrying
>> command (per sense data) (noperiph:mpr1:0:4294967295:0): SMID 2
>> Aborting command 0xfffffe0001601a30
>>
>> mpr1: Sending reset from mprsas_send_abort for target ID 2
>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
>> length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length
>> 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0 mpr1:
>> Unfreezing devq for target ID 2 (da8:mpr1:0:2:0): ATA COMMAND PASS
>> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
>> (da8:mpr1:0:2:0): CAM status: Command timeout (da8:mpr1:0:2:0):
>> Retrying command (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00
>> 00 30 00 (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>> (da8:mpr1:0:2:0): SCSI status: Check Condition (da8:mpr1:0:2:0): SCSI
>> sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset
>> occurred) (da8:mpr1:0:2:0): Retrying command (per sense data)
>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error (da8:mpr1:0:2:0): SCSI
>> status: Check Condition (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION
>> asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command
>> 0xfffffe000160b660
>>
>> ----
>>
>> ZFS doesn't like this and sees read errors or even write errors. In
>> extreme cases the device is marked as FAULTED:
>>
>> ----
>>
>>   pool: examplepool
>>  state: DEGRADED
>> status: One or more devices are faulted in response to persistent
>> errors. Sufficient replicas exist for the pool to continue functioning
>> in a degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the
>> device repaired.
>>   scan: none requested
>> config:
>>
>>     NAME         STATE     READ WRITE CKSUM
>>     examplepool  DEGRADED     0     0     0
>>       raidz1-0   ONLINE       0     0     0
>>         da3p1    ONLINE       0     0     0
>>         da4p1    ONLINE       0     0     0
>>         da5p1    ONLINE       0     0     0
>>     logs
>>       da1p1      FAULTED      3     0     0  too many errors
>>     cache
>>       da1p2      FAULTED      3     0     0  too many errors
>>     spares
>>       da2p1      AVAIL
>>
>> errors: No known data errors
>>
>> ----
>>
>> The problems arise on all 3 machines and all SSDs nearly daily. So I
>> highly suspect a software issue. Does anyone have an idea what's going
>> on and what I can do to solve these problems? More information can be
>> provided if necessary.
>>
>> Regards,
>> Yamagi
>>
>> --
>> Homepage: www.yamagi.org
>> XMPP: yamagi@yamagi.org
>> GnuPG/GPG: 0xEFBCCBCB
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> http://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>
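
To see which disks are hit most often, the "CAM status: Command timeout"
lines in dmesg output like the one quoted above can be tallied per
device. What follows is only a minimal sketch and not something posted
in the thread: the function name count_cam_timeouts is made up for
illustration, and it assumes log lines of the form
"(daN:mprN:bus:target:lun): CAM status: Command timeout" as seen above.
Where kernel output is interleaved mid-line, as in parts of the quoted
dmesg, a timeout may be split up and not counted.

```sh
#!/bin/sh
# Hypothetical helper: count "Command timeout" events per da(4) device.
# Reads kernel log text on stdin, prints "<device> <count>" per line.
count_cam_timeouts() {
    # Keep only the "(daN:mprN:b:t:l): CAM status: Command timeout" parts,
    # even when several messages share one physical line.
    grep -oE '\(da[0-9]+:mpr[0-9]+:[0-9]+:[0-9]+:[0-9]+\): CAM status: Command timeout' |
        # Reduce each match to the device name, e.g. "da11".
        sed -E 's/^\((da[0-9]+):.*/\1/' |
        sort |
        uniq -c |
        # uniq -c prints "count device"; swap to "device count".
        awk '{ print $2, $1 }'
}
```

It could be fed from the kernel message buffer, for example with
dmesg | count_cam_timeouts, and run once with vfs.zfs.trim.enabled set
to 1 and once with it set to 0 to compare the counts.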