Date:      Mon, 13 Jul 2015 11:01:48 +0200
From:      Yamagi Burmeister <lists@yamagi.org>
To:        freebsd-scsi@freebsd.org
Subject:   Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
Message-ID:  <20150713110148.1a27b973881b64ce2f9e98e0@yamagi.org>
In-Reply-To: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>
References:  <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>

Hello,
after some fiddling and testing I managed to track this down. TRIM is
the culprit:

- With vfs.zfs.trim.enabled set to 1, timeouts occur. This holds
  regardless of cabling and regardless of whether a backplane or a
  direct connection is used. It doesn't matter whether Intel DC S3500
  or S3700 SSDs are connected, but on the other hand both share the
  same controller. I don't have enough onboard S-ATA ports to test the
  whole setup without the 9300-8i HBA, but a short (maybe too short
  and without enough load) test with 6 SSDs didn't show any timeouts.

- With vfs.zfs.trim.enabled set to 0 I haven't seen a single timeout
  for ~56 hours. (A sketch for disabling TRIM persistently follows
  below.)
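
For reference, this is roughly how I keep TRIM off for now. Note that
vfs.zfs.trim.enabled is, to my knowledge, a loader tunable on 10.x, so
it belongs in /boot/loader.conf rather than /etc/sysctl.conf; the
kstat counters are just a way to double-check that no TRIM requests
are being issued anymore:

----

# /boot/loader.conf: disable ZFS TRIM at boot
vfs.zfs.trim.enabled="0"

# After a reboot: verify the tunable and watch the TRIM statistics.
# With TRIM disabled the zio_trim counters should stay at 0.
sysctl vfs.zfs.trim.enabled
sysctl kstat.zfs.misc.zio_trim

----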

Regards,
Yamagi

On Tue, 7 Jul 2015 13:24:16 +0200
Yamagi Burmeister <lists@yamagi.org> wrote:

> Hello,
> I've got 3 new Supermicro servers based on the X10DRi-LN4+ platform.
> Each server is equipped with 2 LSI SAS9300-8i-SGL SAS adapters. Each
> adapter serves 8 Intel DC S3700 SSDs. The operating system is
> 10.1-STABLE as of r283938 on 2 servers and r285196 on the last one.
> 
> The controllers identify themselves as:
> 
> ----
> 
> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
> mpr0: IOCFacts  : MsgVersion: 0x205
>         HeaderVersion: 0x2300
>         IOCNumber: 0
>         IOCExceptions: 0x0
>         MaxChainDepth: 128
>         NumberOfPorts: 1
>         RequestCredit: 10240
>         ProductID: 0x2221
>         IOCRequestFrameSize: 32
>         MaxInitiators: 32
>         MaxTargets: 1024
>         MaxSasExpanders: 42
>         MaxEnclosures: 43
>         HighPriorityCredit: 128
>         MaxReplyDescriptorPostQueueDepth: 65504
>         ReplyFrameSize: 32
>         MaxVolumes: 0
>         MaxDevHandle: 1106
>         MaxPersistentEntries: 128
> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
> mpr0: IOCCapabilities:
> 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
> 
> ----
> 
> 08.00.00.00 is the latest available firmware.
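> 
> For completeness, the running firmware/driver combination can also be
> cross-checked at runtime. A small sketch; dev.mpr.X.firmware_version
> and dev.mpr.X.driver_version are the sysctl names I'd expect from the
> closely related mps(4) driver, so treat them as an assumption:
> 
> ----
> 
> # Firmware and driver version as seen by each mpr(4) instance
> sysctl dev.mpr.0.firmware_version dev.mpr.0.driver_version
> sysctl dev.mpr.1.firmware_version dev.mpr.1.driver_version
> 
> ----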
> 
> 
> Since day one, 'dmesg' has been cluttered with CAM errors:
> 
> ----
> 
> mpr1: Sending reset from mprsas_send_abort for target ID 5
> (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 terminated ioc 804b scsi 0 state c xfer 0
> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> (da11:mpr1:0:5:0): CAM status: Command timeout
> mpr1: Unfreezing devq for target ID 5
> (da11:mpr1:0:5:0): Retrying command
> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> (da11:mpr1:0:5:0): SCSI status: Check Condition
> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da11:mpr1:0:5:0): Retrying command (per sense data)
> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
> (da11:mpr1:0:5:0): SCSI status: Check Condition
> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da11:mpr1:0:5:0): Retrying command (per sense data)
> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
> 
> mpr1: Sending reset from mprsas_send_abort for target ID 2
> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
> mpr1: Unfreezing devq for target ID 2
> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> (da8:mpr1:0:2:0): CAM status: Command timeout
> (da8:mpr1:0:2:0): Retrying command
> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> (da8:mpr1:0:2:0): SCSI status: Check Condition
> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da8:mpr1:0:2:0): Retrying command (per sense data)
> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> (da8:mpr1:0:2:0): SCSI status: Check Condition
> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da8:mpr1:0:2:0): Retrying command (per sense data)
> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
> 
> ----
> 
> ZFS doesn't like this and sees read errors or even write errors. In
> extreme cases a device is marked as FAULTED (a recovery sketch
> follows the output):
> 
> ----
> 
>   pool: examplepool
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
> 	Sufficient replicas exist for the pool to continue functioning in a
> 	degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the
> 	device repaired.
>   scan: none requested
> config:
> 
> 	NAME        STATE     READ WRITE CKSUM
> 	examplepool DEGRADED     0     0     0
> 	  raidz1-0  ONLINE       0     0     0
> 	    da3p1   ONLINE       0     0     0
> 	    da4p1   ONLINE       0     0     0
> 	    da5p1   ONLINE       0     0     0
> 	logs
> 	  da1p1     FAULTED      3     0     0  too many errors
> 	cache
> 	  da1p2     FAULTED      3     0     0  too many errors
> 	spares
> 	  da2p1     AVAIL   
> 
> errors: No known data errors
> 
> ----
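> 
> As the status output suggests, the pool can be nursed back once the
> timeouts stop. A minimal sketch, using the pool and device names from
> the example above:
> 
> ----
> 
> # Reset the error counters on all devices of the pool
> zpool clear examplepool
> 
> # If a log/cache device stays FAULTED, bring it back explicitly
> zpool online examplepool da1p1
> 
> ----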
> 
> The problems arise on all 3 machines and on all SSDs nearly daily, so
> I highly suspect a software issue. Does anyone have an idea what's
> going on and what I can do to solve these problems? More information
> can be provided if necessary.
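> 
> For example, I could raise the driver's debug level and capture dmesg
> around the next timeout. A sketch, assuming mpr(4) exposes the same
> debug_level sysctl as mps(4):
> 
> ----
> 
> # Raise mpr(4) logging on the second adapter (a bitmask; see the
> # driver source for the individual flags)
> sysctl dev.mpr.1.debug_level=0x1f
> 
> # Snapshot of the attached devices for reference
> camcontrol devlist
> 
> ----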
> 
> Regards,
> Yamagi
> 
> -- 
> Homepage:  www.yamagi.org
> XMPP:      yamagi@yamagi.org
> GnuPG/GPG: 0xEFBCCBCB


-- 
Homepage:  www.yamagi.org
XMPP:      yamagi@yamagi.org
GnuPG/GPG: 0xEFBCCBCB


