From owner-freebsd-scsi@freebsd.org  Mon Jul 13 09:13:44 2015
Subject: Re: Device timeouts(?) with LSI SAS3008 on mpr(4)
To: freebsd-scsi@freebsd.org
References: <20150707132416.71b44c90f7f4cd6014a304b2@yamagi.org>
 <20150713110148.1a27b973881b64ce2f9e98e0@yamagi.org>
From: Steven Hartland <killing@multiplay.co.uk>
Message-ID: <55A3813C.7010002@multiplay.co.uk>
Date: Mon, 13 Jul 2015 10:13:32 +0100
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:38.0) Gecko/20100101
 Thunderbird/38.0.1
MIME-Version: 1.0
In-Reply-To: <20150713110148.1a27b973881b64ce2f9e98e0@yamagi.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit

That would indicate that TRIM on your disks is causing a problem, 
possibly a firmware bug causing TRIM requests to take an excessively 
long time to complete.
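
The ATA COMMAND PASS THROUGH(16) entries in the quoted dmesg output fit
that picture. As a rough sketch, assuming the usual SAT layout of that CDB
(byte 14 is the ATA command, byte 4 the low FEATURES byte), the timed-out
command can be decoded like this. decode_ata16 is a hypothetical helper
written for this mail, not an existing tool:

```shell
#!/bin/sh
# decode_ata16: hypothetical helper, not part of FreeBSD.
# Takes the 16 CDB bytes of an ATA PASS-THROUGH(16) command as hex
# arguments, exactly as they are printed in the CAM error messages.
decode_ata16() {
    if [ "$#" -ne 16 ] || [ "$1" != "85" ]; then
        echo "not an ATA PASS-THROUGH(16) CDB" >&2
        return 1
    fi
    features=$5    # CDB byte 4: FEATURES (7:0)
    command=${15}  # CDB byte 14: ATA COMMAND
    if [ "$command" = "06" ]; then
        name="DATA SET MANAGEMENT"
        # Bit 0 of FEATURES set means the request is a TRIM.
        if [ $(( 0x$features & 1 )) -eq 1 ]; then
            name="$name (TRIM)"
        fi
    else
        name="other"
    fi
    echo "command=0x$command features=0x$features $name"
}

# CDB copied from the quoted log
decode_ata16 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
```

If the layout assumption holds, the command that hits "CAM status: Command
timeout" in both log excerpts is a TRIM request.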

What do you see from:
sysctl -a | grep -E '(delete|trim)'

Also, while you're seeing timeouts, what does the output from gstat -d -p
look like?
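
Something along these lines could capture both outputs in one go while the
timeouts are happening. It is only a sketch: OUTDIR is an arbitrary,
made-up location, and the -b (batch) flag of gstat is assumed to be
available on your release, so please check gstat(8) first:

```shell
#!/bin/sh
# Sketch only: save the two outputs requested above into files.
# OUTDIR is an arbitrary, hypothetical location.
OUTDIR=${OUTDIR:-/tmp/trim-debug}

# Same filter as the one-liner above, reading sysctl output on stdin.
filter_trim() {
    grep -E '(delete|trim)'
}

collect() {
    mkdir -p "$OUTDIR" || return 1
    sysctl -a | filter_trim > "$OUTDIR/sysctl-trim.txt"
    # -d adds delete (TRIM) statistics, -p limits output to physical
    # providers. -b (print once and exit) is an assumption, see gstat(8).
    gstat -d -p -b > "$OUTDIR/gstat.txt"
}

# Run "collect" by hand while the timeouts are being logged.
```

Nothing is executed until collect is called, so the file can be sourced
safely beforehand.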

     Regards
     Steve

On 13/07/2015 10:01, Yamagi Burmeister wrote:
> Hello,
> after some fiddling and testing I managed to track this down. TRIM is
> the culprit:
>
> - With vfs.zfs.trim.enabled set to 1, timeouts occur regardless of
>    cabling, whether through a backplane or a direct connection. It
>    doesn't matter whether Intel DC S3500 or S3700 SSDs are connected,
>    but on the other hand both share the same controller. I don't have
>    enough onboard SATA ports to test the whole setup without the
>    9300-8i HBA, but a short test with 6 SSDs (maybe too short and
>    without enough load) didn't show any timeouts.
>
> - With vfs.zfs.trim.enabled set to 0, I haven't seen a single timeout
>    for ~56 hours.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 13:24:16 +0200
> Yamagi Burmeister <lists@yamagi.org> wrote:
>
>> Hello,
>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
>> adapter serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE
>> as of r283938 on 2 servers and r285196 on the last one.
>>
>> The controllers identify themselves as:
>>
>> ----
>>
>> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem
>> 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on pci2
>> mpr0: IOCFacts  : MsgVersion: 0x205
>>          HeaderVersion: 0x2300
>>          IOCNumber: 0
>>          IOCExceptions: 0x0
>>          MaxChainDepth: 128
>>          NumberOfPorts: 1
>>          RequestCredit: 10240
>>          ProductID: 0x2221
>>          IOCRequestFrameSize: 32
>>          MaxInitiators: 32
>>          MaxTargets: 1024
>>          MaxSasExpanders: 42
>>          MaxEnclosures: 43
>>          HighPriorityCredit: 128
>>          MaxReplyDescriptorPostQueueDepth: 65504
>>          ReplyFrameSize: 32
>>          MaxVolumes: 0
>>          MaxDevHandle: 1106
>>          MaxPersistentEntries: 128
>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
>> mpr0: IOCCapabilities:
>> 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>>
>> ----
>>
>> 08.00.00.00 is the latest available firmware.
>>
>>
>> Since day one 'dmesg' is cluttered with CAM errors:
>>
>> ----
>>
>> mpr1: Sending reset from mprsas_send_abort for target ID 5
>>          (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0 state c xfer 0
>> (da11:mpr1:0:5:0): CAM status: Command timeout
>> mpr1: (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
>> (da11:mpr1:0:5:0): SCSI status: Check Condition
>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da11:mpr1:0:5:0): Retrying command (per sense data)
>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00
>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error
>> (da11:mpr1:0:5:0): SCSI status: Check Condition
>> (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da11:mpr1:0:5:0): Retrying command (per sense data)
>> (noperiph:mpr1:0:4294967295:0): SMID 2 Aborting command 0xfffffe0001601a30
>>
>> mpr1: Sending reset from mprsas_send_abort for target ID 2
>>          (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00 length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0
>> mpr1: Unfreezing devq for target ID 2
>> (da8:mpr1:0:2:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
>> (da8:mpr1:0:2:0): CAM status: Command timeout
>> (da8:mpr1:0:2:0): Retrying command
>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>> (da8:mpr1:0:2:0): SCSI status: Check Condition
>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>> (da8:mpr1:0:2:0): SCSI status: Check Condition
>> (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command 0xfffffe000160b660
>>
>> ----
>>
>> ZFS doesn't like this and sees read errors or even write errors. In
>> extreme cases the device is marked as FAULTED:
>>
>> ----
>>
>>    pool: examplepool
>>   state: DEGRADED
>> status: One or more devices are faulted in response to persistent
>> errors. Sufficient replicas exist for the pool to continue functioning
>> in a degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the
>> device repaired.
>>    scan: none requested
>> config:
>>
>> 	NAME        STATE     READ WRITE CKSUM
>> 	examplepool DEGRADED     0     0     0
>> 	  raidz1-0  ONLINE       0     0     0
>> 	    da3p1   ONLINE       0     0     0
>> 	    da4p1   ONLINE       0     0     0
>> 	    da5p1   ONLINE       0     0     0
>> 	logs
>> 	  da1p1     FAULTED      3     0     0  too many errors
>> 	cache
>> 	  da1p2     FAULTED      3     0     0  too many errors
>> 	spares
>> 	  da2p1     AVAIL
>>
>> errors: No known data errors
>>
>> ----
>>
>> The problems arise on all 3 machines and on all SSDs nearly daily, so I
>> strongly suspect a software issue. Does anyone have an idea what's going
>> on and what I can do to solve these problems? More information can be
>> provided if necessary.
>>
>> Regards,
>> Yamagi
>>
>> -- 
>> Homepage:  www.yamagi.org
>> XMPP:      yamagi@yamagi.org
>> GnuPG/GPG: 0xEFBCCBCB
>