From owner-freebsd-scsi@freebsd.org Mon Jul 13 09:13:44 2015
Subject: Re: Device timeouts(?)
with LSI SAS3008 on mpr(4)
To: freebsd-scsi@freebsd.org
From: Steven Hartland
Date: Mon, 13 Jul 2015 10:13:32 +0100

That would indicate that TRIM on your disks is causing a problem,
possibly a firmware bug causing TRIM requests to take an excessively
long time to complete.

What do you see from:
sysctl -a | grep -E '(delete|trim)'

Also, while you're seeing time-outs, what does the output from:
gstat -d -p
look like?

Regards
Steve

On 13/07/2015 10:01, Yamagi Burmeister wrote:
> Hello,
> after some fiddling and testing I managed to track this down. TRIM is
> the culprit:
>
> - With vfs.zfs.trim.enabled set to 1 timeouts occur, regardless of
> cabling, backplane or direct connection. It doesn't matter if
> Intel DC S3500 or S3700 SSDs are connected, but on the other hand
> both share the same controller. I don't have enough onboard S-ATA
> ports to test the whole setup without the 9300-8i HBA, but a short
> (maybe too short and without enough load) test with 6 SSDs didn't show
> any timeouts.
>
> - With vfs.zfs.trim.enabled set to 0 I haven't seen a single timeout
> for ~56 hours.
>
> Regards,
> Yamagi
>
> On Tue, 7 Jul 2015 13:24:16 +0200
> Yamagi Burmeister wrote:
>
>> Hello,
>> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
>> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters.
>> Each adapter serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE
>> as of r283938 on 2 servers and r285196 on the last one.
>>
>> The controllers identify themselves as:
>>
>> ----
>>
>> mpr0: port 0x6000-0x60ff mem
>> 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on
>> pci2 mpr0: IOCFacts : MsgVersion: 0x205
>> HeaderVersion: 0x2300
>> IOCNumber: 0
>> IOCExceptions: 0x0
>> MaxChainDepth: 128
>> NumberOfPorts: 1
>> RequestCredit: 10240
>> ProductID: 0x2221
>> IOCRequestFrameSize: 32
>> MaxInitiators: 32
>> MaxTargets: 1024
>> MaxSasExpanders: 42
>> MaxEnclosures: 43
>> HighPriorityCredit: 128
>> MaxReplyDescriptorPostQueueDepth: 65504
>> ReplyFrameSize: 32
>> MaxVolumes: 0
>> MaxDevHandle: 1106
>> MaxPersistentEntries: 128
>> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
>> mpr0: IOCCapabilities:
>> 7a85c
>>
>> ----
>>
>> 08.00.00.00 is the latest available firmware.
>>
>>
>> Since day one 'dmesg' is cluttered with CAM errors:
>>
>> ----
>>
>> mpr1: Sending reset from mprsas_send_abort for target ID 5
>> (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08
>> 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
>> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00
>> 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0):
>> READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0
>> state c xfer 0 (da11:mpr1:0:5:0): CAM status: Command timeout mpr1:
>> (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
>> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
>> (da11:mpr1:0:5:0): CAM status: SCSI Status Error (da11:mpr1:0:5:0):
>> SCSI status: Check Condition (da11:mpr1:0:5:0): SCSI sense: UNIT
>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da11:mpr1:0:5:0): Retrying command (per sense data) (da11:mpr1:0:5:0):
>> READ(10).
>> CDB: 28 00 4c 22 b5 b8 00 00 18 00 (da11:mpr1:0:5:0): CAM
>> status: SCSI Status Error (da11:mpr1:0:5:0): SCSI status: Check
>> Condition (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power
>> on, reset, or bus device reset occurred) (da11:mpr1:0:5:0): Retrying
>> command (per sense data) (noperiph:mpr1:0:4294967295:0): SMID 2
>> Aborting command 0xfffffe0001601a30
>>
>> mpr1: Sending reset from mprsas_send_abort for target ID 2
>> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
>> length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length
>> 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0 mpr1:
>> Unfreezing devq for target ID 2 (da8:mpr1:0:2:0): ATA COMMAND PASS
>> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
>> (da8:mpr1:0:2:0): CAM status: Command timeout (da8:mpr1:0:2:0):
>> Retrying command (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00
>> 00 30 00 (da8:mpr1:0:2:0): CAM status: SCSI Status Error
>> (da8:mpr1:0:2:0): SCSI status: Check Condition (da8:mpr1:0:2:0): SCSI
>> sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset
>> occurred) (da8:mpr1:0:2:0): Retrying command (per sense data)
>> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
>> (da8:mpr1:0:2:0): CAM status: SCSI Status Error (da8:mpr1:0:2:0): SCSI
>> status: Check Condition (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION
>> asc:29,0 (Power on, reset, or bus device reset occurred)
>> (da8:mpr1:0:2:0): Retrying command (per sense data)
>> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command
>> 0xfffffe000160b660
>>
>> ----
>>
>> ZFS doesn't like this and sees read errors or even write errors. In
>> extreme cases the device is marked as FAULTED:
>>
>> ----
>>
>> pool: examplepool
>> state: DEGRADED
>> status: One or more devices are faulted in response to persistent
>> errors.
>> Sufficient replicas exist for the pool to continue functioning
>> in a degraded state.
>> action: Replace the faulted device, or use 'zpool clear' to mark the
>> device repaired.
>> scan: none requested
>> config:
>>
>>     NAME         STATE     READ WRITE CKSUM
>>     examplepool  DEGRADED     0     0     0
>>       raidz1-0   ONLINE       0     0     0
>>         da3p1    ONLINE       0     0     0
>>         da4p1    ONLINE       0     0     0
>>         da5p1    ONLINE       0     0     0
>>     logs
>>       da1p1      FAULTED      3     0     0  too many errors
>>     cache
>>       da1p2      FAULTED      3     0     0  too many errors
>>     spares
>>       da2p1      AVAIL
>>
>> errors: No known data errors
>>
>> ----
>>
>> The problems arise on all 3 machines and on all SSDs nearly daily. So I
>> highly suspect a software issue. Does anyone have an idea what's going
>> on and what I can do to solve these problems? More information can be
>> provided if necessary.
>>
>> Regards,
>> Yamagi
>>
>> --
>> Homepage: www.yamagi.org
>> XMPP: yamagi@yamagi.org
>> GnuPG/GPG: 0xEFBCCBCB
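
To compare how often each disk times out with TRIM enabled and disabled, the "CAM status: Command timeout" reports in the dmesg output quoted above could be tallied per device. The following is only a minimal Python sketch and not something posted in this thread: the function name count_timeouts and the regular expression are assumptions derived from the log lines shown, and interleaved kernel messages like the ones quoted may split a report so that it is not counted.

```python
import re
from collections import Counter

# Assumed pattern, modelled on quoted lines such as
# "(da11:mpr1:0:5:0): CAM status: Command timeout".
TIMEOUT_RE = re.compile(
    r"\((da\d+):mpr\d+:\d+:\d+:\d+\): CAM status: Command timeout"
)


def count_timeouts(log_text: str) -> Counter:
    """Return how many 'Command timeout' reports each da device has."""
    return Counter(match.group(1) for match in TIMEOUT_RE.finditer(log_text))


# Small illustration using lines in the same shape as the quoted dmesg output.
sample_log = (
    "(da11:mpr1:0:5:0): CAM status: Command timeout\n"
    "(da8:mpr1:0:2:0): CAM status: Command timeout\n"
    "(da8:mpr1:0:2:0): CAM status: SCSI Status Error\n"
)
for device, hits in sorted(count_timeouts(sample_log).items()):
    print(f"{device}: {hits}")
```

Running the same function over saved dmesg output from a period with vfs.zfs.trim.enabled=1 and another with it set to 0 would give a rough per-disk comparison.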