From owner-freebsd-scsi@freebsd.org Mon Oct 24 20:46:40 2016
From: Steven Hartland <killing@multiplay.co.uk>
Date: Mon, 24 Oct 2016 21:47:08 +0100
To: freebsd-scsi@freebsd.org
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
In-Reply-To: <2f8e4a03-8a1f-ead5-3886-75639ab7a68f@mindpackstudios.com>
Thanks for the update, good to know!

On 24/10/2016 19:39, list-news wrote:
> Someone on the list contacted me directly and asked if I had ever come to
> a conclusion on the SAS 3008 issues I was finding back in June. Firstly,
> my apologies to everyone: I had forgotten to get back to the list with my
> findings.
>
> Eventually I ended up cross-flashing the card to LSI 9300-8i firmware
> version 12. Cross-flashing obviously wasn't supported by SuperMicro, but
> I was out of options and they didn't have a v12 firmware available
> directly.
>
> This v12 firmware ran stably, and the drives stopped disconnecting (and,
> subsequently, zfs was no longer showing errors at 100% io for 24+ hours).
> I posted my findings to SuperMicro, and they quickly released a v12
> firmware for the SAS 3008 HBA, which, as I understand it, is now publicly
> available on their site.
>
> The story is long and painful, but in the end the answer is to absolutely
> make sure you are running v12 firmware on the SAS 3008 cards. If you are
> not running v12 (or later) firmware on the SAS 3008, your data is not
> safe! (At least with Intel drives, probably others as well.)
>
> Thanks again for all the help/ideas from everyone. It's been a few months
> now with zero errors reported!
>
> -Kyle
>
>
> On 6/7/16 6:39 PM, David Gwynne wrote:
>>> On 8 Jun 2016, at 09:30, Steven Hartland wrote:
>>>
>>> Oh, another thing to test: iirc the 3008 is supported by mrsas, so you
>>> might want to try adding the following to loader.conf to switch drivers:
>>> hw.mfi.mrsas_enable="1"
>> i believe the 3008s can run two different firmwares, one that provides
>> the mpt2 interface and the other that provides the megaraid sas fusion
>> interface. you have to flash them to switch though, you can't just point
>> a driver at it and hope for the best.
>>
>> each fw presents different pci ids. eg, in
>> http://pciids.sourceforge.net/v2.2/pci.ids you can see:
>>
>> 005f  MegaRAID SAS-3 3008 [Fury]
>> 0097  SAS3008 PCI-Express Fusion-MPT SAS-3
>>
>> dlg
>>
>>> On 07/06/2016 23:43, list-news wrote:
>>>> No, it threw errors on both da6 and da7, and then I stopped it.
>>>>
>>>> Your last e-mail gave me thoughts though. I have a server with 2008
>>>> controllers (entirely different backplane design, cpu, memory, etc).
>>>> I've moved the 4 drives to that and I'm running the test now.
>>>>
>>>> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
>>>> # sysctl dev.mps.0
>>>> dev.mps.0.spinup_wait_time: 3
>>>> dev.mps.0.chain_alloc_fail: 0
>>>> dev.mps.0.enable_ssu: 1
>>>> dev.mps.0.max_chains: 2048
>>>> dev.mps.0.chain_free_lowwater: 1176
>>>> dev.mps.0.chain_free: 2048
>>>> dev.mps.0.io_cmds_highwater: 510
>>>> dev.mps.0.io_cmds_active: 0
>>>> dev.mps.0.driver_version: 20.00.00.00-fbsd
>>>> dev.mps.0.firmware_version: 17.00.01.00
>>>> dev.mps.0.disable_msi: 0
>>>> dev.mps.0.disable_msix: 0
>>>> dev.mps.0.debug_level: 3
>>>> dev.mps.0.%parent: pci5
>>>> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 subdevice=0x3020 class=0x010700
>>>> dev.mps.0.%location: slot=0 function=0
>>>> dev.mps.0.%driver: mps
>>>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>>>>
>>>> About 1.5 hours has passed at full load, no errors.
>>>>
>>>> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.
>>>> Overall throughput seems to be 20-30% less with my rough benchmarks.
>>>>
>>>> I'm not sure if this gets us closer to the answer, but if this doesn't
>>>> time out on the 2008 controller, it looks like one of these:
>>>> 1) The Intel drive firmware is being overloaded somehow when connected
>>>> to the 3008.
>>>> or
>>>> 2) The 3008 firmware or driver has an issue reading drive responses,
>>>> sporadically thinking the command timed out (when maybe it really didn't).
>>>>
>>>> Puzzle pieces:
>>>> A) Why does setting tags of 1 on drives connected to the 3008 fix the
>>>> problem?
>>>> B) With tags of 255, why does postgres (and, I assume, a large fsync
>>>> count) seem to cause the problem within minutes, while other heavy i/o
>>>> commands (zpool scrub, bonnie++, fio), all of which show similarly high
>>>> or higher iops, take hours to cause the problem (if ever)?
>>>>
>>>> I'll let this continue to run to further test.
>>>>
>>>> Thanks again for all the help.
>>>>
>>>> -Kyle
>>>>
>>>> On 6/7/16 4:22 PM, Steven Hartland wrote:
>>>>> Always da6?
>>>>>
>>>>> On 07/06/2016 21:19, list-news wrote:
>>>>>> Sure Steve:
>>>>>>
>>>>>> # cat /boot/loader.conf | grep trim
>>>>>> vfs.zfs.trim.enabled=0
>>>>>>
>>>>>> # sysctl vfs.zfs.trim.enabled
>>>>>> vfs.zfs.trim.enabled: 0
>>>>>>
>>>>>> # uptime
>>>>>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>>>>
>>>>>> # tail -f /var/log/messages:
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 allocated tm 0xfffffe0001322150
>>>>>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001375580
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name (    )
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun 7 15:13:50 s18 kernel: Flags: 1
>>>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>>>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>>>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 finished recovery after aborting TaskMID 1016
>>>>>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>>>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4
>>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>>>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>>>> Jun 7 15:13:50 s18 kernel: NumPhys: 31
>>>>>> Jun 7 15:13:50 s18 kernel: NumEntries: 1
>>>>>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8
>>>>>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on Event: [16]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on Event: [1c]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun 7 15:13:50 s18 kernel: Flags: 0
>>>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on Event: [16]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
>>>>>>
>>>>>> -Kyle
>>>>>>
>>>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>>>> CDB: 85 is a TRIM command IIRC. I know you tried it before using BIO
>>>>>>> delete, but assuming you're running ZFS, can you set the following in
>>>>>>> loader.conf and see how you get on:
>>>>>>> vfs.zfs.trim.enabled=0
>>>>>>>
>>>>>>> Regards
>>>>>>> Steve
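
For reference, the checks and workarounds discussed in this thread map onto a few commands. Treat these as a sketch rather than a tested recipe: the sas3flash utility and its exact flags are an assumption (they vary between tool versions), and da6 is only an example device.

To confirm which firmware the mpr(4) driver sees on a SAS3008 (it should report a 12.x or later version once the v12 firmware is installed):

# sysctl dev.mpr.0.firmware_version

To query the controller itself, assuming Avago's sas3flash utility is installed and the HBA is controller 0:

# sas3flash -c 0 -list

To apply the "tags of 1" workaround on a single drive, i.e. cap the number of simultaneous commands CAM will queue to it, and then verify the device openings:

# camcontrol tags da6 -N 1
# camcontrol tags da6 -v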