From owner-freebsd-scsi@freebsd.org Mon Oct 24 18:47:05 2016
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
From: list-news
Message-ID: <2f8e4a03-8a1f-ead5-3886-75639ab7a68f@mindpackstudios.com>
Date: Mon, 24 Oct 2016 13:39:24 -0500
In-Reply-To: <4C234AB2-80E5-49A3-B5BB-24F425AFF067@gwynne.id.au>

Someone on the list contacted me directly and asked if I had ever come to a conclusion on the SAS 3008 issues I was finding back in June. Firstly, my apologies to everyone, I had forgotten to get back to the list on my findings.

Eventually I ended up cross-flashing the card to LSI 9300-8i firmware version 12. Cross-flashing obviously wasn't supported by SuperMicro, but I was without options and they didn't have a v12 firmware available directly.

This v12 firmware was stable, and the drives stopped disconnecting (and subsequently, ZFS was no longer showing errors at 100% I/O for 24+ hours). I posted my findings to SuperMicro, and they quickly released a v12 firmware for the SAS 3008 HBA, which as I understand is now publicly available on their site.

The story is long, and painful, but in the end the answer is to absolutely make sure you are running v12 firmware on the SAS 3008 cards. If you are not running v12 (or later) firmware on the SAS 3008, your data is unstable! (at least with Intel drives, probably others as well)

Thanks again for all the help/ideas from everyone. It's been a few months now with zero errors reported!
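For anyone who wants to check a controller against the "v12 or later" advice above, here is a minimal sketch in Python. It only parses a dotted version string of the kind shown in the `dev.mps.0.firmware_version` line quoted below. The idea that the mpr driver reports the same format under a name like `dev.mpr.0.firmware_version` is an assumption by analogy, not something confirmed in this thread, and the sample strings are hypothetical.

```python
import re

# Minimum firmware major version recommended in this message.
MIN_MAJOR = 12


def firmware_major(version: str) -> int:
    """Return the leading (major) field of a dotted version such as '12.00.00.00'."""
    match = re.match(r"\s*(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised firmware version: {version!r}")
    return int(match.group(1))


def meets_minimum(version: str, minimum: int = MIN_MAJOR) -> bool:
    """True when the major field is at least the recommended minimum."""
    return firmware_major(version) >= minimum


if __name__ == "__main__":
    # Hypothetical sample strings, not read from a real controller.
    for sample in ("12.00.00.00", "10.00.03.00"):
        print(sample, "ok" if meets_minimum(sample) else "too old")
```

On a live system the string would presumably come from `sysctl -n` on the controller's firmware_version node. Note that the comparison only makes sense within one controller family: the 17.00.01.00 reported by the SAS2008 below belongs to a different firmware line.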
-Kyle

On 6/7/16 6:39 PM, David Gwynne wrote:
>> On 8 Jun 2016, at 09:30, Steven Hartland wrote:
>>
>> Oh another thing to test is iirc 3008 is supported by mrsas so you might want to try adding the following into loader.conf to switch drivers:
>> hw.mfi.mrsas_enable="1"
> i believe the 3008s can run two different firmwares, one that provides the mpt2 interface and the other that provides the megaraid sas fusion interface. you have to flash them to switch though, you can't just point a driver at it and hope for the best.
>
> each fw presents different pci ids. eg, in http://pciids.sourceforge.net/v2.2/pci.ids you can see:
>
> 005f MegaRAID SAS-3 3008 [Fury]
> 0097 SAS3008 PCI-Express Fusion-MPT SAS-3
>
> dlg
>
>> On 07/06/2016 23:43, list-news wrote:
>>> No, it threw errors on both da6 and da7 and then I stopped it.
>>>
>>> Your last e-mail gave me thoughts though. I have a server with 2008 controllers (entirely different backplane design, cpu, memory, etc). I've moved the 4 drives to that and I'm running the test now.
>>>
>>> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
>>> # sysctl dev.mps.0
>>> dev.mps.0.spinup_wait_time: 3
>>> dev.mps.0.chain_alloc_fail: 0
>>> dev.mps.0.enable_ssu: 1
>>> dev.mps.0.max_chains: 2048
>>> dev.mps.0.chain_free_lowwater: 1176
>>> dev.mps.0.chain_free: 2048
>>> dev.mps.0.io_cmds_highwater: 510
>>> dev.mps.0.io_cmds_active: 0
>>> dev.mps.0.driver_version: 20.00.00.00-fbsd
>>> dev.mps.0.firmware_version: 17.00.01.00
>>> dev.mps.0.disable_msi: 0
>>> dev.mps.0.disable_msix: 0
>>> dev.mps.0.debug_level: 3
>>> dev.mps.0.%parent: pci5
>>> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 subdevice=0x3020 class=0x010700
>>> dev.mps.0.%location: slot=0 function=0
>>> dev.mps.0.%driver: mps
>>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>>>
>>> About 1.5 hours have passed at full load, no errors.
>>>
>>> gstat drive busy% seems to hang out around 30-40 instead of ~60-70. Overall throughput seems to be 20-30% less with my rough benchmarks.
>>>
>>> I'm not sure if this gets us closer to the answer. If this doesn't time out on the 2008 controller, it looks like one of these:
>>> 1) The Intel drive firmware is being overloaded somehow when connected to the 3008.
>>> or
>>> 2) The 3008 firmware or driver has an issue reading drive responses, sporadically thinking the command timed out (when maybe it really didn't).
>>>
>>> Puzzle pieces:
>>> A) Why does setting tags of 1 on drives connected to the 3008 fix the problem?
>>> B) With tags of 255, why does postgres (presumably with a large fsync count) seem to cause the problem within minutes, while other heavy I/O commands (zpool scrub, bonnie++, fio), all of which show similarly high or higher IOPS, take hours to cause the problem (if ever)?
>>>
>>> I'll let this continue to run to further test.
>>>
>>> Thanks again for all the help.
>>>
>>> -Kyle
>>>
>>> On 6/7/16 4:22 PM, Steven Hartland wrote:
>>>> Always da6?
>>>>
>>>> On 07/06/2016 21:19, list-news wrote:
>>>>> Sure Steve:
>>>>>
>>>>> # cat /boot/loader.conf | grep trim
>>>>> vfs.zfs.trim.enabled=0
>>>>>
>>>>> # sysctl vfs.zfs.trim.enabled
>>>>> vfs.zfs.trim.enabled: 0
>>>>>
>>>>> # uptime
>>>>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>>>
>>>>> # tail -f /var/log/messages:
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 allocated tm 0xfffffe0001322150
>>>>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0001375580
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort for target ID 16
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, connector name ( )
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>> Jun 7 15:13:50 s18 kernel: Flags: 1
>>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi 0 state c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 state c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi 0 state c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 state c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi 0 state c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 state c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 finished recovery after aborting TaskMID 1016
>>>>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4
>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>>> Jun 7 15:13:50 s18 kernel: NumPhys: 31
>>>>> Jun 7 15:13:50 s18 kernel: NumEntries: 1
>>>>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8
>>>>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on Event: [16]
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on Event: [1c]
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>> Jun 7 15:13:50 s18 kernel: Flags: 0
>>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on Event: [16]
>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
>>>>>
>>>>> -Kyle
>>>>>
>>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO delete but assuming you're running ZFS can you set the following in loader.conf and see how you get on.
>>>>>> vfs.zfs.trim.enabled=0
>>>>>>
>>>>>> Regards
>>>>>> Steve
>>>>>
>>>>> _______________________________________________
>>>>> freebsd-scsi@freebsd.org mailing list
>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
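The CDB bytes in the log above can be decoded by hand, which helps when matching timed-out commands to on-disk regions. The following is a minimal Python sketch of that decoding for READ(10) and WRITE(10) only. The function name `decode_rw10` is made up for illustration, and the 512-byte logical block size is an assumption, although it agrees with the `length` values the driver printed (for example 0xb0 = 176 blocks and length 90112).

```python
def decode_rw10(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode a READ(10)/WRITE(10) CDB given as space-separated hex bytes.

    block_size is an assumed logical block size; it is not stated in the log.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10:
        raise ValueError("expected a 10-byte CDB")
    opcodes = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
    if cdb[0] not in opcodes:
        raise ValueError(f"not a READ(10)/WRITE(10) opcode: {cdb[0]:#04x}")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return {
        "op": opcodes[cdb[0]],
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * block_size,
    }


if __name__ == "__main__":
    # CDBs copied from the log lines quoted above.
    print(decode_rw10("2a 00 2b d8 86 50 00 00 b0 00"))
    print(decode_rw10("28 00 0b 43 a8 00 00 00 10 00"))
```

The byte counts computed this way match the `length 90112` and `length 8192` fields in the corresponding log lines, which is a quick sanity check on the assumed block size.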
From owner-freebsd-scsi@freebsd.org Mon Oct 24 19:38:50 2016
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
From: Ben RUBSON
In-Reply-To: <2f8e4a03-8a1f-ead5-3886-75639ab7a68f@mindpackstudios.com>
Date: Mon, 24 Oct 2016 21:38:45 +0200
Message-Id: <2788DFF9-4F23-43AB-8A80-25A14A7BBE6F@gmail.com>
To: freebsd-scsi@freebsd.org

> On 24 Oct 2016, at 20:39, list-news wrote:
>
> I posted my findings to SuperMicro, and they quickly released a v12 firmware for the SAS 3008 HBA, which as I understand is now publicly available on their site.
>
> The story is long, and painful, but in the end the answer is to absolutely make sure you are running v12 firmware on the SAS 3008 cards. If you are not running v12 (or later) firmware on the SAS 3008, your data is unstable! (at least with Intel drives, probably others as well)

Thank you for this interesting feedback.
We were talking about 3008 reliability vs 2008 a few days ago @freebsd-fs.

Reading your input, a best practice would also be to use Avago cards instead of rebranded ones, to benefit from the latest firmware versions.

Ben

From owner-freebsd-scsi@freebsd.org Mon Oct 24 20:46:40 2016
Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts
To: freebsd-scsi@freebsd.org
From: Steven Hartland
Date: Mon, 24 Oct 2016 21:47:08 +0100
In-Reply-To: <2f8e4a03-8a1f-ead5-3886-75639ab7a68f@mindpackstudios.com>

Thanks for the update, good to know!

On 24/10/2016 19:39, list-news wrote:
> Someone on the list contacted me directly and asked if I had ever come to a conclusion on the SAS 3008 issues I was finding back in June. Firstly, my apologies to everyone, I had forgotten to get back to the list on my findings.
>
> Eventually I ended up cross-flashing the card to LSI 9300-8i firmware version 12. Cross-flashing obviously wasn't supported by SuperMicro, but I was without options and they didn't have a v12 firmware available directly.
>
> This v12 firmware was stable, and the drives stopped disconnecting (and subsequently, ZFS was no longer showing errors at 100% I/O for 24+ hours). I posted my findings to SuperMicro, and they quickly released a v12 firmware for the SAS 3008 HBA, which as I understand is now publicly available on their site.
>
> The story is long, and painful, but in the end the answer is to absolutely make sure you are running v12 firmware on the SAS 3008 cards. If you are not running v12 (or later) firmware on the SAS 3008, your data is unstable!
(at least with intel drives, probably > others as well) > > Thanks again for all the help/ideas from everyone. It's been a few > months now with zero errors reporting! > > -Kyle > > > On 6/7/16 6:39 PM, David Gwynne wrote: >>> On 8 Jun 2016, at 09:30, Steven Hartland >>> wrote: >>> >>> Oh another thing to test is iirc 3008 is supported by mrsas so you >>> might want to try adding the following into loader.conf to switch >>> drivers: >>> hw.mfi.mrsas_enable="1" >> i believe the 3008s can run two different firmwares, one that >> provides the mpt2 interface and the other than provides the megaraid >> sas fusion interface. you have to flash them to switch though, you >> cant just point a driver at it and hope for the best. >> >> each fw presents different pci ids. eg, in >> http://pciids.sourceforge.net/v2.2/pci.ids you can see: >> >> 005f MegaRAID SAS-3 3008 [Fury] >> 0097 SAS3008 PCI-Express Fusion-MPT SAS-3 >> >> dlg >> >>> On 07/06/2016 23:43, list-news wrote: >>>> No, it threw errors on both da6 and da7 and then I stopped it. >>>> >>>> Your last e-mail gave me thoughts though. I have a server with >>>> 2008 controllers (entirely different backplane design, cpu, memory, >>>> etc). I've moved the 4 drives to that and I'm running the test now. 
>>>> >>>> # uname = FreeBSD 10.2-RELEASE-p12 #1 r296215 >>>> # sysctl dev.mps.0 >>>> dev.mps.0.spinup_wait_time: 3 >>>> dev.mps.0.chain_alloc_fail: 0 >>>> dev.mps.0.enable_ssu: 1 >>>> dev.mps.0.max_chains: 2048 >>>> dev.mps.0.chain_free_lowwater: 1176 >>>> dev.mps.0.chain_free: 2048 >>>> dev.mps.0.io_cmds_highwater: 510 >>>> dev.mps.0.io_cmds_active: 0 >>>> dev.mps.0.driver_version: 20.00.00.00-fbsd >>>> dev.mps.0.firmware_version: 17.00.01.00 >>>> dev.mps.0.disable_msi: 0 >>>> dev.mps.0.disable_msix: 0 >>>> dev.mps.0.debug_level: 3 >>>> dev.mps.0.%parent: pci5 >>>> dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000 >>>> subdevice=0x3020 class=0x010700 >>>> dev.mps.0.%location: slot=0 function=0 >>>> dev.mps.0.%driver: mps >>>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 >>>> >>>> About 1.5 hours has passed at full load, no errors. >>>> >>>> gstat drive busy% seems to hang out around 30-40 instead of >>>> ~60-70. Overall throughput seems to be 20-30% less with my rough >>>> benchmarks. >>>> >>>> I'm not sure if this gets us closer to the answer, if this doesn't >>>> time-out on the 2008 controller, it looks like one of these: >>>> 1) The Intel drive firmware is being overloaded somehow when >>>> connected to the 3008. >>>> or >>>> 2) The 3008 firmware or driver has an issue reading drive >>>> responses, sporadically thinking the command timed-out (when maybe >>>> it really didn't). >>>> >>>> Puzzle pieces: >>>> A) Why does setting tags of 1 on drives connected to the 3008 fix >>>> the problem? >>>> B) With tags of 255. Why does postgres (and assuming a large fsync >>>> count), seem to cause the problem within minutes? While running >>>> other heavy i/o commands (zpool scrub, bonnie++, fio), all of which >>>> show similarly high or higher iops take hours to cause the problem >>>> (if ever). >>>> >>>> I'll let this continue to run to further test. >>>> >>>> Thanks again for all the help. 
>>>> >>>> -Kyle >>>> >>>> On 6/7/16 4:22 PM, Steven Hartland wrote: >>>>> Always da6? >>>>> >>>>> On 07/06/2016 21:19, list-news wrote: >>>>>> Sure Steve: >>>>>> >>>>>> # cat /boot/loader.conf | grep trim >>>>>> vfs.zfs.trim.enabled=0 >>>>>> >>>>>> # sysctl vfs.zfs.trim.enabled >>>>>> vfs.zfs.trim.enabled: 0 >>>>>> >>>>>> # uptime >>>>>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 >>>>>> >>>>>> # tail -f /var/log/messages: >>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a >>>>>> 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout >>>>>> cm 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, >>>>>> handle(0x0010) >>>>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, >>>>>> connector name ( ) >>>>>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 >>>>>> allocated tm 0xfffffe0001322150 >>>>>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID >>>>>> 1 Aborting command 0xfffffe0001375580 >>>>>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from >>>>>> mprsas_send_abort for target ID 16 >>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE >>>>>> CACHE(10). 
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786
>>>>>> command timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800
>>>>>> target 16, handle(0x0010)
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8,
>>>>>> connector name ( )
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm
>>>>>> 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun 7 15:13:50 s18 kernel: Flags: 1
>>>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28
>>>>>> 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm
>>>>>> 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28
>>>>>> 00 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28
>>>>>> 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm
>>>>>> 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28
>>>>>> 00 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28
>>>>>> 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm
>>>>>> 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28
>>>>>> 00 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc
>>>>>> 804b scsi 0 state c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a
>>>>>> 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed
>>>>>> timedout cm 0xfffffe0001375580 ccb 0xfffff8039895f800 during
>>>>>> recovery ioc 8048 scsi 0 state c (da6:mpr0:0:16:0):
>>>>>> SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length
>>>>>> 0 SMID 786 completed timedout cm 0xfffffe(da6:mpr0:0:16:0):
>>>>>> WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>>>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800
>>>>>> during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status:
>>>>>> Command timeout
>>>>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE
>>>>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786
>>>>>> terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1
>>>>>> abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>>>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1
>>>>>> finished recovery after aborting TaskMID 1016
>>>>>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>>>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4
>>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>>>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>>>> Jun 7 15:13:50 s18 kernel: NumPhys: 31
>>>>>> Jun 7 15:13:50 s18 kernel: NumEntries: 1
>>>>>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8
>>>>>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>>>> Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working
>>>>>> on Event: [16]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event
>>>>>> Free: [16]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working
>>>>>> on Event: [1c]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event
>>>>>> Free: [1c]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
>>>>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2
>>>>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0
>>>>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0
>>>>>> Jun 7 15:13:50 s18 kernel: Flags: 0
>>>>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
>>>>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working
>>>>>> on Event: [16]
>>>>>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event
>>>>>> Free: [16]
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE
>>>>>> CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI
>>>>>> Status Error
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check
>>>>>> Condition
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT
>>>>>> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command
>>>>>> (per sense data)
>>>>>>
>>>>>> -Kyle
>>>>>>
>>>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using
>>>>>>> BIO delete but assuming your running ZFS can you set the
>>>>>>> following in loader.conf and see how you get on.
>>>>>>> vfs.zfs.trim.enabled=0
>>>>>>>
>>>>>>> Regards
>>>>>>> Steve
>>>>>>
>>>>>> _______________________________________________
>>>>>> freebsd-scsi@freebsd.org mailing list
>>>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>>>> To unsubscribe, send any mail to
>>>>>> "freebsd-scsi-unsubscribe@freebsd.org"

From owner-freebsd-scsi@freebsd.org Thu Oct 27 12:04:57 2016
Return-Path:
Delivered-To: freebsd-scsi@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 211FEC23C57 for ; Thu, 27 Oct 2016 12:04:57 +0000 (UTC) (envelope-from gothmog@confusticate.com)
Received: from mail.confusticate.com (unknown [IPv6:2001:470:e465:3::25]) by mx1.freebsd.org (Postfix) with ESMTP id E56951E6 for ; Thu, 27 Oct 2016
12:04:56 +0000 (UTC) (envelope-from gothmog@confusticate.com)
Received: from [IPv6:2001:470:e465:1:386a:f7a9:9f7e:88a7] (unknown [IPv6:2001:470:e465:1:386a:f7a9:9f7e:88a7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mail.confusticate.com (Postfix) with ESMTPSA id BF8CF132FF for ; Thu, 27 Oct 2016 08:04:48 -0400 (EDT)
From: Jeremy Beker
Content-Type: multipart/signed; boundary="Apple-Mail=_287FB176-443B-4ABB-9793-A87FD841F5AB"; protocol="application/pkcs7-signature"; micalg=sha1
Mime-Version: 1.0 (Mac OS X Mail 10.1 \(3251\))
Subject: FreeBSD 11.0 and LSI SAS3081E losing all devices
Message-Id:
Date: Thu, 27 Oct 2016 08:04:44 -0400
To: freebsd-scsi@freebsd.org
X-Mailer: Apple Mail (2.3251)
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on bree.confusticate.com
X-BeenThere: freebsd-scsi@freebsd.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id: SCSI subsystem
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Thu, 27 Oct 2016 12:04:57 -0000

--Apple-Mail=_287FB176-443B-4ABB-9793-A87FD841F5AB
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=us-ascii

Good Morning!

Since upgrading my home server from 10.3 to 11.0-RELEASE-p1 about a week ago, I have twice had a serious problem where my LSI adapter is having errors and dropping all the drives out of my ZFS pool.

Hardware:
- LSI SAS3081E-R PCI-E card with the IT firmware loaded
- 6x2TB WD Black drives
- 1 SSD
- Supermicro X10SLL-F MB (not sure that is relevant)

This system has been running with this exact hardware for about a year with no problems under the 10.X versions of FreeBSD. Last weekend, I upgraded the system to 11.0-RELEASE-p1. Since then, twice, all of the drives have been marked as unavailable to ZFS after generating a stream of errors.

The problems start with a number of errors like this:

Oct 26 03:28:29 rivendell kernel: mpt0: request 0xfffffe0000f73058:57643 timed out for ccb 0xfffff803456ea000 (req->ccb 0xfffff803456ea000)
Oct 26 03:28:29 rivendell kernel: mpt0: attempting to abort req 0xfffffe0000f73058:57643 function 0
Oct 26 03:28:29 rivendell kernel: mpt0: completing timedout/aborted req 0xfffffe0000f73058:57643
Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): READ(10). CDB: 28 00 04 c4 91 c0 00 00 08 00
Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): CAM status: CCB request terminated by the host
Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): mpt0: Retrying command
Oct 26 03:28:29 rivendell kernel: abort of req 0xfffffe0000f73058:0 completed
Oct 26 03:28:49 rivendell kernel: mpt0: request 0xfffffe0000f6c3b0:57658 timed out for ccb 0xfffff803456ea000 (req->ccb 0xfffff803456ea000)
Oct 26 03:28:49 rivendell kernel: mpt0: attempting to abort req 0xfffffe0000f6c3b0:57658 function 0
Oct 26 03:28:49 rivendell kernel: mpt0: completing timedout/aborted req 0xfffffe0000f6c3b0:57658
Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): READ(10). CDB: 28 00 04 c4 91 c0 00 00 08 00
Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): CAM status: CCB request terminated by the host
Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): Retrying command
Oct 26 03:28:49 rivendell kernel: mpt0: abort of req 0xfffffe0000f6c3b0:0 completed
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): READ(10). CDB: 28 00 04 c4 91 c0 00 00 08 00
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): CAM status: SCSI Status Error
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): SCSI status: Check Condition
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): Retrying command (per sense data)

Also these:

Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): CAM status: SCSI Status Error
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SCSI status: Check Condition
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): Error 6, Retries exhausted
Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): Invalidating pack

After a bunch of rounds of the errors above, I get this:

Oct 26 03:35:17 rivendell kernel: mpt0: request 0xfffffe0000f73350:62027 timed out for ccb 0xfffff800160ce000 (req->ccb 0xfffff800160ce000)
Oct 26 03:35:17 rivendell kernel: mpt0: attempting to abort req 0xfffffe0000f73350:62027 function 0
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_wait_req(1) timed out
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_recover_commands: abort timed-out. Resetting controller
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_cam_event: 0x0
Oct 26 03:35:18 rivendell kernel: mpt0: mpt_cam_event: 0x0
Oct 26 03:35:18 rivendell kernel: mpt0: completing timedout/aborted req 0xfffffe0000f73350:62027

After which all the drives seem to disappear and the system detaches all of them:

Oct 26 03:35:33 rivendell kernel: da1 at mpt0 bus 0 scbus0 target 14 lun 0
Oct 26 03:35:33 rivendell kernel: da1: s/n WD-WMAY01559141 detached
Oct 26 03:35:33 rivendell kernel: da2 at mpt0 bus 0 scbus0 target 15 lun 0
Oct 26 03:35:33 rivendell kernel: da2: s/n WD-WMAY01603430 detached
Oct 26 03:35:33 rivendell kernel: da5 at mpt0 bus 0 scbus0 target 18 lun 0
Oct 26 03:35:33 rivendell kernel: da5: s/n WD-WMAY01159727 detached
Oct 26 03:35:33 rivendell kernel: da6 at mpt0 bus 0 scbus0 target 19 lun 0
Oct 26 03:35:33 rivendell kernel: da6: s/n WD-WMAY02971691 detached
Oct 26 03:35:33 rivendell kernel: da4 at mpt0 bus 0 scbus0 target 17 lun 0
Oct 26 03:35:33 rivendell kernel: da4: s/n WD-WMAY01470856 detached
Oct 26 03:35:33 rivendell kernel: da3 at mpt0 bus 0 scbus0 target 16 lun 0
Oct 26 03:35:33 rivendell kernel: da3: s/n WD-WMAY01602648 detached

At this point I have had to reboot the server and then all the drives magically reappear.

Any help would be greatly appreciated.

-Jeremy

--
Jeremy Beker - @gothmog
http://www.confusticate.com
Condensing fact from the vapor of nuance.
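When comparing incidents like the one reported here, it can help to tally the CAM error messages per device before the controller reset. The following is a minimal Python sketch, assuming the kernel messages have the shape shown in the report; the helper name `tally_events`, the labels in `PATTERNS`, and the `SAMPLE` list are illustrative and not part of any FreeBSD tool.

```python
import re
from collections import Counter

# Matches the CAM device prefix seen in the report, e.g. "(da0:mpt0:0:10:0):".
DEVICE_RE = re.compile(r"\((da\d+):mpt\d+:\d+:\d+:\d+\):\s*(.*)")

# Substrings copied from the log excerpts; the short labels are made up here.
PATTERNS = {
    "terminated": "CCB request terminated by the host",
    "unit_attention": "UNIT ATTENTION",
    "retries_exhausted": "Retries exhausted",
}


def tally_events(lines):
    """Count selected error messages per (device, label) pair."""
    counts = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if not match:
            continue  # controller-level lines such as "mpt0: request ..." are skipped
        device, message = match.groups()
        for label, needle in PATTERNS.items():
            if needle in message:
                counts[(device, label)] += 1
    return counts


# A few lines taken from the report, used only to exercise the helper.
SAMPLE = [
    "Oct 26 03:28:29 rivendell kernel: mpt0: request 0xfffffe0000f73058:57643 timed out for ccb 0xfffff803456ea000 (req->ccb 0xfffff803456ea000)",
    "Oct 26 03:28:29 rivendell kernel: (da0:mpt0:0:10:0): CAM status: CCB request terminated by the host",
    "Oct 26 03:28:49 rivendell kernel: (da0:mpt0:0:10:0): CAM status: CCB request terminated by the host",
    "Oct 26 03:28:51 rivendell kernel: (da0:mpt0:0:10:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)",
    "Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)",
    "Oct 26 03:29:55 rivendell kernel: (da1:mpt0:0:14:0): Error 6, Retries exhausted",
]

if __name__ == "__main__":
    for (device, label), count in sorted(tally_events(SAMPLE).items()):
        print(device, label, count)
```

To run it against a real log, the `SAMPLE` list could be replaced with the lines of the system log file; the path to that file depends on the local syslog configuration.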
--Apple-Mail=_287FB176-443B-4ABB-9793-A87FD841F5AB--