Date: Wed, 8 Jun 2016 09:39:38 +1000 From: David Gwynne <david@gwynne.id.au> To: Steven Hartland <killing@multiplay.co.uk> Cc: freebsd-scsi@freebsd.org Subject: Re: Avago LSI SAS 3008 & Intel SSD Timeouts Message-ID: <4C234AB2-80E5-49A3-B5BB-24F425AFF067@gwynne.id.au> In-Reply-To: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk> References: <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk>
next in thread | previous in thread | raw e-mail | index | archive | help
> On 8 Jun 2016, at 09:30, Steven Hartland <killing@multiplay.co.uk> = wrote: >=20 > Oh another thing to test is iirc 3008 is supported by mrsas so you = might want to try adding the following into loader.conf to switch = drivers: > hw.mfi.mrsas_enable=3D"1" i believe the 3008s can run two different firmwares, one that provides = the mpt2 interface and the other than provides the megaraid sas fusion = interface. you have to flash them to switch though, you cant just point = a driver at it and hope for the best. each fw presents different pci ids. eg, in = http://pciids.sourceforge.net/v2.2/pci.ids you can see: 005f MegaRAID SAS-3 3008 [Fury] 0097 SAS3008 PCI-Express Fusion-MPT SAS-3 dlg >=20 > On 07/06/2016 23:43, list-news wrote: >> No, it threw errors on both da6 and da7 and then I stopped it. >>=20 >> Your last e-mail gave me thoughts though. I have a server with 2008 = controllers (entirely different backplane design, cpu, memory, etc). = I've moved the 4 drives to that and I'm running the test now. >>=20 >> # uname =3D FreeBSD 10.2-RELEASE-p12 #1 r296215 >> # sysctl dev.mps.0 >> dev.mps.0.spinup_wait_time: 3 >> dev.mps.0.chain_alloc_fail: 0 >> dev.mps.0.enable_ssu: 1 >> dev.mps.0.max_chains: 2048 >> dev.mps.0.chain_free_lowwater: 1176 >> dev.mps.0.chain_free: 2048 >> dev.mps.0.io_cmds_highwater: 510 >> dev.mps.0.io_cmds_active: 0 >> dev.mps.0.driver_version: 20.00.00.00-fbsd >> dev.mps.0.firmware_version: 17.00.01.00 >> dev.mps.0.disable_msi: 0 >> dev.mps.0.disable_msix: 0 >> dev.mps.0.debug_level: 3 >> dev.mps.0.%parent: pci5 >> dev.mps.0.%pnpinfo: vendor=3D0x1000 device=3D0x0072 subvendor=3D0x1000 = subdevice=3D0x3020 class=3D0x010700 >> dev.mps.0.%location: slot=3D0 function=3D0 >> dev.mps.0.%driver: mps >> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008 >>=20 >> About 1.5 hours has passed at full load, no errors. >>=20 >> gstat drive busy% seems to hang out around 30-40 instead of ~60-70. = Overall throughput seems to be 20-30% less with my rough benchmarks. >>=20 >> I'm not sure if this gets us closer to the answer, if this doesn't = time-out on the 2008 controller, it looks like one of these: >> 1) The Intel drive firmware is being overloaded somehow when = connected to the 3008. >> or >> 2) The 3008 firmware or driver has an issue reading drive responses, = sporadically thinking the command timed-out (when maybe it really = didn't). >>=20 >> Puzzle pieces: >> A) Why does setting tags of 1 on drives connected to the 3008 fix the = problem? >> B) With tags of 255. Why does postgres (and assuming a large fsync = count), seem to cause the problem within minutes? While running other = heavy i/o commands (zpool scrub, bonnie++, fio), all of which show = similarly high or higher iops take hours to cause the problem (if ever). >>=20 >> I'll let this continue to run to further test. >>=20 >> Thanks again for all the help. >>=20 >> -Kyle >>=20 >> On 6/7/16 4:22 PM, Steven Hartland wrote: >>> Always da6? >>>=20 >>> On 07/06/2016 21:19, list-news wrote: >>>> Sure Steve: >>>>=20 >>>> # cat /boot/loader.conf | grep trim >>>> vfs.zfs.trim.enabled=3D0 >>>>=20 >>>> # sysctl vfs.zfs.trim.enabled >>>> vfs.zfs.trim.enabled: 0 >>>>=20 >>>> # uptime >>>> 3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07 >>>>=20 >>>> # tail -f /var/log/messages: >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a = 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm = 0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010) >>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, = connector name ( ) >>>> Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 = allocated tm 0xfffffe0001322150 >>>> Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 = Aborting command 0xfffffe0001375580 >>>> Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from = mprsas_send_abort for target ID 16 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE = CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command = timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, = handle(0x0010) >>>> Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, = connector name ( ) >>>> Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm = 0xfffffe00013627a0 for processing by tm 0xfffffe0001322150 >>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>>> Jun 7 15:13:50 s18 kernel: Flags: 1<InProgress> >>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started >>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm = 0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi = 0 state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 = state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm = 0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi = 0 state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 = state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm = 0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi = 0 state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 = 0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 = state c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a = 00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm = 0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi = 0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 = 00 00 00 00 00 00 length 0 SMID 786 completed timedout cm = 0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00 >>>> Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 = during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command = timeout >>>> Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE = CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 = terminated ioc 804b scsi 0 sta(da6:te c xfer 0 >>>> Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 = abort TaskMID 1016 status 0x0 code 0x0 count 5 >>>> Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 = finished recovery after aborting TaskMID 1016 >>>> Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command >>>> Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16 >>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 4 >>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>>> Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c) >>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>>> Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2 >>>> Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9 >>>> Jun 7 15:13:50 s18 kernel: NumPhys: 31 >>>> Jun 7 15:13:50 s18 kernel: NumEntries: 1 >>>> Jun 7 15:13:50 s18 kernel: StartPhyNum: 8 >>>> Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3) >>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>>> Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010 >>>> Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb) >>>> Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange >>>> Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on = Event: [16] >>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: = [16] >>>> Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on = Event: [1c] >>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: = [1c] >>>> Jun 7 15:13:50 s18 kernel: mpr0: EventReply : >>>> Jun 7 15:13:50 s18 kernel: EventDataLength: 2 >>>> Jun 7 15:13:50 s18 kernel: AckRequired: 0 >>>> Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16) >>>> Jun 7 15:13:50 s18 kernel: EventContext: 0x0 >>>> Jun 7 15:13:50 s18 kernel: Flags: 0 >>>> Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete >>>> Jun 7 15:13:50 s18 kernel: PhysicalPort: 0 >>>> Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0 >>>> Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on = Event: [16] >>>> Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: = [16] >>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE = CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 >>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI = Status Error >>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check = Condition >>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT = ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred) >>>> Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command = (per sense data) >>>>=20 >>>> -Kyle >>>>=20 >>>> On 6/7/16 2:53 PM, Steven Hartland wrote: >>>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using = BIO delete but assuming your running ZFS can you set the following in = loader.conf and see how you get on. >>>>> vfs.zfs.trim.enabled=3D0 >>>>>=20 >>>>> Regards >>>>> Steve >>>>=20 >>>>=20 >>>> _______________________________________________ >>>> freebsd-scsi@freebsd.org mailing list >>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >>>> To unsubscribe, send any mail to = "freebsd-scsi-unsubscribe@freebsd.org" >>>=20 >>> _______________________________________________ >>> freebsd-scsi@freebsd.org mailing list >>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >>> To unsubscribe, send any mail to = "freebsd-scsi-unsubscribe@freebsd.org" >>=20 >>=20 >> _______________________________________________ >> freebsd-scsi@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi >> To unsubscribe, send any mail to = "freebsd-scsi-unsubscribe@freebsd.org" >=20 > _______________________________________________ > freebsd-scsi@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi > To unsubscribe, send any mail to = "freebsd-scsi-unsubscribe@freebsd.org"
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4C234AB2-80E5-49A3-B5BB-24F425AFF067>