Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 8 Jun 2016 09:39:38 +1000
From:      David Gwynne <david@gwynne.id.au>
To:        Steven Hartland <killing@multiplay.co.uk>
Cc:        freebsd-scsi@freebsd.org
Subject:   Re: Avago LSI SAS 3008 & Intel SSD Timeouts
Message-ID:  <4C234AB2-80E5-49A3-B5BB-24F425AFF067@gwynne.id.au>
In-Reply-To: <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk>
References:  <30c04d8b-80cb-c637-26dc-97caebad3acb@mindpackstudios.com> <b30f968c-cc41-f7de-5a54-35bed961e65a@multiplay.co.uk> <08C01646-9AF3-4E89-A545-C051A284E039@sarenet.es> <986e03a7-5dc8-f5e0-5a17-4bf49459f905@mindpackstudios.com> <2823D96D-881D-4D40-B610-FC8292FA2FC5@sarenet.es> <4072b65d-25d4-2a79-5911-573517b0ee57@mindpackstudios.com> <583dddc6-4614-9900-88f7-27347866d7aa@mindpackstudios.com> <331da785-c88b-d74e-512a-37bdb618d512@multiplay.co.uk> <d8c3284c-97aa-7ae0-48e2-2d6b3e5dcf39@mindpackstudios.com> <94380b81-fcd7-511c-bc35-b8c5459d2ea4@multiplay.co.uk> <99b3b075-3158-29aa-3a33-311594fb9270@mindpackstudios.com> <7e6e7b15-7500-01a5-006e-65a3131b5c17@multiplay.co.uk>

next in thread | previous in thread | raw e-mail | index | archive | help

> On 8 Jun 2016, at 09:30, Steven Hartland <killing@multiplay.co.uk> =
wrote:
>=20
> Oh another thing to test is iirc 3008 is supported by mrsas so you =
might want to try adding the following into loader.conf to switch =
drivers:
> hw.mfi.mrsas_enable=3D"1"

i believe the 3008s can run two different firmwares, one that provides =
the mpt2 interface and the other than provides the megaraid sas fusion =
interface. you have to flash them to switch though, you cant just point =
a driver at it and hope for the best.

each fw presents different pci ids. eg, in =
http://pciids.sourceforge.net/v2.2/pci.ids you can see:

	005f  MegaRAID SAS-3 3008 [Fury]
	0097  SAS3008 PCI-Express Fusion-MPT SAS-3

dlg

>=20
> On 07/06/2016 23:43, list-news wrote:
>> No, it threw errors on both da6 and da7 and then I stopped it.
>>=20
>> Your last e-mail gave me thoughts though.  I have a server with 2008 =
controllers (entirely different backplane design, cpu, memory, etc).  =
I've moved the 4 drives to that and I'm running the test now.
>>=20
>> # uname =3D FreeBSD 10.2-RELEASE-p12 #1 r296215
>> # sysctl dev.mps.0
>> dev.mps.0.spinup_wait_time: 3
>> dev.mps.0.chain_alloc_fail: 0
>> dev.mps.0.enable_ssu: 1
>> dev.mps.0.max_chains: 2048
>> dev.mps.0.chain_free_lowwater: 1176
>> dev.mps.0.chain_free: 2048
>> dev.mps.0.io_cmds_highwater: 510
>> dev.mps.0.io_cmds_active: 0
>> dev.mps.0.driver_version: 20.00.00.00-fbsd
>> dev.mps.0.firmware_version: 17.00.01.00
>> dev.mps.0.disable_msi: 0
>> dev.mps.0.disable_msix: 0
>> dev.mps.0.debug_level: 3
>> dev.mps.0.%parent: pci5
>> dev.mps.0.%pnpinfo: vendor=3D0x1000 device=3D0x0072 subvendor=3D0x1000 =
subdevice=3D0x3020 class=3D0x010700
>> dev.mps.0.%location: slot=3D0 function=3D0
>> dev.mps.0.%driver: mps
>> dev.mps.0.%desc: Avago Technologies (LSI) SAS2008
>>=20
>> About 1.5 hours has passed at full load, no errors.
>>=20
>> gstat drive busy% seems to hang out around 30-40 instead of ~60-70.  =
Overall throughput seems to be 20-30% less with my rough benchmarks.
>>=20
>> I'm not sure if this gets us closer to the answer, if this doesn't =
time-out on the 2008 controller, it looks like one of these:
>> 1) The Intel drive firmware is being overloaded somehow when =
connected to the 3008.
>> or
>> 2) The 3008 firmware or driver has an issue reading drive responses, =
sporadically thinking the command timed-out (when maybe it really =
didn't).
>>=20
>> Puzzle pieces:
>> A) Why does setting tags of 1 on drives connected to the 3008 fix the =
problem?
>> B) With tags of 255.  Why does postgres (and assuming a large fsync =
count), seem to cause the problem within minutes?  While running other =
heavy i/o commands (zpool scrub, bonnie++, fio), all of which show =
similarly high or higher iops take hours to cause the problem (if ever).
>>=20
>> I'll let this continue to run to further test.
>>=20
>> Thanks again for all the help.
>>=20
>> -Kyle
>>=20
>> On 6/7/16 4:22 PM, Steven Hartland wrote:
>>> Always da6?
>>>=20
>>> On 07/06/2016 21:19, list-news wrote:
>>>> Sure Steve:
>>>>=20
>>>> # cat /boot/loader.conf | grep trim
>>>> vfs.zfs.trim.enabled=3D0
>>>>=20
>>>> # sysctl vfs.zfs.trim.enabled
>>>> vfs.zfs.trim.enabled: 0
>>>>=20
>>>> # uptime
>>>> 3:14PM  up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07
>>>>=20
>>>> # tail -f /var/log/messages:
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a =
00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm =
0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, =
connector name (    )
>>>> Jun  7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580 =
allocated tm 0xfffffe0001322150
>>>> Jun  7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1 =
Aborting command 0xfffffe0001375580
>>>> Jun  7 15:13:50 s18 kernel: mpr0: Sending reset from =
mprsas_send_abort for target ID 16
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE =
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command =
timeout cm 0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, =
handle(0x0010)
>>>> Jun  7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8, =
connector name (    )
>>>> Jun  7 15:13:50 s18 kernel: mpr0: queued timedout cm =
0xfffffe00013627a0 for processing by tm 0xfffffe0001322150
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: Flags: 1<InProgress>
>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Started
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm =
0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi =
0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0 =
state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm =
0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi =
0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0b 43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0 =
state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0a 25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm =
0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi =
0 state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 =
0a 25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0 =
state c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a =
00 2b d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm =
0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi =
0 state c    (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 =
00 00 00 00 00 00 length 0 SMID 786 completed timedout cm =
0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
>>>> Jun  7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 =
during recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command =
timeout
>>>> Jun  7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE =
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 =
terminated ioc 804b scsi 0 sta(da6:te c xfer 0
>>>> Jun  7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 =
abort TaskMID 1016 status 0x0 code 0x0 count 5
>>>> Jun  7 15:13:50 s18 kernel: 16:    (xpt0:mpr0:0:16:0): SMID 1 =
finished recovery after aborting TaskMID 1016
>>>> Jun  7 15:13:50 s18 kernel: 0): mpr0: Retrying command
>>>> Jun  7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 4
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: EnclosureHandle: 0x2
>>>> Jun  7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
>>>> Jun  7 15:13:50 s18 kernel: NumPhys: 31
>>>> Jun  7 15:13:50 s18 kernel: NumEntries: 1
>>>> Jun  7 15:13:50 s18 kernel: StartPhyNum: 8
>>>> Jun  7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
>>>> Jun  7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on  =
Event: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: =
[16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on  =
Event: [1c]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: =
[1c]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: EventReply    :
>>>> Jun  7 15:13:50 s18 kernel: EventDataLength: 2
>>>> Jun  7 15:13:50 s18 kernel: AckRequired: 0
>>>> Jun  7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
>>>> Jun  7 15:13:50 s18 kernel: EventContext: 0x0
>>>> Jun  7 15:13:50 s18 kernel: Flags: 0
>>>> Jun  7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
>>>> Jun  7 15:13:50 s18 kernel: PhysicalPort: 0
>>>> Jun  7 15:13:50 s18 kernel: DiscoveryStatus: 0
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on  =
Event: [16]
>>>> Jun  7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: =
[16]
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE =
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI =
Status Error
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check =
Condition
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT =
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
>>>> Jun  7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command =
(per sense data)
>>>>=20
>>>> -Kyle
>>>>=20
>>>> On 6/7/16 2:53 PM, Steven Hartland wrote:
>>>>> CDB: 85 is a TRIM command IIRC, I know you tried it before using =
BIO delete but assuming your running ZFS can you set the following in =
loader.conf and see how you get on.
>>>>> vfs.zfs.trim.enabled=3D0
>>>>>=20
>>>>>    Regards
>>>>>    Steve
>>>>=20
>>>>=20
>>>> _______________________________________________
>>>> freebsd-scsi@freebsd.org mailing list
>>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>>> To unsubscribe, send any mail to =
"freebsd-scsi-unsubscribe@freebsd.org"
>>>=20
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to =
"freebsd-scsi-unsubscribe@freebsd.org"
>>=20
>>=20
>> _______________________________________________
>> freebsd-scsi@freebsd.org mailing list
>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>> To unsubscribe, send any mail to =
"freebsd-scsi-unsubscribe@freebsd.org"
>=20
> _______________________________________________
> freebsd-scsi@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> To unsubscribe, send any mail to =
"freebsd-scsi-unsubscribe@freebsd.org"




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4C234AB2-80E5-49A3-B5BB-24F425AFF067>