Date:      Wed, 25 Jul 2018 12:24:21 +0200
From:      Oliver Sech <crimsonthunder@gmx.net>
To:        Stephen Mcconnell <stephen.mcconnell@broadcom.com>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject:   Re: problems with SAS JBODs 2
Message-ID:  <77b55ca6-25ce-3b26-e2f6-b0702a49ab28@gmx.net>
In-Reply-To: <0f26466617df38fd998dc87948b27273@mail.gmail.com>
References:  <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01> <CAOtMX2h8r31AeNCKyckK2P0VLn1CKFogo9bWom2So1x2ngpa4A@mail.gmail.com> <237f77ab-89e2-188b-b2b1-84c6d88609b0@gmx.net> <b785fe02-9242-c95f-56cb-2130f90e17f5@gmx.net> <3caf8ccd6fde8cfc4db25bae5327c46b@mail.gmail.com> <0af047d477d15ec364140653bd967c89@mail.gmail.com> <54B10B7C-CDCE-4428-B584-59CE8F38B120@freebsd.org> <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net> <7C1E630B-65AD-4FE8-BFDF-F13068070B5E@freebsd.org> <6e0b8652-f227-271e-aeb4-a868ba6b90e2@gmx.net> <530b3e8e-4d76-e601-dd74-0ab6a06ebe25@gmx.net> <0f26466617df38fd998dc87948b27273@mail.gmail.com>

I ran the clear_dpm.sh script and changed the value you suggested. Rebooted and retested. As far as I can tell there is no difference.

I tried the menu option (99.  Reset port) in lsiutil and this helps with missing devices. After resetting the port I get all my disks and ses devs again.

Read NVRAM or current values?  [0=NVRAM, 1=Current, default is 0] 

0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
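
For anyone comparing dumps like the one above, here is a minimal sketch that parses the "offset : value" lines and reports the word at offset 0x0C. The two labels are taken only from Stephen's description quoted below (00000002 = Enclosure/Slot Mapping, 00000001 = the value he asked for to get Device Persistent Mapping). The word is treated as an opaque value and individual bits are not decoded; the function and variable names are made up for illustration.

```python
import re

# Dump copied from the lsiutil output above.
DUMP = """\
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
"""

# Assumption: labels follow the description in this thread only;
# the word is compared as a whole, no bit fields are decoded.
MAPPING_MODES = {
    0x00000001: "Device Persistent Mapping",
    0x00000002: "Enclosure/Slot Mapping",
}


def parse_dump(text):
    """Return {offset: 32-bit word} for every 'OFFS : VALUE' line."""
    words = {}
    for line in text.splitlines():
        m = re.match(r"\s*([0-9a-fA-F]{4})\s*:\s*([0-9a-fA-F]{8})\s*$", line)
        if m:
            words[int(m.group(1), 16)] = int(m.group(2), 16)
    return words


def mapping_mode(words):
    """Describe the word at offset 0x0C using the labels above."""
    value = words.get(0x0C)
    if value is None:
        return "offset 0x0C not present"
    return MAPPING_MODES.get(value, "unknown (0x%08x)" % value)


if __name__ == "__main__":
    print(mapping_mode(parse_dump(DUMP)))
```

Run against the dump above, this reports the value Stephen asked for, which matches the change I made.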

On 07/24/2018 10:22 PM, Stephen Mcconnell wrote:
> Oliver, can you try changing the mapping mode on the controller? I think
> you're using Enclosure/Slot Mapping and I want to see what happens with
> Device Persistent Mapping. To do that, follow these steps:
> 1.	Run Ken’s script to clear the DPM entries
> 2.	Use LSIUtil to change the mapping mode in IOC Page 8. Command 9, Page
> Type 1, Page Number 8. If you see 00000002 at offset 0x0C you're using
> Enclosure/Slot Mapping and I'd like you to change this. You will be asked if
> you want to make changes. Select ‘yes’ and then change offset 0x0C to
> 00000001 (you might have to type C instead of 0x0C for the offset). Just use
> the default setting to change NVRAM.
> 3.	Reboot and see what happens and let me know how it goes.
> 
> 
> Steve
> 
>> -----Original Message-----
>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-
>> scsi@freebsd.org] On Behalf Of Oliver Sech
>> Sent: Tuesday, July 24, 2018 12:23 PM
>> To: FreeBSD-scsi
>> Subject: Re: problems with SAS JBODs 2
>>
>> update 2: I continued to test with more and different hardware.
>>
>> tested with a LSI SAS9207-8e HBA:
>> * after disconnect all devices properly disappear (/dev/daX, /dev/ses);
>>   no rescans or writing necessary
>> * no more targets in mpsutil (not mprutil)
>> * after reconnect all disks and all ses devs appear!
>>
>> tested with hardware raid LSI SAS 9286CV-8e
>> * no problems with the shelf/sas in different configurations
>> * switching the controller and importing configuration works reliably
>>
>> So far I think there is a problem with the mpr driver and I'm quite confident
>> that it does affect other people.
>> With a simple configuration it is probably not immediately noticeable, as
>> everything seems to work after the first connect/boot.
>> It probably gets scarier for people with multipathing and big SAS chains, I
>> guess...
>>
>> I will downgrade to SAS2 HBAs shortly as I'm running out of space. If there is
>> anything I can help with while I still have hardware in the lab, let me know.
>>
>> Oliver
>>
>> On 07/23/2018 04:14 PM, Oliver Sech wrote:
>>> Sorry for the delay. I moved to a different office and could not focus on
>>> this issue last week.
>>>
>>> I tested all of the hardware with different drivers and firmware on Linux to
>>> make sure this is not a hardware problem:
>>> * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
>>> * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
>>> * Firmware 16.00.01.00  + Driver 26.000.00.00 -> BAD (42 out of 44 disks after reconnect)
>>> * Firmware 16.00.01.00  + Driver 12.100.00.00 -> BAD (42 out of 44 disks after reconnect)
>>>
>>> I tested a different HBA with an old firmware as well and there were no
>>> issues. Only with the latest FW are disks missing after a reconnect, with the
>>> error "mpt3sas_cm0: device is not present handle".
>>> I don't know yet how the firmware versions between 09.00.000.00 and 16...
>>> behave.
>>>
>>> Additional Info/Changes:
>>> * Upgraded the test system to 11.2 as suggested on the mailing list. -> No change
>>> * "camcontrol rescan all" removes the devices that are still present after
>>>   the cable has been removed. "camcontrol devlist -v" does not show them anymore.
>>>
>>>
>>> Setting the driver "use_phy_num" to 0 and using the clearDPM script between
>>> connects does not help. In fact, I do not see any different behavior at all.
>>> I reflashed the controller multiple times and erased everything except the
>>> "manufacturing" area to make sure that no previous settings are kept.
>>> The only thing I know that "fixes" the missing drives is to reboot the server.
>>>
>>> A (similar?) problem also occurs once I start the server with all 6 disk
>>> shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up
>>> properly with 5 shelves; once I offline connect the 6th shelf, some
>>> random disks are missing and I can no longer import the ZFS pool.
>>>
>>> The following logs were collected with the very old FW 09.00.101.00 that
>>> worked on Linux.
>>> Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
>>>
>>> best regards,
>>> Oliver
>>>
>>> On 07/12/2018 03:38 PM, Ken Merry wrote:
>>>>
>>>>> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder@gmx.net> wrote:
>>>>>
>>>>> On 07/11/2018 10:35 PM, Ken Merry wrote:
>>>>>> Oliver, what happens when you try to do I/O to the devices that don’t
>>>>>> go away after you pull the cable?  Does that cause the devices to go away?
>>>>>
>>>>> I tried 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the
>>>>> "da" device disappears.
>>>>
>>>> Ok, that’s good.  Can you send the dmesg output and check with
>>>> ‘camcontrol devlist -v’ to make sure the device has fully gone away?
>>>>
>>>> The reason I ask is that I have spent lots of time over the years debugging
>>>> device arrival and departure problems in CAM, GEOM and devfs, and I want
>>>> to make sure we aren’t running into any non-SAS related problems.
>>>>
>>>>>
>>>>>> Looking at the mprutil output, it also shows the devices sticking around
>>>>>> from the adapter’s standpoint.
>>>>>>
>>>>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’
>>>>>> (where N is the scbus number shown by ‘camcontrol devlist -v’).  That will do
>>>>>> some basic probes for each of the devices and should in theory cause them
>>>>>> to go away if they aren’t accessible.
>>>>>>
>>>>>> It seems like the adapter may not be recognizing that the devices in
>>>>>> question have gone.
>>>>>
>>>>>
>>>>> I'm pretty sure that I tried this 'camcontrol rescan all' a few times. While
>>>>> I'm not sure anymore whether that cleans up the non-working devices, I'm sure
>>>>> that no new devices were added.
>>>>
>>>> If doing a read from the device with dd makes it go away, ‘camcontrol
>>>> rescan all’ should make it go away as well.  It sends a command to every
>>>> device, and if the mpr(4) driver tells CAM the drive is no longer there,
>>>> it’ll get removed.
>>>>
>>>> If it doesn’t cause the device to get removed (and the rescan doesn’t
>>>> hang), it means that you’re getting a response from a device that is no
>>>> longer physically connected to the machine, which is impossible with SAS.
>>>>
>>>>>
>>>>> Unfortunately I haven't gotten to Steve's 'clear controller mapping'
>>>>> script yet, but I did a few other things:
>>>>
>>>> Steve’s email made it sound like he was going to send it.  I just sent it to
>>>> you separately.
>>>>
>>>>> * The last time I tried to upgrade the firmware I had all sorts of
>>>>> problems. "sas3flash" reported bad checksums while flashing some of the files.
>>>>> So I reflashed both controllers with the DOS version of sas3flash. This
>>>>> was basically a challenge in itself because the DOS version of this utility does
>>>>> not seem to run on computers of this decade. (ERROR:  Failed to initialize
>>>>> PAL.  Exiting program.)
>>>>> The equivalent sas3flash.EFI version seems to be out of date and caused
>>>>> the checksum problems described before.
>>>>> (This time I wiped them before flashing with "sas3flash -o -e 6".)
>>>>
>>>> That is unfortunate…perhaps Steve has some insight.
>>>>
>>>>>
>>>>> * I tried to change the mpr tunable "use_phy_num" after that but this has
>>>>> not improved the situation. I will retry and collect logs with Steve's script.
>>>>
>>>> Changed it to what?  I think it defaults to 1.  Did you try 0?
>>>>
>>>>> * I retried with the latest "mpr.ko" from the Broadcom download page.
>>>>> (Same problems, no "use_phy_num" tunable.)
>>>>>
>>>>> * I retested this hardware with Linux (4.15 and 4.17)
>>>>> ** Some shelves could be replugged reliably (i.e. 45 disks show up, 45
>>>>> disks disappear, 45 disks reappear)
>>>>> ** With the newest shelf, 2 disks were missing after replugging (i.e. 44
>>>>> disks show up, 44 disks disappear, 42 disks reappear) (kernel log:
>>>>> "mpt3sas_cm0: device is not present handle")
>>>>>
>>>>> * I tried a different controller
>>>>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
>>>>> (Firmware 16.00.01.00 or 15.00.00.00)
>>>>> ** Yesterday I switched to a new, fresh out-of-the-box Broadcom LSI
>>>>> 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar
>>>>> with 09*))
>>>>> With the new controller everything seems to work on Linux. It might be the
>>>>> old firmware?...
>>>>> It is better with the new controller on FreeBSD in the sense that I at
>>>>> least get one out of two /dev/sesX devices back. But disks are still missing
>>>>> and are not getting completely cleaned up…
>>>>
>>>> It does sound a bit like a mapping table problem.  Clearing it might help,
>>>> we’ll see.
>>>>
>>>>> This whole thing is a bit frustrating, especially since up until now I
>>>>> thought that HBAs are kind of "connect and forget" devices. Next step is to
>>>>> set up a separate test environment and try to get it to work there. I will keep
>>>>> you updated and try to provide logs for all FreeBSD related problems.
>>>>
>>>> Thanks for debugging this.  Unfortunately there are a number of ways it
>>>> can go wrong.  The mapping code has been the source of some problems,
>>>> sometimes enclosure vendors do the wrong thing, and sometimes there are
>>>> other bugs.
>>>>
>>>> Ken
>>>>
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
>>>