Date:      Tue, 24 Jul 2018 14:22:28 -0600
From:      Stephen Mcconnell <stephen.mcconnell@broadcom.com>
To:        Oliver Sech <crimsonthunder@gmx.net>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject:   RE: problems with SAS JBODs 2
Message-ID:  <0f26466617df38fd998dc87948b27273@mail.gmail.com>
In-Reply-To: <530b3e8e-4d76-e601-dd74-0ab6a06ebe25@gmx.net>
References:  <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01> <CAOtMX2h8r31AeNCKyckK2P0VLn1CKFogo9bWom2So1x2ngpa4A@mail.gmail.com> <237f77ab-89e2-188b-b2b1-84c6d88609b0@gmx.net> <b785fe02-9242-c95f-56cb-2130f90e17f5@gmx.net> <3caf8ccd6fde8cfc4db25bae5327c46b@mail.gmail.com> <0af047d477d15ec364140653bd967c89@mail.gmail.com> <54B10B7C-CDCE-4428-B584-59CE8F38B120@freebsd.org> <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net> <7C1E630B-65AD-4FE8-BFDF-F13068070B5E@freebsd.org> <6e0b8652-f227-271e-aeb4-a868ba6b90e2@gmx.net> <530b3e8e-4d76-e601-dd74-0ab6a06ebe25@gmx.net>

Oliver, can you try changing the mapping mode on the controller? I think
you're using Enclosure/Slot Mapping and I want to see what happens with
Device Persistent Mapping. To do that, follow these steps:
1.	Run Ken’s script to clear the DPM entries.
2.	Use LSIUtil to change the mapping mode in IOC Page 8: Command 9, Page
Type 1, Page Number 8. If you see 00000002 at offset 0x0C, you're using
Enclosure/Slot Mapping, and I'd like you to change this. You will be asked
if you want to make changes. Select ‘yes’ and then change offset 0x0C to
00000001 (you might have to type C instead of 0x0C for the offset). Just use
the default setting to change NVRAM.
3.	Reboot, see what happens, and let me know how it goes.
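
One way to make the before/after check in step 3 concrete is to save the
output of `camcontrol devlist` before changing the mapping mode and again
after the reboot, then compare which peripherals are listed. The following is
only a minimal sketch in Python. It assumes the usual
`<vendor model rev>  at scbusN target T lun L (passX,daY)` line layout, and
the function names `peripherals` and `compare` are hypothetical helpers, not
part of Ken's script, LSIUtil, or camcontrol.

```python
import re

# Matches the trailing "(pass3,da2)" or "(da2,pass3)" group on a devlist line.
_PERIPHS = re.compile(r"\(([^()]*)\)\s*$")

# Peripheral name prefixes that are not disks or enclosures (assumption:
# pass-through and xpt entries are not interesting for this comparison).
_SKIP_PREFIXES = ("pass", "xpt")


def peripherals(devlist_text):
    """Return the set of peripheral names (da0, ses1, ...) in devlist output."""
    names = set()
    for line in devlist_text.splitlines():
        match = _PERIPHS.search(line)
        if not match:
            continue  # bus header lines from "devlist -v", blank lines, etc.
        for name in match.group(1).split(","):
            name = name.strip()
            if name and not name.startswith(_SKIP_PREFIXES):
                names.add(name)
    return names


def compare(before_text, after_text):
    """Return (missing, new): names only in 'before', names only in 'after'."""
    before = peripherals(before_text)
    after = peripherals(after_text)
    return sorted(before - after), sorted(after - before)
```

To use it, read the two saved listings into strings and call
`compare(before_text, after_text)`. An empty first list means every da and
ses device that was present before is present again. Note that da numbering
may change across a reboot or a mapping-mode change, so a name-based diff is
only a first check.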


Steve

> -----Original Message-----
> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org]
> On Behalf Of Oliver Sech
> Sent: Tuesday, July 24, 2018 12:23 PM
> To: FreeBSD-scsi
> Subject: Re: problems with SAS JBODs 2
>
> update 2: I continued to test with more and different hardware.
>
> tested with a LSI SAS9207-8e HBA:
> * after disconnect, all devices (/dev/daX, /dev/ses) properly disappear;
>   no rescans or writes necessary
> * no more targets in mpsutil (not mprutil)
> * after reconnect, all disks and all ses devs appear!
>
> tested with hardware raid LSI SAS 9286CV-8e
> * no problems with the shelf/sas in different configurations
> * switching the controller and importing configuration works reliably
>
> So far I think there is a problem with the mpr driver, and I'm quite
> confident that it does affect other people.
> With a simple configuration it is probably not immediately noticeable, as
> everything seems to work after the first connect/boot.
> It probably gets scarier for people with multipathing and big SAS chains,
> I guess...
>
> I will downgrade to SAS2 HBAs shortly as I'm running out of space. If
> there is anything I can help with while I still have hardware in the lab,
> let me know.
>
> Oliver
>
> On 07/23/2018 04:14 PM, Oliver Sech wrote:
> > Sorry for the delay. I moved to a different office and could not focus
> > on this issue last week.
> >
> > I tested all of the hardware with different drivers and firmware on
> > Linux to make sure this is not a hardware problem:
> > * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
> > * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
> > * Firmware 16.00.01.00  + Driver 26.000.00.00 -> BAD (42 out of 44 disks
> >   after reconnect)
> > * Firmware 16.00.01.00  + Driver 12.100.00.00 -> BAD (42 out of 44 disks
> >   after reconnect)
> >
> > I tested a different HBA with an old firmware as well and there were no
> > issues. Only with the latest FW are disks missing after a reconnect, with
> > the error "mpt3sas_cm0: device is not present handle".
> > I don't know yet how the firmware versions between 09.00.000.00 and
> > 16... behave.
> >
> > Additional Info/Changes:
> > * Upgraded the test system to 11.2 as suggested on the mailing list.
> >   -> No change
> > * "camcontrol rescan all" removes the devices that are still present
> >   after the cable has been removed. "camcontrol devlist -v" does not show
> >   them anymore.
> >
> >
> > Setting the driver tunable "use_phy_num" to 0 and using the clearDPM
> > script between connects does not help. In fact, I do not see any
> > different behavior at all.
> > I reflashed the controller multiple times and erased everything except
> > the "manufacturing" area to make sure that no previous settings are kept.
> > The only thing I know of that "fixes" the missing drives is rebooting the
> > server.
> >
> > A (similar?) problem also occurs once I start the server with all 6 disk
> > shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up
> > properly with 5 shelves; once I offline connect the 6th shelf, some
> > random disks are missing and I can no longer import the ZFS pool.
> >
> > The following logs were collected with the very old FW 09.00.101.00 that
> > worked on Linux.
> > Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
> >
> > best regards,
> > Oliver
> >
> > On 07/12/2018 03:38 PM, Ken Merry wrote:
> >>
> >>> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder@gmx.net>
> >>> wrote:
> >>>
> >>> On 07/11/2018 10:35 PM, Ken Merry wrote:
> >>>> Oliver, what happens when you try to do I/O to the devices that don’t
> >>>> go away after you pull the cable?  Does that cause the devices to go
> >>>> away?
> >>>
> >>> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least
> >>> the "da" device disappears.
> >>
> >> Ok, that’s good.  Can you send the dmesg output and check with
> >> ‘camcontrol devlist -v’ to make sure the device has fully gone away?
> >>
> >> The reason I ask is that I have spent lots of time over the years
> >> debugging device arrival and departure problems in CAM, GEOM and devfs,
> >> and I want to make sure we aren’t running into any non-SAS related
> >> problems.
> >>
> >>>
> >>>> Looking at the mprutil output, it also shows the devices sticking
> >>>> around from the adapter’s standpoint.
> >>>>
> >>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’
> >>>> (where N is the scbus number shown by ‘camcontrol devlist -v’).  That
> >>>> will do some basic probes for each of the devices and should in theory
> >>>> cause them to go away if they aren’t accessible.
> >>>>
> >>>> It seems like the adapter may not be recognizing that the devices in
> >>>> question have gone.
> >>>
> >>>
> >>> I'm pretty sure that I tried this 'camcontrol rescan all' a few times.
> >>> While I am not sure anymore whether that cleans up the non-working
> >>> devices, I'm sure that no new devices were added.
> >>
> >> If doing a read from the device with dd makes it go away, ‘camcontrol
> >> rescan all’ should make it go away as well.  It sends a command to every
> >> device, and if the mpr(4) driver tells CAM the drive is no longer there,
> >> it’ll get removed.
> >>
> >> If it doesn’t cause the device to get removed (and the rescan doesn’t
> >> hang), it means that you’re getting a response from a device that is no
> >> longer physically connected to the machine, which is impossible with SAS.
> >>
> >>>
> >>> Unfortunately I haven't gotten to Steve's 'clear controller mapping'
> >>> script yet, but I did a few other things:
> >>
> >> Steve’s email made it sound like he was going to send it.  I just sent
> >> it to you separately.
> >>
> >>> * The last time I tried to upgrade the firmware I had all sorts of
> >>> problems. "sas3flash" reported bad checksums while flashing some of
> >>> the files.
> >>> So I reflashed both controllers with the DOS version of sas3flash.
> >>> This was basically a challenge in itself because the DOS version of
> >>> this utility does not seem to run on computers of this decade.
> >>> (ERROR:  Failed to initialize PAL.  Exiting program.)
> >>> The equivalent sas3flash.EFI version seems to be out of date and
> >>> caused the checksum problems described before.
> >>> (This time I wiped them before flashing with "sas3flash -o -e 6".)
> >>
> >> That is unfortunate…perhaps Steve has some insight.
> >>
> >>>
> >>> * I tried to change the mpr tunable "use_phy_num" after that, but this
> >>> has not improved the situation. I will retry and collect logs with
> >>> Steve's script.
> >>
> >> Changed it to what?  I think it defaults to 1.  Did you try 0?
> >>
> >>> * I retried with the latest "mpr.ko" from the Broadcom download page.
> >>> (Same problems, no "use_phy_num" tunable.)
> >>>
> >>> * I retested this hardware with Linux (4.15 and 4.17)
> >>> ** Some shelves could be replugged reliably (i.e. 45 disks show up, 45
> >>> disks disappear, 45 disks reappear)
> >>> ** With the newest shelf, 2 disks were missing after the replugging
> >>> (i.e. 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel
> >>> log: "mpt3sas_cm0: device is not present handle")
> >>>
> >>> * I tried a different controller
> >>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
> >>> (Firmware 16.00.01.00 or 15.00.00.00)
> >>> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI
> >>> 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something
> >>> similar with 09*))
> >>> With the new controller everything seems to work on Linux. It might be
> >>> the old firmware?...
> >>> It is better with the new controller on FreeBSD in the sense that I at
> >>> least get one out of two /dev/sesX devices back. But disks are still
> >>> missing and are not getting completely cleaned up…
> >>
> >> It does sound a bit like a mapping table problem.  Clearing it might
> >> help, we’ll see.
> >>
> >>> This whole thing is a bit frustrating, especially since up until now I
> >>> thought that HBAs are kind of "connect and forget" devices. Next step
> >>> is to set up a separate test environment and try to get it to work
> >>> there. I will keep you updated and try to provide logs for all FreeBSD
> >>> related problems.
> >>
> >> Thanks for debugging this.  Unfortunately there are a number of ways it
> >> can go wrong.  The mapping code has been the source of some problems,
> >> sometimes enclosure vendors do the wrong thing, and sometimes there are
> >> other bugs.
> >>
> >> Ken
> >>
> > _______________________________________________
> > freebsd-scsi@freebsd.org mailing list
> > https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
> > To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
> >


