Date: Thu, 12 Jul 2018 09:38:36 -0400
From: Ken Merry <ken@freebsd.org>
To: Oliver Sech <crimsonthunder@gmx.net>
Cc: Stephen Mcconnell <stephen.mcconnell@broadcom.com>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: Re: problems with SAS JBODs 2
Message-ID: <7C1E630B-65AD-4FE8-BFDF-F13068070B5E@freebsd.org>
In-Reply-To: <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net>
References: <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01> <CAOtMX2h8r31AeNCKyckK2P0VLn1CKFogo9bWom2So1x2ngpa4A@mail.gmail.com> <237f77ab-89e2-188b-b2b1-84c6d88609b0@gmx.net> <b785fe02-9242-c95f-56cb-2130f90e17f5@gmx.net> <3caf8ccd6fde8cfc4db25bae5327c46b@mail.gmail.com> <0af047d477d15ec364140653bd967c89@mail.gmail.com> <54B10B7C-CDCE-4428-B584-59CE8F38B120@freebsd.org> <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net>

> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder@gmx.net> wrote:
>
> On 07/11/2018 10:35 PM, Ken Merry wrote:
>> Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable?  Does that cause the devices to go away?
>
> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the "da" device disappears.

Ok, that’s good.  Can you send the dmesg output and check with ‘camcontrol devlist -v’ to make sure the device has fully gone away?

The reason I ask is that I have spent lots of time over the years debugging device arrival and departure problems in CAM, GEOM and devfs, and I want to make sure we aren’t running into any non-SAS related problems.

>
>> Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.
>>
>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’).  That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
>>
>> It seems like the adapter may not be recognizing that the devices in question have gone.
>
>
> I'm pretty sure that I tried this 'camcontrol rescan all' a few times. While I'm not sure anymore if that cleans up the non-working devices, I'm sure that no new devices were added.

If doing a read from the device with dd makes it go away, ‘camcontrol rescan all’ should make it go away as well.  It sends a command to every device, and if the mpr(4) driver tells CAM the drive is no longer there, it’ll get removed.
If it doesn’t cause the device to get removed (and the rescan doesn’t hang), it means that you’re getting a response from a device that is no longer physically connected to the machine, which is impossible with SAS.

>
> Unfortunately I haven't gotten to Steve's 'clear controller mapping' script yet, but I did a few other things:

Steve’s email made it sound like he was going to send it.  I just sent it to you separately.

> * The last time I tried to upgrade the firmware I had all sorts of problems. "sas3flash" reported bad checksums while flashing some of the files.
>   So I reflashed both controllers with the DOS version of sas3flash. This was basically a challenge in itself because the DOS version of this utility does not seem to run on computers of this decade. (ERROR: Failed to initialize PAL. Exiting program.)
>   The equivalent sas3flash.EFI version seems to be out of date and caused the checksum problems described before.
>   (This time I wiped them before flashing with "sas3flash -o -e 6".)

That is unfortunate…perhaps Steve has some insight.

>
> * I tried to change mpr tuneable "use_phy_num" after that but this has not improved the situation. I will retry and collect logs with Steve's script.

Changed it to what?  I think it defaults to 1.  Did you try 0?

> * I retried with the latest "mpr.ko" from the Broadcom download page. (Same problems, no "use_phy_num" tuneable.)
>
> * I retested this hardware with Linux (4.15 and 4.17)
> ** Some shelves could be replugged reliably (ie: 45 disks show up, 45 disks disappear, 45 disks reappear)
> ** On the newest shelf, 2 disks were missing after the replugging (ie: 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log mpt3sas_cm0: "device is not present handle)
>
> * I tried a different controller
> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216) (Firmware 16.00.01.00 or 15.00.00.00)
> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar with 09*))
>    With the new controller everything seems to work on Linux. It might be the old Firmware?...
>    It is better with the new controller on FreeBSD in the sense that I at least get one out of two /dev/sesX devices back. But disks are still missing and are not getting completely cleaned up…

It does sound a bit like a mapping table problem.  Clearing it might help, we’ll see.

> This whole thing is a bit frustrating, especially since up until now I thought that HBAs are kind of "connect and forget" devices. Next step is to set up a separate test environment and try to get it to work there. I will keep you updated and try to provide logs for all FreeBSD related problems.

Thanks for debugging this.  Unfortunately there are a number of ways it can go wrong.  The mapping code has been the source of some problems, sometimes enclosure vendors do the wrong thing, and sometimes there are other bugs.

Ken
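
For reference, the departure checks discussed in this message can be strung together roughly as follows. This is only a sketch: the `run` and `check_departed` function names, the `DRY_RUN` variable, and the `da5` device name are made up for illustration, and only the `dd` and `camcontrol` invocations come from the thread itself.

```shell
#!/bin/sh
# Sketch only: replay the departure checks from this thread for one disk.
# run, check_departed and DRY_RUN are illustrative names, not part of any tool.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

check_departed() {
    dev="$1"    # peripheral name such as da5 (placeholder)

    # A small read forces I/O; on a pulled device it is expected to fail.
    run dd if="/dev/${dev}" of=/dev/null bs=1k count=1

    # Rescan every bus so CAM sends a command to each device again.
    run camcontrol rescan all

    # List what CAM still knows about; a departed disk should be absent.
    run camcontrol devlist -v
}
```

With `DRY_RUN=1` set, `check_departed da5` only prints the three commands. Without it, the commands actually run (as root on the affected host), and the final listing can be compared against the dmesg output to see whether the device has fully gone away.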