Date: Wed, 25 Jul 2018 12:24:21 +0200
From: Oliver Sech <crimsonthunder@gmx.net>
To: Stephen Mcconnell <stephen.mcconnell@broadcom.com>, FreeBSD-scsi <freebsd-scsi@freebsd.org>
Subject: Re: problems with SAS JBODs 2
Message-ID: <77b55ca6-25ce-3b26-e2f6-b0702a49ab28@gmx.net>
In-Reply-To: <0f26466617df38fd998dc87948b27273@mail.gmail.com>
References: <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01>
 <CAOtMX2h8r31AeNCKyckK2P0VLn1CKFogo9bWom2So1x2ngpa4A@mail.gmail.com>
 <237f77ab-89e2-188b-b2b1-84c6d88609b0@gmx.net>
 <b785fe02-9242-c95f-56cb-2130f90e17f5@gmx.net>
 <3caf8ccd6fde8cfc4db25bae5327c46b@mail.gmail.com>
 <0af047d477d15ec364140653bd967c89@mail.gmail.com>
 <54B10B7C-CDCE-4428-B584-59CE8F38B120@freebsd.org>
 <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net>
 <7C1E630B-65AD-4FE8-BFDF-F13068070B5E@freebsd.org>
 <6e0b8652-f227-271e-aeb4-a868ba6b90e2@gmx.net>
 <530b3e8e-4d76-e601-dd74-0ab6a06ebe25@gmx.net>
 <0f26466617df38fd998dc87948b27273@mail.gmail.com>

I ran the clear_dpm.sh script and changed the value you suggested. Rebooted and retested. As far as I can tell there is no difference.

I tried the menu option (99. Reset port) in lsiutil and this helps with missing devices. After resetting the port I get all my disks and ses devs again.

Read NVRAM or current values? [0=NVRAM, 1=Current, default is 0]
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000

On 07/24/2018 10:22 PM, Stephen Mcconnell wrote:
> Oliver, can you try changing the mapping mode on the controller? I think
> you're using Enclosure/Slot Mapping and I want to see what happens with
> Device Persistent Mapping. To do that, follow these steps:
> 1. Run Ken’s script to clear the DPM entries
> 2. Use LSIUtil to change the mapping mode in IOC Page 8. Command 9, Page
> Type 1, Page Number 8. If you see 0000002 at offset 0x0C you're using
> Enclosure/Slot Mapping and I'd like you to change this. You will be asked if
> you want to make changes. Select ‘yes’ and then change offset 0x0C to
> 00000001 (you might have to type C instead of 0x0C for the offset). Just use
> the default setting to change NVRAM.
> 3. Reboot and see what happens and let me know how it goes.
>
>
> Steve
>
>> -----Original Message-----
>> From: owner-freebsd-scsi@freebsd.org [mailto:owner-freebsd-scsi@freebsd.org] On Behalf Of Oliver Sech
>> Sent: Tuesday, July 24, 2018 12:23 PM
>> To: FreeBSD-scsi
>> Subject: Re: problems with SAS JBODs 2
>>
>> update 2: I continued to test with more and different hardware.
>>
>> tested with a LSI SAS9207-8e HBA:
>> * after disconnect all devices properly disappear /dev/daX /dev/ses
>>   no rescans or writing necessary
>> * no more targets in mpsutil (not mprutil)
>> * after reconnect all disks and all ses devs appear!
>>
>> tested with hardware raid LSI SAS 9286CV-8e
>> * no problems with the shelf/sas in different configurations
>> * switching the controller and importing configuration works reliably
>>
>> So far I think there is a problem with the mpr driver and I'm quite confident that it does affect other people.
>> With a simple configuration it is probably not immediately noticeable as everything seems to work after the first connect/boot.
>> It probably gets scarier for people with multipathing and big SAS chains I guess...
>>
>> I will downgrade to SAS2 HBAs shortly as I'm running out of space. If there is anything I can help with while I still have hardware in the lab let me know.
>>
>> Oliver
>>
>> On 07/23/2018 04:14 PM, Oliver Sech wrote:
>>> Sorry for the delay. I moved to a different office and could not focus on this issue last week.
>>>
>>> I tested all of the hardware with different drivers and firmware on Linux to make sure this is not a hardware problem:
>>> * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
>>> * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
>>> * Firmware 16.00.01.00 + Driver 26.000.00.00 -> BAD (42 out of 44 disks after reconnect)
>>> * Firmware 16.00.01.00 + Driver 12.100.00.00 -> BAD (42 out of 44 disks after reconnect)
>>>
>>> I tested a different HBA with an old firmware as well and there were no issues. Only with the latest FW disks are missing after a reconnect with the error "mpt3sas_cm0: "device is not present handle"
>>> I don't know yet how different Firmware behaves between version 09.00.000.00 and 16...
>>>
>>> Additional Info/Changes:
>>> * Upgraded testsystem to 11.2 as suggested in the mailing list. -> No Change
>>> * "camcontrol rescan all" removes the devices that are still present after the cable has been removed. "camcontrol devlist -v" does not show them anymore
>>>
>>>
>>> Setting the driver "use_phy_num" to 0 and using the clearDPM script between connects does not help. In fact I do not see a different behavior at all?
>>> I reflashed the controller multiple times and erased everything except the "manufacturing" area to make sure that no previous settings are kept.
>>> The only thing I know that "fixes" the missing drives is to reboot the server.
>>>
>>> A (similar?) problem also occurs once I start the server with all 6 disk shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up properly with 5 shelves, once I offline connect the 6th shelf, then some random disks are missing and I can no longer import the ZFS pool.
>>>
>>> The following logs were collected with the very old FW 09.00.101.00 that worked on Linux.
>>> Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
>>>
>>> best regards,
>>> Oliver
>>>
>>> On 07/12/2018 03:38 PM, Ken Merry wrote:
>>>>
>>>>> On Jul 12, 2018, at 6:00 AM, Oliver Sech <crimsonthunder@gmx.net> wrote:
>>>>>
>>>>> On 07/11/2018 10:35 PM, Ken Merry wrote:
>>>>>> Oliver, what happens when you try to do I/O to the devices that don’t go away after you pull the cable? Does that cause the devices to go away?
>>>>>
>>>>> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the "da" device disappears.
>>>>
>>>> Ok, that’s good. Can you send the dmesg output and check with ‘camcontrol devlist -v’ to make sure the device has fully gone away?
>>>>
>>>> The reason I ask is that I have spent lots of time over the years debugging device arrival and departure problems in CAM, GEOM and devfs, and I want to make sure we aren’t running into any non-SAS related problems.
>>>>
>>>>>
>>>>>> Looking at the mprutil output, it also shows the devices sticking around from the adapter’s standpoint.
>>>>>>
>>>>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’ (where N is the scbus number shown by ‘camcontrol devlist -v’). That will do some basic probes for each of the devices and should in theory cause them to go away if they aren’t accessible.
>>>>>>
>>>>>> It seems like the adapter may not be recognizing that the devices in question have gone.
>>>>>
>>>>>
>>>>> I'm pretty sure that I tried this 'camcontrol rescan all' a few times. While I'm not sure anymore if that cleans up the non-working devices, I'm sure that no new devices were added.
>>>>
>>>> If doing a read from the device with dd makes it go away, ‘camcontrol rescan all’ should make it go away as well. It sends a command to every device, and if the mpr(4) driver tells CAM the drive is no longer there, it’ll get removed.
>>>>
>>>> If it doesn’t cause the device to get removed (and the rescan doesn’t hang), it means that you’re getting a response from a device that is no longer physically connected to the machine, which is impossible with SAS.
>>>>
>>>>>
>>>>> Unfortunately I haven't gotten yet to Steve's 'clear controller mapping' script but I did a few other things:
>>>>
>>>> Steve’s email made it sound like he was going to send it. I just sent it to you separately.
>>>>
>>>>> * The last time I tried to upgrade the firmware I had all sorts of problems. "sas3flash" reported bad checksums while flashing some of the files.
>>>>> So I reflashed both controllers with the DOS version of sas3flash. This was basically a challenge in itself because the DOS version of this utility does not seem to run on computers of this decade. (ERROR: Failed to initialize PAL. Exiting program.)
>>>>> The equivalent sas3flash.EFI version seems to be out of date and caused the checksum problems described before.
>>>>> (This time I wiped them before flashing with "sas3flash -o -e 6”.)
>>>>
>>>> That is unfortunate…perhaps Steve has some insight.
>>>>
>>>>>
>>>>> * I tried to change mpr tuneable "use_phy_num" after that but this has not improved the situation. I will retry and collect logs with Steve's script.
>>>>
>>>> Changed it to what? I think it defaults to 1. Did you try 0?
>>>>
>>>>> * I retried with the latest "mpr.ko" from the broadcom download page. (Same problems, no "use_phy_num" tuneable.)
>>>>>
>>>>> * I retested this hardware with Linux (4.15 and 4.17)
>>>>> ** Some shelves could be replugged reliably (ie: 45 disks show up, 45 disks disappear, 45 disks reappear)
>>>>> ** On the newest shelf 2 disks were missing after the replugging (ie: 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log mpt3sas_cm0: "device is not present handle)
>>>>>
>>>>> * I tried a different controller
>>>>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216) (Firmware 16.00.01.00 or 15.00.00.00)
>>>>> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something similar with 09*))
>>>>> With the new controller everything seems to work on Linux. It might be the old Firmware?...
>>>>> It is better with the new controller on FreeBSD in the sense that I at least get one out of two /dev/sesX devices back. But disks are still missing and are not getting completely cleaned up…
>>>>
>>>> It does sound a bit like a mapping table problem. Clearing it might help, we’ll see.
>>>>
>>>>> This whole thing is a bit frustrating, especially since up until now I thought that HBAs are kind of "connect and forget" devices. Next step is to set up a separate test environment and try to get it to work there. I will keep you updated and try to provide logs for all FreeBSD related problems.
>>>>
>>>> Thanks for debugging this. Unfortunately there are a number of ways it can go wrong. The mapping code has been the source of some problems, sometimes enclosure vendors do the wrong thing, and sometimes there are other bugs.
>>>>
>>>> Ken
>>>>
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
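
For anyone following along with the IOC Page 8 dump at the top of this message: a minimal sketch of checking the word at offset 0x0C against the two values Steve names in the quoted steps (00000001 for Device Persistent Mapping, 00000002 for Enclosure/Slot Mapping). The function names and the MAPPING_MODES table are made up for illustration, and the sketch compares the whole 32-bit word as the thread does; the word may well carry other flag bits that are not decoded here.

```python
import re

# Values for offset 0x0C as described in this thread only (assumption:
# the whole word is compared; other flag bits are not decoded).
MAPPING_MODES = {
    0x00000001: "Device Persistent Mapping",
    0x00000002: "Enclosure/Slot Mapping",
}

def parse_page_dump(text):
    """Return {offset: value} from lines such as '000c : 00000001'."""
    words = {}
    pattern = r"\b([0-9a-fA-F]{4})\s*:\s*([0-9a-fA-F]{8})\b"
    for offset, value in re.findall(pattern, text):
        words[int(offset, 16)] = int(value, 16)
    return words

def mapping_mode(words):
    """Describe the word at offset 0x0C using the table above."""
    value = words.get(0x0C)
    if value is None:
        return "offset 0x0C not present in dump"
    return MAPPING_MODES.get(value, "unknown (0x%08x)" % value)

# The page contents pasted from lsiutil earlier in this message.
DUMP = """
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
"""

if __name__ == "__main__":
    print(mapping_mode(parse_page_dump(DUMP)))  # prints "Device Persistent Mapping"
```

Run against the dump above, this reports Device Persistent Mapping, which matches the change requested in step 2.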