From owner-freebsd-scsi@freebsd.org Wed Jul 25 10:24:27 2018
From: Oliver Sech
To: Stephen Mcconnell, FreeBSD-scsi
Subject: Re: problems with SAS JBODs 2
Date: Wed, 25 Jul 2018 12:24:21 +0200
Message-ID: <77b55ca6-25ce-3b26-e2f6-b0702a49ab28@gmx.net>
In-Reply-To: <0f26466617df38fd998dc87948b27273@mail.gmail.com>

I ran the clear_dpm.sh script and changed the value you suggested, then
rebooted and retested. As far as I can tell, there is no difference.

I tried the menu option (99. Reset port) in lsiutil, and this helps with the
missing devices. After resetting the port I get all my disks and ses devs
again.

Read NVRAM or current values? [0=NVRAM, 1=Current, default is 0]
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000

On 07/24/2018 10:22 PM, Stephen Mcconnell wrote:
> Oliver, can you try changing the mapping mode on the controller? I think
> you're using Enclosure/Slot Mapping and I want to see what happens with
> Device Persistent Mapping. To do that, follow these steps:
> 1. Run Ken’s script to clear the DPM entries
> 2. Use LSIUtil to change the mapping mode in IOC Page 8.
> Command 9, Page Type 1, Page Number 8. If you see 00000002 at offset 0x0C,
> you're using Enclosure/Slot Mapping and I'd like you to change this. You
> will be asked if you want to make changes. Select ‘yes’ and then change
> offset 0x0C to 00000001 (you might have to type C instead of 0x0C for the
> offset). Just use the default setting to change NVRAM.
> 3. Reboot and see what happens and let me know how it goes.
>
>
> Steve
>
>> -----Original Message-----
>> From: owner-freebsd-scsi@freebsd.org
>> [mailto:owner-freebsd-scsi@freebsd.org] On Behalf Of Oliver Sech
>> Sent: Tuesday, July 24, 2018 12:23 PM
>> To: FreeBSD-scsi
>> Subject: Re: problems with SAS JBODs 2
>>
>> update 2: I continued to test with more and different hardware.
>>
>> tested with a LSI SAS9207-8e HBA:
>> * after disconnect all devices properly disappear (/dev/daX, /dev/ses);
>>   no rescans or writing necessary
>> * no more targets in mpsutil (not mprutil)
>> * after reconnect all disks and all ses devs appear!
>>
>> tested with hardware raid LSI SAS 9286CV-8e
>> * no problems with the shelf/sas in different configurations
>> * switching the controller and importing configuration works reliably
>>
>> So far I think there is a problem with the mpr driver and I'm quite
>> confident that it does affect other people.
>> With a simple configuration it is probably not immediately noticeable, as
>> everything seems to work after the first connect/boot.
>> It probably gets scarier for people with multipathing and big SAS chains,
>> I guess...
>>
>> I will downgrade to SAS2 HBAs shortly as I'm running out of space. If
>> there is anything I can help with while I still have hardware in the lab,
>> let me know.
>>
>> Oliver
>>
>> On 07/23/2018 04:14 PM, Oliver Sech wrote:
>>> Sorry for the delay. I moved to a different office and could not focus
>>> on this issue last week.
>>>
>>> I tested all of the hardware with different drivers and firmware on
>>> Linux to make sure this is not a hardware problem:
>>> * Firmware 09.00.101.00 + Driver 26.000.00.00 (compiled) -> GOOD
>>> * Firmware 09.00.101.00 + Driver 12.100.00.00 (default kernel) -> GOOD
>>> * Firmware 16.00.01.00 + Driver 26.000.00.00 -> BAD (42 out of 44 disks
>>>   after reconnect)
>>> * Firmware 16.00.01.00 + Driver 12.100.00.00 -> BAD (42 out of 44 disks
>>>   after reconnect)
>>>
>>> I tested a different HBA with an old firmware as well and there were no
>>> issues. Only with the latest FW are disks missing after a reconnect,
>>> with the error "mpt3sas_cm0: device is not present handle".
>>> I don't know yet how the firmware behaves for the versions between
>>> 09.00.000.00 and 16...
>>>
>>> Additional Info/Changes:
>>> * Upgraded the test system to 11.2 as suggested on the mailing list.
>>>   -> No Change
>>> * "camcontrol rescan all" removes the devices that are still present
>>>   after the cable has been removed. "camcontrol devlist -v" does not
>>>   show them anymore.
>>>
>>>
>>> Setting the driver "use_phy_num" to 0 and using the clearDPM script
>>> between connects does not help. In fact, I do not see any different
>>> behavior at all.
>>> I reflashed the controller multiple times and erased everything except
>>> the "manufacturing" area to make sure that no previous settings are kept.
>>> The only thing I know of that "fixes" the missing drives is to reboot
>>> the server.
>>>
>>> A (similar?) problem also occurs once I start the server with all 6 disk
>>> shelves (11 backplanes, 17 expanders, 200+ disks). Everything comes up
>>> properly with 5 shelves; once I offline connect the 6th shelf, some
>>> random disks are missing and I can no longer import the ZFS pool.
>>>
>>> The following logs were collected with the very old FW 09.00.101.00 that
>>> worked on Linux.
>>> Logs: https://www.dropbox.com/s/6nw88rt6ajh713s/freebsd_sas3.zip?dl=0
>>>
>>> best regards,
>>> Oliver
>>>
>>> On 07/12/2018 03:38 PM, Ken Merry wrote:
>>>>
>>>>> On Jul 12, 2018, at 6:00 AM, Oliver Sech wrote:
>>>>>
>>>>> On 07/11/2018 10:35 PM, Ken Merry wrote:
>>>>>> Oliver, what happens when you try to do I/O to the devices that don’t
>>>>>> go away after you pull the cable? Does that cause the devices to go
>>>>>> away?
>>>>>
>>>>> I tried 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least the
>>>>> "da" device disappears.
>>>>
>>>> Ok, that’s good. Can you send the dmesg output and check with
>>>> ‘camcontrol devlist -v’ to make sure the device has fully gone away?
>>>>
>>>> The reason I ask is that I have spent lots of time over the years
>>>> debugging device arrival and departure problems in CAM, GEOM and devfs,
>>>> and I want to make sure we aren’t running into any non-SAS related
>>>> problems.
>>>>
>>>>>
>>>>>> Looking at the mprutil output, it also shows the devices sticking
>>>>>> around from the adapter’s standpoint.
>>>>>>
>>>>>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’
>>>>>> (where N is the scbus number shown by ‘camcontrol devlist -v’). That
>>>>>> will do some basic probes for each of the devices and should in
>>>>>> theory cause them to go away if they aren’t accessible.
>>>>>>
>>>>>> It seems like the adapter may not be recognizing that the devices in
>>>>>> question have gone.
>>>>>
>>>>>
>>>>> I'm pretty sure that I tried this 'camcontrol rescan all' a few times.
>>>>> While I'm not sure anymore whether that cleans up the non-working
>>>>> devices, I'm sure that no new devices were added.
>>>>
>>>> If doing a read from the device with dd makes it go away, ‘camcontrol
>>>> rescan all’ should make it go away as well. It sends a command to every
>>>> device, and if the mpr(4) driver tells CAM the drive is no longer there,
>>>> it’ll get removed.
>>>>
>>>> If it doesn’t cause the device to get removed (and the rescan doesn’t
>>>> hang), it means that you’re getting a response from a device that is no
>>>> longer physically connected to the machine, which is impossible with
>>>> SAS.
>>>>
>>>>>
>>>>> Unfortunately I haven't yet gotten to Steve's 'clear controller
>>>>> mapping' script, but I did a few other things:
>>>>
>>>> Steve’s email made it sound like he was going to send it. I just sent
>>>> it to you separately.
>>>>
>>>>> * The last time I tried to upgrade the firmware I had all sorts of
>>>>> problems. "sas3flash" reported bad checksums while flashing some of
>>>>> the files.
>>>>> So I reflashed both controllers with the DOS version of sas3flash.
>>>>> This was basically a challenge in itself because the DOS version of
>>>>> this utility does not seem to run on computers of this decade.
>>>>> (ERROR: Failed to initialize PAL. Exiting program.)
>>>>> The equivalent sas3flash.EFI version seems to be out of date and
>>>>> caused the checksum problems described before.
>>>>> (This time I wiped them before flashing with "sas3flash -o -e 6".)
>>>>
>>>> That is unfortunate…perhaps Steve has some insight.
>>>>
>>>>>
>>>>> * I tried to change the mpr tunable "use_phy_num" after that, but this
>>>>> has not improved the situation. I will retry and collect logs with
>>>>> Steve's script.
>>>>
>>>> Changed it to what? I think it defaults to 1. Did you try 0?
>>>>
>>>>> * I retried with the latest "mpr.ko" from the Broadcom download page.
>>>>> (Same problems, no "use_phy_num" tunable.)
>>>>>
>>>>> * I retested this hardware with Linux (4.15 and 4.17)
>>>>> ** Some shelves could be replugged reliably (i.e., 45 disks show up,
>>>>> 45 disks disappear, 45 disks reappear)
>>>>> ** On the newest shelf, 2 disks were missing after replugging (i.e.,
>>>>> 44 disks show up, 44 disks disappear, 42 disks reappear) (kernel log:
>>>>> "mpt3sas_cm0: device is not present handle")
>>>>>
>>>>> * I tried a different controller
>>>>> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
>>>>> (Firmware 16.00.01.00 or 15.00.00.00)
>>>>> ** Yesterday I switched to a fresh out-of-the-box Broadcom LSI
>>>>> 9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something
>>>>> similar with 09*))
>>>>> With the new controller everything seems to work on Linux. It might be
>>>>> the old firmware?...
>>>>> It is better with the new controller on FreeBSD in the sense that I at
>>>>> least get one out of two /dev/sesX devices back. But disks are still
>>>>> missing and are not getting completely cleaned up…
>>>>
>>>> It does sound a bit like a mapping table problem. Clearing it might
>>>> help, we’ll see.
>>>>
>>>>> This whole thing is a bit frustrating, especially since up until now I
>>>>> thought that HBAs are kind of "connect and forget" devices. Next step
>>>>> is to set up a separate test environment and try to get it to work
>>>>> there. I will keep you updated and try to provide logs for all FreeBSD
>>>>> related problems.
>>>>
>>>> Thanks for debugging this. Unfortunately there are a number of ways it
>>>> can go wrong. The mapping code has been the source of some problems,
>>>> sometimes enclosure vendors do the wrong thing, and sometimes there are
>>>> other bugs.
>>>>
>>>> Ken
>>>>
>>> _______________________________________________
>>> freebsd-scsi@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
>>> To unsubscribe, send any mail to "freebsd-scsi-unsubscribe@freebsd.org"
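
For anyone repeating the IOC Page 8 check discussed in this thread, here is a rough Python sketch of reading the word at offset 0x0C out of an lsiutil-style page dump. It is only an illustration: the function names are made up, the dump format is assumed to be the "offset : value" pairs shown in this thread, and the meaning of the two values (00000001 = Device Persistent Mapping, 00000002 = Enclosure/Slot Mapping) is taken solely from Steve's instructions, not from the MPI specification. It does not decode individual flag bits.

```python
import re

# Values of the word at offset 0x0C, as described in this thread only.
# (Assumption: no attempt is made to decode individual flag bits.)
MAPPING_MODES = {
    0x00000001: "Device Persistent Mapping",
    0x00000002: "Enclosure/Slot Mapping",
}

# Matches pairs such as "000c : 00000001" (4 hex digits, colon, 8 hex digits).
PAIR_RE = re.compile(r"\b([0-9a-fA-F]{4})\s*:\s*([0-9a-fA-F]{8})\b")


def parse_page_dump(text):
    """Return a dict {offset: value} for every 'offset : value' pair found."""
    return {int(off, 16): int(val, 16) for off, val in PAIR_RE.findall(text)}


def mapping_mode(words):
    """Describe the word at offset 0x0C using the thread's two known values."""
    value = words.get(0x0C)
    if value is None:
        return "offset 0x0C not found"
    return MAPPING_MODES.get(value, "unknown (0x%08x)" % value)


# The dump posted in this thread.
DUMP = """
0000 : 21080600
0004 : 00000001
0008 : 00180080
000c : 00000001
0010 : 00000000
0014 : 00000000
"""

if __name__ == "__main__":
    print(mapping_mode(parse_page_dump(DUMP)))
```

Run against the dump above, this reports the mode that Steve asked to switch to, which matches the statement that the value was changed as suggested.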