From: Ken Merry
Date: Thu, 12 Jul 2018 09:38:36 -0400
To: Oliver Sech
Cc: Stephen Mcconnell, FreeBSD-scsi
Subject: Re: problems with SAS JBODs 2
Message-Id: <7C1E630B-65AD-4FE8-BFDF-F13068070B5E@freebsd.org>
In-Reply-To: <9e0bf18f-0689-b2a0-1da4-b70c497b2f14@gmx.net>

> On Jul 12, 2018, at 6:00 AM, Oliver Sech wrote:
>
> On 07/11/2018 10:35 PM, Ken Merry wrote:
>> Oliver, what happens when you try to do I/O to the devices that
>> don’t go away after you pull the cable? Does that cause the devices
>> to go away?
>
> I tried to 'dd if=/dev/daX of=/dev/null bs=1k count=1' and at least
> the "da" device disappears.

Ok, that’s good. Can you send the dmesg output and check with
‘camcontrol devlist -v’ to make sure the device has fully gone away?

The reason I ask is that I have spent lots of time over the years
debugging device arrival and departure problems in CAM, GEOM and devfs,
and I want to make sure we aren’t running into any non-SAS related
problems.

>
>> Looking at the mprutil output, it also shows the devices sticking
>> around from the adapter’s standpoint.
>>
>> You can also try a ‘camcontrol rescan all’ or a ‘camcontrol rescan N’
>> (where N is the scbus number shown by ‘camcontrol devlist -v’). That
>> will do some basic probes for each of the devices and should in
>> theory cause them to go away if they aren’t accessible.
>>
>> It seems like the adapter may not be recognizing that the devices in
>> question have gone.
>
>
> I'm pretty sure that I tried this 'camcontrol rescan all' a few times.
> While I'm not sure anymore if that cleans up the non-working devices,
> I'm sure that no new devices were added.

If doing a read from the device with dd makes it go away, ‘camcontrol
rescan all’ should make it go away as well. It sends a command to every
device, and if the mpr(4) driver tells CAM the drive is no longer there,
it’ll get removed.
If it doesn’t cause the device to get removed (and the rescan doesn’t
hang), it means that you’re getting a response from a device that is no
longer physically connected to the machine, which is impossible with
SAS.

>
> Unfortunately I haven't gotten yet to Steve's 'clear controller
> mapping' script but I did a few other things:

Steve’s email made it sound like he was going to send it. I just sent
it to you separately.

> * The last time I tried to upgrade the firmware I had all sorts of
>   problems. "sas3flash" reported bad checksums while flashing some of
>   the files.
>   So I reflashed both controllers with the DOS version of sas3flash.
>   This was basically a challenge in itself because the DOS version of
>   this utility does not seem to run on computers of this decade.
>   (ERROR: Failed to initialize PAL. Exiting program.)
>   The equivalent sas3flash.EFI version seems to be out of date and
>   caused the checksum problems described before.
>   (This time I wiped them before flashing with "sas3flash -o -e 6".)

That is unfortunate… perhaps Steve has some insight.

>
> * I tried to change mpr tuneable "use_phy_num" after that but this has
>   not improved the situation. I will retry and collect logs with
>   Steve's script.

Changed it to what? I think it defaults to 1. Did you try 0?

> * I retried with the latest "mpr.ko" from the Broadcom download page.
>   (Same problems, no "use_phy_num" tuneable.)
>
> * I retested this hardware with Linux (4.15 and 4.17)
> ** Some shelves could be replugged reliably (ie: 45 disks show up, 45
>    disks disappear, 45 disks reappear)
> ** On the newest shelf, 2 disks were missing after the replugging (ie:
>    44 disks show up, 44 disks disappear, 42 disks reappear) (kernel
>    log mpt3sas_cm0: "device is not present handle)
>
> * I tried a different controller
> ** So far I used a Broadcom LSI SAS 9305-16e (Controller: SAS3216)
>    (Firmware 16.00.01.00 or 15.00.00.00)
> ** Yesterday I switched to a new fresh out-of-the-box Broadcom LSI
>    9305-24i (Controller: SAS3224) (Firmware 09.00.00.00 (or something
>    similar with 09*))
> With the new controller everything seems to work on Linux. It might be
> the old Firmware?...
> It is better with the new controller on FreeBSD in the sense that I at
> least get one out of two /dev/sesX devices back. But disks are still
> missing and are not getting completely cleaned up…

It does sound a bit like a mapping table problem. Clearing it might
help, we’ll see.

> This whole thing is a bit frustrating, especially since up until now I
> thought that HBAs are kind of "connect and forget" devices. Next step
> is to set up a separate test environment and try to get it to work
> there. I will keep you updated and try to provide logs for all FreeBSD
> related problems.

Thanks for debugging this. Unfortunately there are a number of ways it
can go wrong. The mapping code has been the source of some problems,
sometimes enclosure vendors do the wrong thing, and sometimes there are
other bugs.

Ken
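
The departure check discussed above (read from the disk, rescan, then
look at the device list) could be scripted roughly as follows. This is
only a sketch, assuming a FreeBSD host with camcontrol(8) and root
privileges. The helper names device_listed and check_departed are made
up for illustration and are not part of any tool. The match relies on
devlist lines ending in a "(passN,daN)" style suffix.

```shell
#!/bin/sh
# Sketch only: device_listed and check_departed are hypothetical helper
# names; camcontrol(8) and dd(1) are the tools named in the thread.

# Succeeds if peripheral name $1 (e.g. da12) appears in the
# "(passN,daN)" suffix of `camcontrol devlist -v` output read from stdin.
device_listed() {
    grep -Eq "[(,]$1[,)]"
}

# Read from the disk, rescan every bus, then report whether CAM still
# lists the peripheral. Needs root on a FreeBSD host.
check_departed() {
    dev="$1"
    # A failing read is expected once the drive is gone, so ignore
    # dd's exit status.
    dd if="/dev/$dev" of=/dev/null bs=1k count=1 2>/dev/null || true
    camcontrol rescan all >/dev/null
    if camcontrol devlist -v | device_listed "$dev"; then
        echo "$dev is still listed by CAM"
    else
        echo "$dev is gone"
    fi
}
```

After sourcing the file, something like `check_departed da12` would be
run for each disk on the shelf that was unplugged. If a disk is still
listed after the rescan, the dmesg output from that moment is the thing
to capture.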