Date:      Tue, 3 Jul 2018 10:26:29 -0400
From:      "Kenneth D. Merry" <ken@FreeBSD.ORG>
To:        Oliver Sech <CrimsonThunder@gmx.net>
Cc:        freebsd-scsi@freebsd.org, slm@freebsd.org
Subject:   Re: problems with SAS JBODs 2
Message-ID:  <20180703142629.GF26046@mithlond.kdm.org>
In-Reply-To: <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01>
References:  <trinity-14d18077-ea73-40f6-9e87-d2d4000b1f7e-1530620937871@3c-app-gmx-bs01>

On Tue, Jul 03, 2018 at 14:28:58 +0200, Oliver Sech wrote:
> Hi!
>
> I use FreeBSD for a large ZFS pool (over 1PB) and I recently encountered a lot of problems with the JBODs. Generally everything works fine until I replug the shelves.
>
> When I start with a clean system and attach a single shelf everything seems fine.
> -> 44 disks show up, I can use the enclosure services (sesutil) and the system continues to run without problems.
> Once I disconnect the SAS cable, wait until all devices disappear and reconnect I get all sorts of problems.
> -> a random number of disks show up and the enclosure "ses" devices do not show up
> Once I restart the system I can start over again.
>
> On the server with the large pool there are only certain ports on the HBA that I can use, otherwise disks will be missing after a reboot and my ZFS pool won't go online.
> I tried different firmware on the HBA. I tried the mpr.ko module from the Broadcom site. (I replaced the one in /boot/kernel?)
> I tested all the things above with Linux as the OS and everything seems to work.
>
>
> Is there anything I'm missing? A command that can reset the SAS components?
>
>
> FreeBSD version: 11.1-RELEASE-p11
> HBA: Broadcom LSI 9305-16e (latest firmware)
> JBOD: SC847E2C-R1K28JBOD (two expanders, internally daisy chained)

Steve McConnell (CCed) and I have been corresponding with someone else who
has a problem very similar to yours.

The most likely issue is that the mapping table stored on the card is messed
up.  Can you send dmesg output with the following loader tunable set:

hw.mpr.debug_level=0x203

That will turn on debugging for the mapping code and may show the problem.

If you see messages like this:

mpr0: Attempting to reuse target id 63 handle 0x000b
mpr0: Attempting to reuse target id 64 handle 0x000c
mpr0: Attempting to reuse target id 65 handle 0x000d
mpr0: Attempting to reuse target id 66 handle 0x000e
mpr0: Attempting to reuse target id 67 handle 0x000f
mpr0: Attempting to reuse target id 68 handle 0x0010
mpr0: Attempting to reuse target id 69 handle 0x0011
mpr0: Attempting to reuse target id 70 handle 0x0012
mpr0: Attempting to reuse target id 66 handle 0x000e

It indicates that the mapping code is preventing some of the drives from
fully probing because there are collisions in the table.
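For anyone sifting through a long dmesg capture by hand, the check described above can be roughly sketched in Python. This is only a triage aid, not part of the driver or of the script/utility mentioned below: the function names are made up for illustration, and the regular expression assumes the messages look exactly like the sample lines quoted in this message. It collects the "Attempting to reuse target id" lines and reports which target ids were logged more than once; it does not interpret the mapping table itself.

```python
import re
import sys
from collections import defaultdict

# Pattern modelled on the sample lines quoted in this message (an assumption;
# other driver versions may word the message differently).
REUSE_RE = re.compile(
    r"(mpr\d+): Attempting to reuse target id (\d+) handle (0x[0-9a-fA-F]+)"
)


def find_reuse_messages(lines):
    """Map (controller, target id) -> list of handles seen in reuse messages."""
    seen = defaultdict(list)
    for line in lines:
        match = REUSE_RE.search(line)
        if match:
            controller, target_id, handle = match.groups()
            seen[(controller, int(target_id))].append(handle)
    return dict(seen)


def repeated_targets(seen):
    """Keep only the targets that were reported more than once."""
    return {key: handles for key, handles in seen.items() if len(handles) > 1}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 reuse_check.py dmesg.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        found = find_reuse_messages(log)
    print(f"{len(found)} distinct target ids in reuse messages")
    for (controller, target_id), handles in sorted(repeated_targets(found).items()):
        print(f"{controller}: target id {target_id} logged {len(handles)} times: "
              + ", ".join(handles))
```

Saving the dmesg output to a file first and passing that file name as the only argument keeps the sketch independent of how the log was captured.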

Unfortunately we have not yet fixed the problem in the other situation.
(He is running with multipathing, which could be contributing to the
problem.)

I have a script and utility that will clear the mapping table in the card,
but that hasn't been enough to fix the other situation.  If you do have a
mapping problem, I can give you the script/utility to clear the table and
we can see whether it fixes your problem.

If not, it'll probably have to wait until Steve gets back from vacation.

Ken
-- 
Kenneth Merry
ken@FreeBSD.ORG


