Date:      Thu, 7 Feb 2019 12:51:03 +0100
From:      Borja Marcos <borjam@sarenet.es>
To:        Karl Denninger <karl@denninger.net>
Cc:        freebsd-stable@freebsd.org
Subject:   Re: 9211 (LSI/SAS) issues on 11.2-STABLE
Message-ID:  <1CCD5D4C-BE41-49DE-AF87-40EBFA8038B1@sarenet.es>
In-Reply-To: <76c295f5-fc2b-2c9f-78b1-163939afb24a@denninger.net>
References:  <7bb25f55-fa77-f67e-11f3-b2240b01e25a@denninger.net> <b50c527c-e7f7-3e64-af3a-e597ec77c021@denninger.net> <9ea70420-0c06-ad9d-e8b7-f9d92fed20d8@denninger.net> <57ddc2f4-681c-e0aa-0484-42cee3876a05@denninger.net> <1FFC1686-E70F-4649-B170-34F90B773918@sarenet.es> <76c295f5-fc2b-2c9f-78b1-163939afb24a@denninger.net>



> On 6 Feb 2019, at 16:34, Karl Denninger <karl@denninger.net> wrote:
>
> On 2/6/2019 09:18, Borja Marcos wrote:
>>>> Number of Hardware Resets has incremented.  There are no other errors shown:
>> What is _exactly_ that value? Is it related to the number of resets sent from the HBA
>> _or_ the device resetting by itself?
> Good question.  What counts as a "reset"; UNIT ATTENTION is what the
> controller receives but whether that's a power reset, a reset *command*
> from the HBA or a firmware crash (yikes!) in the disk I'm not certain.

In my youth I wrote software for tape drives. After a reset, no matter how
it was initiated (by the device itself or by the HBA), the device will give
you a UNIT ATTENTION, if I remember correctly (that was 25 years ago).
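
On the question of what kind of reset it was: the sense data that comes
with a UNIT ATTENTION (sense key 0x06) carries an additional sense code,
and under ASC 0x29 the qualifier says which sort of reset the device
believes it saw. A minimal sketch of decoding that, assuming you can get at
the raw sense bytes somehow (the function names here are made up for
illustration, and whether a given drive or expander reports a specific
qualifier rather than the generic 0x00 is up to its firmware):

```python
# ASCQ values under ASC 0x29, as listed in the SCSI additional sense code table.
RESET_CAUSES = {
    0x00: "power on, reset, or bus device reset occurred",
    0x01: "power on occurred",
    0x02: "SCSI bus reset occurred",
    0x03: "bus device reset function occurred",
    0x04: "device internal reset",
    0x07: "I_T nexus loss occurred",
}

UNIT_ATTENTION = 0x06


def decode_sense(sense: bytes):
    """Return (sense_key, asc, ascq) from fixed or descriptor format sense data."""
    if not sense:
        raise ValueError("empty sense buffer")
    code = sense[0] & 0x7F
    if code in (0x70, 0x71):  # fixed format: key in byte 2, ASC/ASCQ in bytes 12/13
        if len(sense) < 14:
            raise ValueError("fixed format sense data too short")
        return sense[2] & 0x0F, sense[12], sense[13]
    if code in (0x72, 0x73):  # descriptor format: key, ASC, ASCQ in bytes 1..3
        if len(sense) < 4:
            raise ValueError("descriptor format sense data too short")
        return sense[1] & 0x0F, sense[2], sense[3]
    raise ValueError(f"unknown sense response code 0x{code:02x}")


def reset_cause(sense: bytes):
    """Describe the reset behind a UNIT ATTENTION, or None if it is not one."""
    key, asc, ascq = decode_sense(sense)
    if key != UNIT_ATTENTION or asc != 0x29:
        return None
    return RESET_CAUSES.get(ascq, f"unlisted reset cause (ASCQ 0x{ascq:02x})")
```

A qualifier of 0x04 (device internal reset) would point at the disk
itself, while 0x02 or 0x03 would point at something upstream of it.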


>>>> I'd throw possible shade at the backplane or cable /but I have already
>>>> swapped both out for spares without any change in behavior./
>> What about the power supply?
>>
> There are multiple other devices and the system board on that supply
> (and thus voltage rails) but it too has been swapped out without
> change.  In fact at this point other than the system board and RAM
> (which is ECC, and is showing no errors in the system's BMC log)
> /everything /in the server case (HBA, SATA expander, backplane, power
> supply and cables) has been swapped for spares.  No change in behavior.
>
> Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
> I get a *controller* reset (!!!) which detaches some random number of
> devices from ZFS's point of view before it comes back up (depending on
> what's active at the time) which is potentially catastrophic if it hits
> the system pool.  I immediately went back to 19.0.0.0 firmware on the
> HBA; I had upgraded to 20.0.7.0 since there had been good reports of
> stability with it when I first saw this, thinking there was a drive
> change that might have resulted in issues with it when running 19.0
> firmware on the card.

I have a system running 12.0-RELEASE-p1 with an LSI2008, 15 SAS disks and
a SATA SSD, and I haven't seen any problems. It is heavily loaded, with
just 8 GB of memory and a lot of stuff running.

mps0: <Avago Technologies (LSI) SAS2008> port 0x9000-0x90ff mem 0xdfff0000-0xdfffffff,0xdff80000-0xdffbffff irq 17 at device 0.0 numa-domain 0 on pci4
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>

ses0:
	Enclosure Name: LSILOGIC SASX28 A.0 5021
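
When comparing several machines it can help to pull the firmware and
driver versions out of the boot messages mechanically. A minimal sketch,
assuming the "Firmware: ..., Driver: ..." line format shown above (the
function name is made up for illustration, and other driver versions may
print the line differently):

```python
import re

# Matches lines such as "mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd".
FIRMWARE_RE = re.compile(
    r"^(?P<dev>mp[sr]\d+): Firmware: (?P<fw>\d+(?:\.\d+)*), Driver: (?P<drv>\S+)$"
)


def parse_firmware_line(line: str):
    """Return (device, firmware_tuple, driver_string), or None if no match."""
    match = FIRMWARE_RE.match(line.strip())
    if match is None:
        return None
    # Turn "20.00.07.00" into (20, 0, 7, 0) so versions compare numerically.
    firmware = tuple(int(part) for part in match.group("fw").split("."))
    return match.group("dev"), firmware, match.group("drv")
```

Tuples compare element by element, so `firmware >= (20, 0, 7, 0)` is
enough to separate the 19.x and 20.x cases discussed in this thread.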

> This system was completely stable for over a year on 11.1-STABLE and in
> fact hadn't been rebooted or logged a single "event" in over six months;
> the problems started immediately upon upgrade to 11.2-STABLE and
> persists on 12.0-STABLE.  The disks in question haven't changed either
> (so it can't be a difference in firmware that is in a newer purchased
> disk, for example.)

But you are right, a panic because of a disk problem points to a bug. As
long as the ZFS pool is usable, trouble with one of its disks should just
be logged, unless of course the disk is used for swap or the disk failure
leaves the system unable to complete a page-in. Again, it shouldn't happen.

> I'm thinking perhaps *something* in the codebase change made the HBA ->
> SAS Expander combination trouble where it wasn't before; I've got a
> couple of 16i HBAs on the way which will allow me to remove the SAS
> expander to see if that causes the problem to disappear.  I've got a
> bunch of these Lenovo expanders and have been using them without any
> sort of trouble in multiple machines; it's only when I went beyond 11.1
> that I started having trouble with them.

It might be some backplane misbehavior triggering a bug; complicated to
pin down.

Borja.





