Date: Thu, 7 Feb 2019 12:51:03 +0100
From: Borja Marcos <borjam@sarenet.es>
To: Karl Denninger <karl@denninger.net>
Cc: freebsd-stable@freebsd.org
Subject: Re: 9211 (LSI/SAS) issues on 11.2-STABLE
Message-ID: <1CCD5D4C-BE41-49DE-AF87-40EBFA8038B1@sarenet.es>
In-Reply-To: <76c295f5-fc2b-2c9f-78b1-163939afb24a@denninger.net>
References: <7bb25f55-fa77-f67e-11f3-b2240b01e25a@denninger.net>
 <b50c527c-e7f7-3e64-af3a-e597ec77c021@denninger.net>
 <9ea70420-0c06-ad9d-e8b7-f9d92fed20d8@denninger.net>
 <57ddc2f4-681c-e0aa-0484-42cee3876a05@denninger.net>
 <1FFC1686-E70F-4649-B170-34F90B773918@sarenet.es>
 <76c295f5-fc2b-2c9f-78b1-163939afb24a@denninger.net>

> On 6 Feb 2019, at 16:34, Karl Denninger <karl@denninger.net> wrote:
>
> On 2/6/2019 09:18, Borja Marcos wrote:
>>>> Number of Hardware Resets has incremented. There are no other errors shown:
>> What is _exactly_ that value? Is it related to the number of resets sent from the HBA
>> _or_ the device resetting by itself?
> Good question. What counts as a "reset"; UNIT ATTENTION is what the
> controller receives but whether that's a power reset, a reset *command*
> from the HBA or a firmware crash (yikes!) in the disk I'm not certain.

In my youth I wrote software for tape drives. After a reset, no matter how it
was initiated (the device itself or the HBA) the device will give you a UNIT
ATTENTION if I remember well (25 years ago).

>>>> I'd throw possible shade at the backplane or cable /but I have already
>>>> swapped both out for spares without any change in behavior./
>> What about the power supply?
>>
> There are multiple other devices and the system board on that supply
> (and thus voltage rails) but it too has been swapped out without
> change. In fact at this point other than the system board and RAM
> (which is ECC, and is showing no errors in the system's BMC log)
> /everything/ in the server case (HBA, SATA expander, backplane, power
> supply and cables) has been swapped for spares. No change in behavior.
>
> Note that with 20.0.7.0 firmware in the HBA instead of a unit attention
> I get a *controller* reset (!!!) which detaches some random number of
> devices from ZFS's point of view before it comes back up (depending on
> what's active at the time) which is potentially catastrophic if it hits
> the system pool. I immediately went back to 19.0.0.0 firmware on the
> HBA; I had upgraded to 20.0.7.0 since there had been good reports of
> stability with it when I first saw this, thinking there was a drive
> change that might have resulted in issues with it when running 19.0
> firmware on the card.

I have a system running 12.0-RELEASE-p1 with a LSI2008, 15 SAS disks and a
SATA SSD and I haven’t seen any problems. This is heavily loaded with just
8 GB of memory and a lot of stuff running.

mps0: <Avago Technologies (LSI) SAS2008> port 0x9000-0x90ff mem 0xdfff0000-0xdfffffff,0xdff80000-0xdffbffff irq 17 at device 0.0 numa-domain 0 on pci4
mps0: Firmware: 20.00.07.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 185c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,IR>
ses0: Enclosure Name: LSILOGIC SASX28 A.0 5021

> This system was completely stable for over a year on 11.1-STABLE and in
> fact hadn't been rebooted or logged a single "event" in over six months;
> the problems started immediately upon upgrade to 11.2-STABLE and
> persists on 12.0-STABLE. The disks in question haven't changed either
> (so it can't be a difference in firmware that is in a newer purchased
> disk, for example.)

But you are right, a panic because of a disk problem points to a bug. As long
as the ZFS pool is usable, trouble with one of its disks should just be logged.
Unless of course the disk is used for swap or the disk failure leads to the
system being unable to complete a page in. Again, it shouldn’t happen.

> I'm thinking perhaps *something* in the codebase change made the HBA ->
> SAS Expander combination trouble where it wasn't before; I've got a
> couple of 16i HBAs on the way which will allow me to remove the SAS
> expander to see if that causes the problem to disappear. I've got a
> bunch of these Lenovo expanders and have been using them without any
> sort of trouble in multiple machines; it's only when I went beyond 11.1
> that I started having trouble with them.

It might be some backplane misbehavior triggering a bug, complicated.

Borja.