Date:      Tue, 3 Aug 2021 17:27:51 +0200
From:      Franco Fichtner <franco@lastsummer.de>
To:        Kevin Bowling <kevin.bowling@kev009.com>
Cc:        FreeBSD Net <freebsd-net@freebsd.org>
Subject:   Re: igb(4) and VLAN issue?
Message-ID:  <ED4BA1DF-DE8C-4006-9761-5A05A555543C@lastsummer.de>
In-Reply-To: <CAK7dMtCJhKVo8agr_VGbtGHZeKK8_8ip+6bY_yaW45wo42caZQ@mail.gmail.com>
References:  <CAK7dMtCJhKVo8agr_VGbtGHZeKK8_8ip+6bY_yaW45wo42caZQ@mail.gmail.com>

Hi Kevin,

[RESENT TO MAILING LIST AS SUBSCRIBER]

> On 2. Aug 2021, at 7:51 PM, Kevin Bowling <kevin.bowling@kev009.com> wrote:
>
> I caught wind that an igb(4) commit I've done to main and that has
> been in stable/12 for a few months seems to be causing a regression on
> opnsense.  The commit in question is
> https://cgit.freebsd.org/src/commit/?id=eea55de7b10808b86277d7fdbed2d05d3c6db1b2
>
> The report is at:
> https://forum.opnsense.org/index.php?topic=23867.0

Looks like I spoke too soon earlier.  This is a weird one for sure.  :)

So first of all this causes an ifconfig hang for VLAN/LAGG combo
creation, but later reports were coming in about ahci errors and cam
timeouts.  Some reported the instabilities start with using netmap, but
later others confirmed the same for high load scenarios without netmap
in use.

This does not appear to happen when MSI-X is disabled, e.g.:

# sysctl -a | grep dev.igb | grep msix
dev.igb.5.iflib.disable_msix: 1
dev.igb.4.iflib.disable_msix: 1
dev.igb.3.iflib.disable_msix: 1
dev.igb.2.iflib.disable_msix: 1
dev.igb.1.iflib.disable_msix: 1
dev.igb.0.iflib.disable_msix: 1
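
For anyone wanting to try the same workaround, here is a minimal sketch
of how the listing above could be turned into boot-time settings.  The
helper name msix_tunables is made up for illustration, and it is an
assumption that the iflib disable_msix knob has to be set from
/boot/loader.conf (before the driver attaches) to take effect:

```sh
#!/bin/sh
# Hypothetical helper: reads "sysctl -a"-style lines on stdin and prints
# one loader.conf-style entry per igb device, setting disable_msix to 1.
# Lines that are not dev.igb.N.iflib.disable_msix are ignored.
msix_tunables() {
    sed -n 's/^\(dev\.igb\.[0-9][0-9]*\.iflib\.disable_msix\): .*$/\1="1"/p'
}
```

Running "sysctl -a | msix_tunables" only prints the candidate lines;
reviewing them before adding anything to /boot/loader.conf and
rebooting is left to the operator.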

Reports also link this to some form of softraid misbehaving, and they
tend to come from cheaper hardware with particular igb chipsets.

> I haven't heard of this issue elsewhere and cannot replicate it on my
> I210s running main.  I've gone over the code changes line by line
> several times and verified all the logic and register writes and it
> all looks correct to my understanding.  The only hypothesis I have at
> the moment is it may be some subtle timing issue since VLAN changes
> unnecessarily restart the interface on e1000 until I push in a work in
> progress to stop doing that.

I also have no way of reproducing this locally, but the community is
probably willing to give any kernel change a try that would address
the problem without having to back out the commit in question.

> I'd like to see the output of all the processes or at least the
> process configuring the VLANs to see where it is stuck.  Franco, do
> you have the ability to 'control+t' there or otherwise set up a break
> into a debugger?  Stacktraces would be a great start but a core and a
> kernel may be necessary if it isn't obvious.

Let me see if I can deliver on this easily.


Cheers,
Franco



