Date: Tue, 15 Mar 2022 01:02:30 +0100
From: Michael Gmelin <grembo@freebsd.org>
To: Kristof Provost <kp@FreeBSD.org>
Cc: Michael Gmelin <grembo@freebsd.org>, "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>, Johan Hendriks <joh.hendriks@gmail.com>, "Patrick M. Hausen" <hausen@punkt.de>, freebsd-net <freebsd-net@freebsd.org>
Subject: Re: epair and vnet jail loose connection.
Message-ID: <20220315010230.6083dd72.grembo@freebsd.org>
In-Reply-To: <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org>
References: <797A280E-5DF2-4276-BB72-E4E1053A19FA@lists.zabbadoz.net> <6086BA6D-3D54-4851-B636-3B32FACB35E9@freebsd.org> <3B5E2D6F-5444-4448-B7C3-704E294368C3@lists.zabbadoz.net> <20220314144451.35f803a9.grembo@freebsd.org> <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org>
On Mon, 14 Mar 2022 09:09:49 -0600
Kristof Provost <kp@FreeBSD.org> wrote:

> On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
> > On Sun, 13 Mar 2022 17:53:44 +0000
> > "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
> >
> >> On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
> >>
> >>>> On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
> >>>> <bzeeb-lists@lists.zabbadoz.net> wrote:
> >>>>
> >>>> On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
> >>>>> It's important to point out that this only happens with
> >>>>> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.
> >>>>>
> >>>>> This perfectly fits into the picture, since, as pointed out by
> >>>>> Johan, the first commit that is affected[0] is about multicore
> >>>>> support.
> >>>>
> >>>> Ignore my ignorance, what is the default of net.isr.maxthreads
> >>>> and net.isr.bindthreads (in stable/13) these days?
> >>>>
> >>>
> >>> My tests were on CURRENT and I’m afk, but according to cgit[0][1],
> >>> max is 1 and bind is 0.
> >>>
> >>> Would it make sense to repeat the test with max=-1?
> >>
> >> I’d say yes, I’d also bind, but that’s just me.
> >>
> >> I would almost assume Kristof running with -1 by default (but he
> >> can chime in on that).
> >
> > I tried various configuration permutations, all with ncpu=2:
> >
> > - 14.0-CURRENT #0 main-n253697-f1d450ddee6
> > - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
> > - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
> > - net.isr.bindthreads: -1, 0, 1, 2
> > - net.isr.dispatch: direct, deferred
> >
> > All resulting in the same behavior (hang after a few seconds). They
> > all work ok when running on a single core instance (threads=1 in
> > this case).
> >
> > I also ran the same test on 13.0-RELEASE-p7 for
> > comparison (unsurprisingly, it's ok).
> > I placed the script to reproduce the issue on freefall for your
> > convenience, so running it is as simple as:
> >
> > fetch https://people.freebsd.org/~grembo/hang_epair.sh
> > # inspect content
> > sh hang_epair.sh
> >
> > or, if you feel lucky
> >
> > fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
> >
> With that script I can also reproduce the problem.
>
> I’ve experimented with this hack:
>
> diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
> index c39434b31b9f..1e6bb07ccc4e 100644
> --- a/sys/net/if_epair.c
> +++ b/sys/net/if_epair.c
> @@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
>
>  	case SIOCSIFMEDIA:
>  	case SIOCGIFMEDIA:
> +		printf("KP: %s() SIOCGIFMEDIA\n", __func__);
>  		sc = ifp->if_softc;
> +		taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task);
> +
>  		error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
>  		break;
>
> That kicks the receive code whenever I `ifconfig epair0a`, and I see
> a little more traffic every time I do so.
>
> That suggests pretty strongly that there’s an issue with how we
> dispatch work to the handler thread. So presumably there’s a race
> between epair_menq() and epair_tx_start_deferred().
>
> epair_menq() tries to only enqueue the receive work if there’s
> nothing in the buf_ring, on the grounds that if there is, the
> previous packet already scheduled the work. Clearly there’s an issue
> there.
>
> I’ll try to dig into that in the next few days.

Hi Kristof,

This sounds plausible. I spent a few hours getting familiar with the
epair code and came up with a patch that seems to fix the issue at
hand (both with and without RSS). I'm not certain that it is a good
solution, especially in terms of performance, but I wanted to share
it with you anyway, maybe it helps:

https://people.freebsd.org/~grembo/epair.patch

Best
Michael

-- 
Michael Gmelin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20220315010230.6083dd72.grembo>
