Date: Mon, 14 Mar 2022 16:38:16 +0100 From: Johan Hendriks <joh.hendriks@gmail.com> To: Kristof Provost <kp@FreeBSD.org>, Michael Gmelin <grembo@freebsd.org> Cc: "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>, "Patrick M. Hausen" <hausen@punkt.de>, freeBSD-net <freebsd-net@freebsd.org> Subject: Re: epair and vnet jail loose connection. Message-ID: <d35c5f27-5d19-9f17-a2ff-947043a7254c@gmail.com> In-Reply-To: <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org> References: <797A280E-5DF2-4276-BB72-E4E1053A19FA@lists.zabbadoz.net> <6086BA6D-3D54-4851-B636-3B32FACB35E9@freebsd.org> <3B5E2D6F-5444-4448-B7C3-704E294368C3@lists.zabbadoz.net> <20220314144451.35f803a9.grembo@freebsd.org> <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
This is a multi-part message in MIME format. --------------S0ZHVqMT6lBFx5c85G4rrsWY Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 14/03/2022 16:09, Kristof Provost wrote: > > On 14 Mar 2022, at 7:44, Michael Gmelin wrote: > > On Sun, 13 Mar 2022 17:53:44 +0000 > "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote: > > On 13 Mar 2022, at 17:45, Michael Gmelin wrote: > > On 13. Mar 2022, at 18:16, Bjoern A. Zeeb > <bzeeb-lists@lists.zabbadoz.net> wrote: > > On 13 Mar 2022, at 16:33, Michael Gmelin wrote: > > It's important to point out that this only happens > with > kern.ncpu>1. With kern.ncpu==1 nothing gets stuck. > > This perfectly fits into the picture, since, as > pointed out by > Johan, > the first commit that is affected[0] is about > multicore support. > > Ignore my ignorance, what is the default of > net.isr.maxthreads and > net.isr.bindthreads (in stable/13) these days? > > My tests were on CURRENT and I’m afk, but according to > cgit[0][1], > max is 1 and bind is 0. > > Would it make sense to repeat the test with max=-1? > > I’d say yes, I’d also bind, but that’s just me. > > I would almost assume Kristof running with -1 by default (but > he can > chime in on that). > > I tried various configuration permutations, all with ncpu=2: > > - 14.0-CURRENT #0 main-n253697-f1d450ddee6 > - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb > - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2 > - net.isr.bindthreads: -1, 0, 1, 2 > - net.isr.dispatch: direct, deferred > > All resulting in the same behavior (hang after a few seconds). > They all > work ok when running on a single core instance (threads=1 in this > case). > > I also ran the same test on 13.0-RELEASE-p7 for > comparison (unsurprisingly, it's ok). > > I placed the script to reproduce the issue on freefall for your > convenience, so running it is as simple as: > > fetch https://people.freebsd.org/~grembo/hang_epair.sh > # inspect content > sh hang_epair.sh > > or, if you feel lucky > > fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh > > > With that script I can also reproduce the problem. > > I’ve experimented with this hack: > > |diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c index > c39434b31b9f..1e6bb07ccc4e 100644 --- a/sys/net/if_epair.c +++ > b/sys/net/if_epair.c @@ -415,7 +415,10 @@ epair_ioctl(struct ifnet > *ifp, u_long cmd, caddr_t data) case SIOCSIFMEDIA: case SIOCGIFMEDIA: > + printf("KP: %s() SIOCGIFMEDIA\n", __func__); sc = ifp->if_softc; + > taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task); + error > = ifmedia_ioctl(ifp, ifr, &sc->media, cmd); break; | > > That kicks the receive code whenever I |ifconfig epair0a|, and I see a > little more traffic every time I do so. > That suggests pretty strongly that there’s an issue with how we > dispatch work to the handler thread. So presumably there’s a race > between epair_menq() and epair_tx_start_deferred(). > > epair_menq() tries to only enqueue the receive work if there’s nothing > in the buf_ring, on the grounds that if there is the previous packet > scheduled the work. Clearly there’s an issue there. > > I’ll try to dig into that in the next few days. > > Kristof > I am willing to try patches, so if you have them i can try them on 14-HEAD, 13-STABLEand 13.1-PRERELEASE. --------------S0ZHVqMT6lBFx5c85G4rrsWY Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> </head> <body> <p><br> </p> <div class="moz-cite-prefix">On 14/03/2022 16:09, Kristof Provost wrote:<br> </div> <blockquote type="cite" cite="mid:A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org"> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> <div style="font-family: sans-serif;"> <div class="markdown" style="white-space: normal;"> <p dir="auto">On 14 Mar 2022, at 7:44, Michael Gmelin wrote:</p> </div> <div class="plaintext" style="white-space: normal;"> <blockquote style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136BCE; color: #136BCE;"> <p dir="auto">On Sun, 13 Mar 2022 17:53:44 +0000 <br> "Bjoern A. Zeeb" <a class="moz-txt-link-rfc2396E" href="mailto:bzeeb-lists@lists.zabbadoz.net"><bzeeb-lists@lists.zabbadoz.net></a> wrote:</p> <blockquote style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"> <p dir="auto">On 13 Mar 2022, at 17:45, Michael Gmelin wrote:</p> <blockquote style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"> <blockquote style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"> <p dir="auto">On 13. Mar 2022, at 18:16, Bjoern A. Zeeb <br> <a class="moz-txt-link-rfc2396E" href="mailto:bzeeb-lists@lists.zabbadoz.net"><bzeeb-lists@lists.zabbadoz.net></a> wrote:</p> <p dir="auto">On 13 Mar 2022, at 16:33, Michael Gmelin wrote:</p> <blockquote style="margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"> <p dir="auto">It's important to point out that this only happens with <br> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.</p> <p dir="auto">This perfectly fits into the picture, since, as pointed out by <br> Johan, <br> the first commit that is affected[0] is about multicore support.</p> </blockquote> <p dir="auto">Ignore my ignorance, what is the default of net.isr.maxthreads and <br> net.isr.bindthreads (in stable/13) these days?</p> </blockquote> <p dir="auto">My tests were on CURRENT and I’m afk, but according to cgit[0][1], <br> max is 1 and bind is 0.</p> <p dir="auto">Would it make sense to repeat the test with max=-1?</p> </blockquote> <p dir="auto">I’d say yes, I’d also bind, but that’s just me.</p> <p dir="auto">I would almost assume Kristof running with -1 by default (but he can <br> chime in on that).</p> </blockquote> <p dir="auto">I tried various configuration permutations, all with ncpu=2:</p> <p dir="auto">- 14.0-CURRENT #0 main-n253697-f1d450ddee6 <br> - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb <br> - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2 <br> - net.isr.bindthreads: -1, 0, 1, 2 <br> - net.isr.dispatch: direct, deferred</p> <p dir="auto">All resulting in the same behavior (hang after a few seconds). They all <br> work ok when running on a single core instance (threads=1 in this case).</p> <p dir="auto">I also ran the same test on 13.0-RELEASE-p7 for <br> comparison (unsurprisingly, it's ok).</p> <p dir="auto">I placed the script to reproduce the issue on freefall for your <br> convenience, so running it is as simple as:</p> <p dir="auto"> fetch <a href="https://people.freebsd.org/~grembo/hang_epair.sh" moz-do-not-send="true" class="moz-txt-link-freetext">https://people.freebsd.org/~grembo/hang_epair.sh</a> <br> # inspect content <br> sh hang_epair.sh</p> <p dir="auto">or, if you feel lucky</p> <p dir="auto"> fetch -o - <a href="https://people.freebsd.org/~grembo/hang_epair.sh" moz-do-not-send="true" class="moz-txt-link-freetext">https://people.freebsd.org/~grembo/hang_epair.sh</a> | sh</p> <br> </blockquote> </div> <div class="markdown" style="white-space: normal;"> <p dir="auto">With that script I can also reproduce the problem.</p> <p dir="auto">I’ve experimented with this hack:</p> <pre style="margin-left: 15px; margin-right: 15px; padding: 5px; border: thin solid gray; overflow-x: auto; max-width: 90vw; background-color: #E4E4E4;"><code>diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c index c39434b31b9f..1e6bb07ccc4e 100644 --- a/sys/net/if_epair.c +++ b/sys/net/if_epair.c @@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data) case SIOCSIFMEDIA: case SIOCGIFMEDIA: + printf("KP: %s() SIOCGIFMEDIA\n", __func__); sc = ifp->if_softc; + taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task); + error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd); break; </code></pre> <p dir="auto">That kicks the receive code whenever I <code>ifconfig epair0a</code>, and I see a little more traffic every time I do so.<br> That suggests pretty strongly that there’s an issue with how we dispatch work to the handler thread. So presumably there’s a race between epair_menq() and epair_tx_start_deferred().</p> <p dir="auto">epair_menq() tries to only enqueue the receive work if there’s nothing in the buf_ring, on the grounds that if there is the previous packet scheduled the work. Clearly there’s an issue there.</p> <p dir="auto">I’ll try to dig into that in the next few days.</p> <p dir="auto">Kristof</p> </div> </div> </blockquote> I am willing to try patches, so if you have them i can try them on 14-HEAD, 13-STABLEand 13.1-PRERELEASE.<br> <br> <br> </body> </html> --------------S0ZHVqMT6lBFx5c85G4rrsWY--
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?d35c5f27-5d19-9f17-a2ff-947043a7254c>