Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 14 Mar 2022 09:09:49 -0600
From:      Kristof Provost <kp@FreeBSD.org>
To:        Michael Gmelin <grembo@freebsd.org>
Cc:        "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>, Johan Hendriks <joh.hendriks@gmail.com>, "Patrick M. Hausen" <hausen@punkt.de>, freeBSD-net <freebsd-net@freebsd.org>
Subject:   Re: epair and vnet jail loose connection.
Message-ID:  <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org>
In-Reply-To: <20220314144451.35f803a9.grembo@freebsd.org>
References:  <797A280E-5DF2-4276-BB72-E4E1053A19FA@lists.zabbadoz.net> <6086BA6D-3D54-4851-B636-3B32FACB35E9@freebsd.org> <3B5E2D6F-5444-4448-B7C3-704E294368C3@lists.zabbadoz.net> <20220314144451.35f803a9.grembo@freebsd.org>

next in thread | previous in thread | raw e-mail | index | archive | help

--=_MailMate_3AB37302-0AB7-4470-92AA-E00D57EB46B0_=
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
> On Sun, 13 Mar 2022 17:53:44 +0000
> "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
>
>> On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
>>
>>>> On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
>>>> <bzeeb-lists@lists.zabbadoz.net> wrote:
>>>>
>>>> On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
>>>>> It's important to point out that this only happens with
>>>>> kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.
>>>>>
>>>>> This perfectly fits into the picture, since, as pointed out by
>>>>> Johan,
>>>>> the first commit that is affected[0] is about multicore support.
>>>>
>>>> Ignore my ignorance, what is the default of net.isr.maxthreads and
>>>> net.isr.bindthreads (in stable/13) these days?
>>>>
>>>
>>> My tests were on CURRENT and I’m afk, but according to cgit[0][1],
>>> max is 1 and bind is 0.
>>>
>>> Would it make sense to repeat the test with max=-1?
>>
>> I’d say yes, I’d also bind, but that’s just me.
>>
>> I would almost assume Kristof running with -1 by default (but he can
>> chime in on that).
>
> I tried various configuration permutations, all with ncpu=2:
>
> - 14.0-CURRENT #0 main-n253697-f1d450ddee6
> - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
> - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
> - net.isr.bindthreads: -1, 0, 1, 2
> - net.isr.dispatch: direct, deferred
>
> All resulting in the same behavior (hang after a few seconds). They 
> all
> work ok when running on a single core instance (threads=1 in this 
> case).
>
> I also ran the same test on 13.0-RELEASE-p7 for
> comparison (unsurprisingly, it's ok).
>
> I placed the script to reproduce the issue on freefall for your
> convenience, so running it is as simple as:
>
>     fetch https://people.freebsd.org/~grembo/hang_epair.sh
>     # inspect content
>     sh hang_epair.sh
>
> or, if you feel lucky
>
>     fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
>
With that script I can also reproduce the problem.

I’ve experimented with this hack:

	diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
	index c39434b31b9f..1e6bb07ccc4e 100644
	--- a/sys/net/if_epair.c
	+++ b/sys/net/if_epair.c
	@@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t 
data)

	        case SIOCSIFMEDIA:
	        case SIOCGIFMEDIA:
	+               printf("KP: %s() SIOCGIFMEDIA\n", __func__);
	                sc = ifp->if_softc;
	+               taskqueue_enqueue(epair_tasks.tq[0], 
&sc->queues[0].tx_task);
	+
	                error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
	                break;

That kicks the receive code whenever I `ifconfig epair0a`, and I see a 
little more traffic every time I do so.
That suggests pretty strongly that there’s an issue with how we 
dispatch work to the handler thread. So presumably there’s a race 
between epair_menq() and epair_tx_start_deferred().

epair_menq() tries to only enqueue the receive work if there’s nothing 
in the buf_ring, on the grounds that if there is the previous packet 
scheduled the work. Clearly there’s an issue there.

I’ll try to dig into that in the next few days.

Kristof
--=_MailMate_3AB37302-0AB7-4470-92AA-E00D57EB46B0_=
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html>
<html>
<head>
<meta http-equiv=3D"Content-Type" content=3D"text/xhtml; charset=3Dutf-8"=
>
</head>
<body><div style=3D"font-family: sans-serif;"><div class=3D"markdown" sty=
le=3D"white-space: normal;">
<p dir=3D"auto">On 14 Mar 2022, at 7:44, Michael Gmelin wrote:</p>
</div><div class=3D"plaintext" style=3D"white-space: normal;"><blockquote=
 style=3D"margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136=
BCE; color: #136BCE;"><p dir=3D"auto">On Sun, 13 Mar 2022 17:53:44 +0000
<br>
"Bjoern A. Zeeb" &lt;bzeeb-lists@lists.zabbadoz.net&gt; wrote:</p>
<blockquote style=3D"margin: 0 0 5px; padding-left: 5px; border-left: 2px=
 solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"><p dir=3D"au=
to">On 13 Mar 2022, at 17:45, Michael Gmelin wrote:</p>
<blockquote style=3D"margin: 0 0 5px; padding-left: 5px; border-left: 2px=
 solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"><blockquote =
style=3D"margin: 0 0 5px; padding-left: 5px; border-left: 2px solid #136B=
CE; border-left-color: #4B89CF; color: #4B89CF;"><p dir=3D"auto">On 13. M=
ar 2022, at 18:16, Bjoern A. Zeeb
<br>
&lt;bzeeb-lists@lists.zabbadoz.net&gt; wrote:</p>
<p dir=3D"auto">=EF=BB=BFOn 13 Mar 2022, at 16:33, Michael Gmelin wrote:<=
/p>
<blockquote style=3D"margin: 0 0 5px; padding-left: 5px; border-left: 2px=
 solid #136BCE; border-left-color: #4B89CF; color: #4B89CF;"><p dir=3D"au=
to">It's important to point out that this only happens with
<br>
kern.ncpu&gt;1. With kern.ncpu=3D=3D1 nothing gets stuck.</p>
<p dir=3D"auto">This perfectly fits into the picture, since, as pointed o=
ut by
<br>
Johan,
<br>
the first commit that is affected[0] is about multicore support.</p>
</blockquote><p dir=3D"auto">Ignore my ignorance, what is the default of =
net.isr.maxthreads and
<br>
net.isr.bindthreads (in stable/13) these days?</p>
</blockquote><p dir=3D"auto">My tests were on CURRENT and I=E2=80=99m afk=
, but according to cgit[0][1],
<br>
max is 1 and bind is 0.</p>
<p dir=3D"auto">Would it make sense to repeat the test with max=3D-1?</p>=

</blockquote><p dir=3D"auto">I=E2=80=99d say yes, I=E2=80=99d also bind, =
but that=E2=80=99s just me.</p>
<p dir=3D"auto">I would almost assume Kristof running with -1 by default =
(but he can
<br>
chime in on that).</p>
</blockquote><p dir=3D"auto">I tried various configuration permutations, =
all with ncpu=3D2:</p>
<p dir=3D"auto">- 14.0-CURRENT #0 main-n253697-f1d450ddee6
<br>
- 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
<br>
- net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
<br>
- net.isr.bindthreads: -1, 0, 1, 2
<br>
- net.isr.dispatch: direct, deferred</p>
<p dir=3D"auto">All resulting in the same behavior (hang after a few seco=
nds). They all
<br>
work ok when running on a single core instance (threads=3D1 in this case)=
=2E</p>
<p dir=3D"auto">I also ran the same test on 13.0-RELEASE-p7 for
<br>
comparison (unsurprisingly, it's ok).</p>
<p dir=3D"auto">I placed the script to reproduce the issue on freefall fo=
r your
<br>
convenience, so running it is as simple as:</p>
<p dir=3D"auto">    fetch <a href=3D"https://people.freebsd.org/~grembo/h=
ang_epair.sh">https://people.freebsd.org/~grembo/hang_epair.sh</a>;
<br>
    # inspect content
<br>
    sh hang_epair.sh</p>
<p dir=3D"auto">or, if you feel lucky</p>
<p dir=3D"auto">    fetch -o - <a href=3D"https://people.freebsd.org/~gre=
mbo/hang_epair.sh">https://people.freebsd.org/~grembo/hang_epair.sh</a>; |=
 sh</p>
<br></blockquote></div>
<div class=3D"markdown" style=3D"white-space: normal;">
<p dir=3D"auto">With that script I can also reproduce the problem.</p>
<p dir=3D"auto">I=E2=80=99ve experimented with this hack:</p>
<pre style=3D"margin-left: 15px; margin-right: 15px; padding: 5px; border=
: thin solid gray; overflow-x: auto; max-width: 90vw; background-color: #=
E4E4E4;"><code>diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
index c39434b31b9f..1e6bb07ccc4e 100644
--- a/sys/net/if_epair.c
+++ b/sys/net/if_epair.c
@@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t d=
ata)

        case SIOCSIFMEDIA:
        case SIOCGIFMEDIA:
+               printf(&quot;KP: %s() SIOCGIFMEDIA\n&quot;, __func__);
                sc =3D ifp-&gt;if_softc;
+               taskqueue_enqueue(epair_tasks.tq[0], &amp;sc-&gt;queues[0=
].tx_task);
+
                error =3D ifmedia_ioctl(ifp, ifr, &amp;sc-&gt;media, cmd)=
;
                break;
</code></pre>
<p dir=3D"auto">That kicks the receive code whenever I <code>ifconfig epa=
ir0a</code>, and I see a little more traffic every time I do so.<br>
That suggests pretty strongly that there=E2=80=99s an issue with how we d=
ispatch work to the handler thread. So presumably there=E2=80=99s a race =
between epair_menq() and epair_tx_start_deferred().</p>
<p dir=3D"auto">epair_menq() tries to only enqueue the receive work if th=
ere=E2=80=99s nothing in the buf_ring, on the grounds that if there is th=
e previous packet scheduled the work. Clearly there=E2=80=99s an issue th=
ere.</p>
<p dir=3D"auto">I=E2=80=99ll try to dig into that in the next few days.</=
p>
<p dir=3D"auto">Kristof</p>

</div></div></body>

</html>

--=_MailMate_3AB37302-0AB7-4470-92AA-E00D57EB46B0_=--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?A7AF5067-8E41-4FFA-A69C-EE347466F5C6>