Date:      Mon, 14 Mar 2022 16:38:16 +0100
From:      Johan Hendriks <joh.hendriks@gmail.com>
To:        Kristof Provost <kp@FreeBSD.org>, Michael Gmelin <grembo@freebsd.org>
Cc:        "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net>, "Patrick M. Hausen" <hausen@punkt.de>, FreeBSD-net <freebsd-net@freebsd.org>
Subject:   Re: epair and vnet jail loose connection.
Message-ID:  <d35c5f27-5d19-9f17-a2ff-947043a7254c@gmail.com>
In-Reply-To: <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org>
References:  <797A280E-5DF2-4276-BB72-E4E1053A19FA@lists.zabbadoz.net> <6086BA6D-3D54-4851-B636-3B32FACB35E9@freebsd.org> <3B5E2D6F-5444-4448-B7C3-704E294368C3@lists.zabbadoz.net> <20220314144451.35f803a9.grembo@freebsd.org> <A7AF5067-8E41-4FFA-A69C-EE347466F5C6@FreeBSD.org>



On 14/03/2022 16:09, Kristof Provost wrote:
>
> On 14 Mar 2022, at 7:44, Michael Gmelin wrote:
>
>     On Sun, 13 Mar 2022 17:53:44 +0000
>     "Bjoern A. Zeeb" <bzeeb-lists@lists.zabbadoz.net> wrote:
>
>         On 13 Mar 2022, at 17:45, Michael Gmelin wrote:
>
>                 On 13. Mar 2022, at 18:16, Bjoern A. Zeeb
>                 <bzeeb-lists@lists.zabbadoz.net> wrote:
>
>                 On 13 Mar 2022, at 16:33, Michael Gmelin wrote:
>
>                     It's important to point out that this only happens
>                     with
>                     kern.ncpu>1. With kern.ncpu==1 nothing gets stuck.
>
>                     This perfectly fits into the picture, since, as
>                     pointed out by
>                     Johan,
>                     the first commit that is affected[0] is about
>                     multicore support.
>
>                 Forgive my ignorance, but what are the defaults of
>                 net.isr.maxthreads and
>                 net.isr.bindthreads (in stable/13) these days?
>
>             My tests were on CURRENT and I’m afk, but according to
>             cgit[0][1],
>             max is 1 and bind is 0.
>
>             Would it make sense to repeat the test with max=-1?
>
>         I’d say yes, I’d also bind, but that’s just me.
>
>         I would almost assume Kristof is running with -1 by default
>         (but he can chime in on that).
>
>     I tried various configuration permutations, all with ncpu=2:
>
>     - 14.0-CURRENT #0 main-n253697-f1d450ddee6
>     - 13.1-BETA1 #0 releng/13.1-n249974-ad329796bdb
>     - net.isr.maxthreads: -1 (which results in 2 threads), 1, 2
>     - net.isr.bindthreads: -1, 0, 1, 2
>     - net.isr.dispatch: direct, deferred
>
>     All of them result in the same behavior (hang after a few
>     seconds). They all
>     work ok when running on a single-core instance (threads=1 in this
>     case).
>
>     I also ran the same test on 13.0-RELEASE-p7 for
>     comparison (unsurprisingly, it's ok).
>
>     I placed the script to reproduce the issue on freefall for your
>     convenience, so running it is as simple as:
>
>     fetch https://people.freebsd.org/~grembo/hang_epair.sh
>     # inspect content
>     sh hang_epair.sh
>
>     or, if you feel lucky
>
>     fetch -o - https://people.freebsd.org/~grembo/hang_epair.sh | sh
>
>
> With that script I can also reproduce the problem.
>
> I’ve experimented with this hack:
>
> diff --git a/sys/net/if_epair.c b/sys/net/if_epair.c
> index c39434b31b9f..1e6bb07ccc4e 100644
> --- a/sys/net/if_epair.c
> +++ b/sys/net/if_epair.c
> @@ -415,7 +415,10 @@ epair_ioctl(struct ifnet *ifp, u_long cmd, caddr_t data)
>
>         case SIOCSIFMEDIA:
>         case SIOCGIFMEDIA:
> +               printf("KP: %s() SIOCGIFMEDIA\n", __func__);
>                 sc = ifp->if_softc;
> +               taskqueue_enqueue(epair_tasks.tq[0], &sc->queues[0].tx_task);
> +
>                 error = ifmedia_ioctl(ifp, ifr, &sc->media, cmd);
>                 break;
>
> That kicks the receive code whenever I run `ifconfig epair0a`, and I see a 
> little more traffic every time I do so.
> That suggests pretty strongly that there’s an issue with how we 
> dispatch work to the handler thread. So presumably there’s a race 
> between epair_menq() and epair_tx_start_deferred().
>
> epair_menq() tries to only enqueue the receive work if there’s nothing 
> in the buf_ring, on the grounds that if there is, the previous packet 
> already scheduled the work. Clearly there’s an issue there.
>
> I’ll try to dig into that in the next few days.
>
> Kristof
>
I am willing to try patches, so if you have them I can try them on 
14-HEAD, 13-STABLE and 13.1-PRERELEASE.





Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?d35c5f27-5d19-9f17-a2ff-947043a7254c>