From: Eric Joyner
Date: Fri, 31 Jan 2020 11:44:30 -0800
To: Mark Johnston
Cc: Hans Petter Selasky, freebsd-net@freebsd.org
Subject: Re: Issue with epoch_drain_callbacks and unloading iavf(4) [using iflib]
List-Id: Networking and TCP/IP with FreeBSD

On Wed, Jan 29, 2020 at 7:09 PM Mark Johnston wrote:
> On Thu, Jan 30, 2020 at 02:12:05AM +0100, Hans Petter Selasky wrote:
> > On 2020-01-29 22:44, Eric Joyner wrote:
> > > On Wed, Jan 29, 2020 at 1:41 PM Hans Petter Selasky wrote:
> > > > On 2020-01-29 22:30, Eric Joyner wrote:
> > > > > Hi freebsd-net,
> > > > >
> > > > > We've encountered an issue with unloading the iavf(4) driver on
> > > > > FreeBSD 12.1 (and stable). On a VM with two iavf(4) interfaces,
> > > > > if we send heavy traffic to iavf1 and try to kldunload the
> > > > > driver, the kldunload process hangs on iavf0 until iavf1 stops
> > > > > receiving traffic.
> > > > >
> > > > > After some debugging, it looks like epoch_drain_callbacks() [via
> > > > > if_detach_internal()] tries to switch CPUs to run on one that
> > > > > iavf1 is using for RX processing, but since iavf1 is busy, it
> > > > > can't make the switch, so cpu_switch() just hangs and nothing
> > > > > happens until iavf1's RX thread stops being busy.
> > > > >
> > > > > I can work around this by inserting a kern_yield(PRI_USER)
> > > > > somewhere in one of the iavf txrx functions that iflib calls
> > > > > into (e.g. iavf_isc_rxd_available), but that's not a proper fix.
> > > > > Does anyone know what to do to prevent this from happening?
> > > > >
> > > > > Wildly guessing, does maybe epoch_drain_callbacks() need a
> > > > > higher priority than the PI_SOFT used in the group taskqueues
> > > > > used in iflib's RX processing?
> > > >
> > > > Hi,
> > > >
> > > > Which scheduler is this? ULE or BSD?
> > > >
> > > > EPOCH(9) expects some level of round-robin scheduling on the same
> > > > priority level. Setting a higher priority on EPOCH(9) might cause
> > > > epoch to start spinning w/o letting the lower-priority thread
> > > > which holds the EPOCH() section finish.
> > > >
> > > > --HPS
> > >
> > > Hi Hans,
> > >
> > > kern.sched.name gives me "ULE"
> >
> > Hi Eric,
> >
> > epoch_drain_callbacks() depends on epoch_call_task() getting
> > execution time; it is executed from a GTASKQUEUE at PI_SOFT. Also,
> > epoch_drain_callbacks() runs at the priority of the calling thread,
> > and if this is lower than PI_SOFT and a gtaskqueue is spinning
> > heavily, then that won't work.
>
> I think we can fix this so that the epoch_drain_callbacks() caller's
> priority does not matter. Eric, can you try the patch below? It is a
> bit hacky, but the idea is to ensure that all pending callbacks can be
> invoked, and then wait for each cb task to run through at least once.
> In your case I think the callback task threads should still be able to
> run.
>
> > For a single-CPU system you will be toast in this situation
> > regardless, if there is no free time on a CPU for EPOCH().
> >
> > In general, if epoch_call_task() doesn't get execution time, you
> > will have a problem.
>
> That's not the issue here, though there is at least one related
> problem: if a thread in an epoch read section is preempted and the CPU
> is consumed by a high-priority receive processing thread, for example
> due to a DOS, we have no mechanism to boost the priority of the
> preempted thread except by calling epoch_synchronize_wait(). In other
> words, nothing guarantees that deferred callbacks will eventually get
> executed in such a scenario.
>
> diff --git a/sys/kern/subr_epoch.c b/sys/kern/subr_epoch.c
> index 0a477e1d6f7b..4ed7e7e11a3e 100644
> --- a/sys/kern/subr_epoch.c
> +++ b/sys/kern/subr_epoch.c
> @@ -74,15 +74,19 @@ typedef struct epoch_record {
>  	volatile struct epoch_tdlist er_tdlist;
>  	volatile uint32_t er_gen;
>  	uint32_t er_cpuid;
> +	int er_drain_state;
>  } __aligned(EPOCH_ALIGN) *epoch_record_t;
>
> +#define	EPOCH_DRAIN_START	2
> +#define	EPOCH_DRAIN_RUNNING	1
> +#define	EPOCH_DRAIN_DONE	0
> +
>  struct epoch {
>  	struct ck_epoch e_epoch __aligned(EPOCH_ALIGN);
>  	epoch_record_t e_pcpu_record;
>  	int e_idx;
>  	int e_flags;
>  	struct sx e_drain_sx;
> -	struct mtx e_drain_mtx;
>  	volatile int e_drain_count;
>  	const char *e_name;
>  };
> @@ -335,7 +339,6 @@ epoch_alloc(const char *name, int flags)
>  	epoch->e_idx = epoch_count;
>  	epoch->e_name = name;
>  	sx_init(&epoch->e_drain_sx, "epoch-drain-sx");
> -	mtx_init(&epoch->e_drain_mtx, "epoch-drain-mtx", NULL, MTX_DEF);
>  	allepochs[epoch_count++] = epoch;
>  	return (epoch);
>  }
> @@ -348,7 +351,6 @@ epoch_free(epoch_t epoch)
>  	allepochs[epoch->e_idx] = NULL;
>  	epoch_wait(global_epoch);
>  	uma_zfree_pcpu(pcpu_zone_record, epoch->e_pcpu_record);
> -	mtx_destroy(&epoch->e_drain_mtx);
>  	sx_destroy(&epoch->e_drain_sx);
>  	free(epoch, M_EPOCH);
>  }
> @@ -699,14 +701,24 @@ epoch_call_task(void *arg __unused)
>  	epoch_t epoch;
>  	ck_stack_t cb_stack;
>  	int i, npending, total;
> +	bool draining;
> +
> +	KASSERT(curthread->td_pinned > 0,
> +	    ("%s: callback task thread is not pinned", __func__));
>
>  	ck_stack_init(&cb_stack);
>  	critical_enter();
>  	epoch_enter(global_epoch);
> -	for (total = i = 0; i < epoch_count; i++) {
> +	for (total = i = 0, draining = false; i < epoch_count; i++) {
>  		if (__predict_false((epoch = allepochs[i]) == NULL))
>  			continue;
>  		er = epoch_currecord(epoch);
> +		if (atomic_load_int(&er->er_drain_state) == EPOCH_DRAIN_START) {
> +			atomic_store_int(&er->er_drain_state,
> +			    EPOCH_DRAIN_RUNNING);
> +			draining = true;
> +		}
> +
>  		record = &er->er_record;
>  		if ((npending = record->n_pending) == 0)
>  			continue;
> @@ -728,6 +740,20 @@ epoch_call_task(void *arg __unused)
>  		next = CK_STACK_NEXT(cursor);
>  		entry->function(entry);
>  	}
> +
> +	if (__predict_false(draining)) {
> +		epoch_enter(global_epoch);
> +		for (i = 0; i < epoch_count; i++) {
> +			if (__predict_false((epoch = allepochs[i]) == NULL))
> +				continue;
> +			er = epoch_currecord(epoch);
> +			if (atomic_load_int(&er->er_drain_state) ==
> +			    EPOCH_DRAIN_RUNNING)
> +				atomic_store_int(&er->er_drain_state,
> +				    EPOCH_DRAIN_DONE);
> +		}
> +		epoch_exit(global_epoch);
> +	}
>  }
>
>  int
> @@ -769,27 +795,18 @@ in_epoch(epoch_t epoch)
>  }
>
>  static void
> -epoch_drain_cb(struct epoch_context *ctx)
> +epoch_drain_handler(struct ck_epoch *global __unused,
> +    ck_epoch_record_t *cr __unused, void *arg __unused)
>  {
> -	struct epoch *epoch =
> -	    __containerof(ctx, struct epoch_record, er_drain_ctx)->er_parent;
> -
> -	if (atomic_fetchadd_int(&epoch->e_drain_count, -1) == 1) {
> -		mtx_lock(&epoch->e_drain_mtx);
> -		wakeup(epoch);
> -		mtx_unlock(&epoch->e_drain_mtx);
> -	}
> +	maybe_yield();
>  }
>
>  void
>  epoch_drain_callbacks(epoch_t epoch)
>  {
>  	epoch_record_t er;
> -	struct thread *td;
> -	int was_bound;
> -	int old_pinned;
> -	int old_cpu;
> -	int cpu;
> +	int cpu, state;
> +	bool pending;
>
>  	WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
>  	    "epoch_drain_callbacks() may sleep!");
> @@ -802,45 +819,28 @@ epoch_drain_callbacks(epoch_t epoch)
>  		return;
>  #endif
>  	DROP_GIANT();
> -
>  	sx_xlock(&epoch->e_drain_sx);
> -	mtx_lock(&epoch->e_drain_mtx);
>
> -	td = curthread;
> -	thread_lock(td);
> -	old_cpu = PCPU_GET(cpuid);
> -	old_pinned = td->td_pinned;
> -	was_bound = sched_is_bound(td);
> -	sched_unbind(td);
> -	td->td_pinned = 0;
> +	/* Make sure that all pending callbacks are available. */
> +	ck_epoch_synchronize_wait(&epoch->e_epoch, epoch_drain_handler, NULL);
>
> -	CPU_FOREACH(cpu)
> -		epoch->e_drain_count++;
>  	CPU_FOREACH(cpu) {
>  		er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
> -		sched_bind(td, cpu);
> -		epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
> +		atomic_store_int(&er->er_drain_state, EPOCH_DRAIN_START);
> +		GROUPTASK_ENQUEUE(DPCPU_ID_PTR(cpu, epoch_cb_task));
>  	}
>
> -	/* restore CPU binding, if any */
> -	if (was_bound != 0) {
> -		sched_bind(td, old_cpu);
> -	} else {
> -		/* get thread back to initial CPU, if any */
> -		if (old_pinned != 0)
> -			sched_bind(td, old_cpu);
> -		sched_unbind(td);
> -	}
> -	/* restore pinned after bind */
> -	td->td_pinned = old_pinned;
> -
> -	thread_unlock(td);
> -
> -	while (epoch->e_drain_count != 0)
> -		msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
> +	do {
> +		pending = false;
> +		CPU_FOREACH(cpu) {
> +			er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
> +			state = atomic_load_int(&er->er_drain_state);
> +			if (state != EPOCH_DRAIN_DONE)
> +				pending = true;
> +		}
> +		pause("edrain", 1);
> +	} while (pending);
>
> -	mtx_unlock(&epoch->e_drain_mtx);
>  	sx_xunlock(&epoch->e_drain_sx);
> -
>  	PICKUP_GIANT();
>  }
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

Hi Mark,

I applied your patch to a FreeBSD-current VM and reran the test I was
running originally, and I still see the same behavior. Your patch
doesn't appear to fix the problem.

- Eric