From: Eric Joyner
Date: Fri, 31 Jan 2020 11:44:30 -0800
To: Mark Johnston
Cc: Hans Petter Selasky, freebsd-net@freebsd.org
Subject: Re: Issue with epoch_drain_callbacks and unloading iavf(4) [using iflib]
List-Id: Networking and TCP/IP with FreeBSD

On Wed, Jan 29, 2020 at 7:09 PM Mark Johnston wrote:
> On Thu, Jan 30, 2020 at 02:12:05AM +0100, Hans Petter Selasky wrote:
> > On 2020-01-29 22:44, Eric Joyner wrote:
> > > On Wed, Jan 29, 2020 at 1:41 PM Hans Petter Selasky wrote:
> > > > On 2020-01-29 22:30, Eric Joyner wrote:
> > > > > Hi freebsd-net,
> > > > >
> > > > > We've encountered an issue with unloading the iavf(4) driver on
> > > > > FreeBSD 12.1 (and stable). On a VM with two iavf(4) interfaces,
> > > > > if we send heavy traffic to iavf1 and try to kldunload the
> > > > > driver, the kldunload process hangs on iavf0 until iavf1 stops
> > > > > receiving traffic.
> > > > >
> > > > > After some debugging, it looks like epoch_drain_callbacks() [via
> > > > > if_detach_internal()] tries to switch CPUs to run on one that
> > > > > iavf1 is using for RX processing, but since iavf1 is busy, it
> > > > > can't make the switch, so cpu_switch() just hangs and nothing
> > > > > happens until iavf1's RX thread stops being busy.
> > > > >
> > > > > I can work around this by inserting a kern_yield(PRI_USER)
> > > > > somewhere in one of the iavf txrx functions that iflib calls
> > > > > into (e.g. iavf_isc_rxd_available), but that's not a proper fix.
> > > > > Does anyone know what to do to prevent this from happening?
> > > > >
> > > > > Wildly guessing, does maybe epoch_drain_callbacks() need a
> > > > > higher priority than the PI_SOFT used in the group taskqueues
> > > > > used in iflib's RX processing?
> > > >
> > > > Hi,
> > > >
> > > > Which scheduler is this? ULE or BSD?
> > > >
> > > > EPOCH(9) expects some level of round-robin scheduling on the same
> > > > priority level. Setting a higher priority on EPOCH(9) might cause
> > > > epoch to start spinning w/o letting the lower-priority thread
> > > > which holds the EPOCH() section finish.
> > > >
> > > > --HPS
> > >
> > > Hi Hans,
> > >
> > > kern.sched.name gives me "ULE"
> >
> > Hi Eric,
> >
> > epoch_drain_callbacks() depends on epoch_call_task() getting
> > execution time; it is executed from a GTASKQUEUE at PI_SOFT. Also,
> > epoch_drain_callbacks() runs at the priority of the calling thread,
> > and if this is lower than PI_SOFT and a gtaskqueue is spinning
> > heavily, then that won't work.
>
> I think we can fix this so that the epoch_drain_callbacks() caller's
> priority does not matter. Eric, can you try the patch below? It is a
> bit hacky, but the idea is to ensure that all pending callbacks can be
> invoked, and then wait for each cb task to run through at least once.
> In your case I think the callback task threads should still be able to
> run.
>
> > For a single-CPU system you will be toast in this situation
> > regardless, if there is no free time on a CPU for EPOCH().
> >
> > In general, if epoch_call_task() doesn't get execution time, you
> > will have a problem.
>
> That's not the issue here, though there is at least one related
> problem: if a thread in an epoch read section is preempted and the CPU
> is consumed by a high-priority receive processing thread, for example
> due to a DOS, we have no mechanism to boost the priority of the
> preempted thread except by calling epoch_synchronize_wait(). In other
> words, nothing guarantees that deferred callbacks will eventually get
> executed in such a scenario.
>
> diff --git a/sys/kern/subr_epoch.c b/sys/kern/subr_epoch.c
> index 0a477e1d6f7b..4ed7e7e11a3e 100644
> --- a/sys/kern/subr_epoch.c
> +++ b/sys/kern/subr_epoch.c
> @@ -74,15 +74,19 @@ typedef struct epoch_record {
>  	volatile struct epoch_tdlist er_tdlist;
>  	volatile uint32_t er_gen;
>  	uint32_t er_cpuid;
> +	int er_drain_state;
>  } __aligned(EPOCH_ALIGN) *epoch_record_t;
>
> +#define	EPOCH_DRAIN_START	2
> +#define	EPOCH_DRAIN_RUNNING	1
> +#define	EPOCH_DRAIN_DONE	0
> +
>  struct epoch {
>  	struct ck_epoch e_epoch __aligned(EPOCH_ALIGN);
>  	epoch_record_t e_pcpu_record;
>  	int e_idx;
>  	int e_flags;
>  	struct sx e_drain_sx;
> -	struct mtx e_drain_mtx;
>  	volatile int e_drain_count;
>  	const char *e_name;
>  };
> @@ -335,7 +339,6 @@ epoch_alloc(const char *name, int flags)
>  	epoch->e_idx = epoch_count;
>  	epoch->e_name = name;
>  	sx_init(&epoch->e_drain_sx, "epoch-drain-sx");
> -	mtx_init(&epoch->e_drain_mtx, "epoch-drain-mtx", NULL, MTX_DEF);
>  	allepochs[epoch_count++] = epoch;
>  	return (epoch);
>  }
> @@ -348,7 +351,6 @@ epoch_free(epoch_t epoch)
>  	allepochs[epoch->e_idx] = NULL;
>  	epoch_wait(global_epoch);
>  	uma_zfree_pcpu(pcpu_zone_record, epoch->e_pcpu_record);
> -	mtx_destroy(&epoch->e_drain_mtx);
>  	sx_destroy(&epoch->e_drain_sx);
>  	free(epoch, M_EPOCH);
>  }
> @@ -699,14 +701,24 @@ epoch_call_task(void *arg __unused)
>  	epoch_t epoch;
>  	ck_stack_t cb_stack;
>  	int i, npending, total;
> +	bool draining;
> +
> +	KASSERT(curthread->td_pinned > 0,
> +	    ("%s: callback task thread is not pinned", __func__));
>
>  	ck_stack_init(&cb_stack);
>  	critical_enter();
>  	epoch_enter(global_epoch);
> -	for (total = i = 0; i < epoch_count; i++) {
> +	for (total = i = 0, draining = false; i < epoch_count; i++) {
>  		if (__predict_false((epoch = allepochs[i]) == NULL))
>  			continue;
>  		er = epoch_currecord(epoch);
> +		if (atomic_load_int(&er->er_drain_state) == EPOCH_DRAIN_START) {
> +			atomic_store_int(&er->er_drain_state,
> +			    EPOCH_DRAIN_RUNNING);
> +			draining = true;
> +		}
> +
>  		record = &er->er_record;
>  		if ((npending = record->n_pending) == 0)
>  			continue;
> @@ -728,6 +740,20 @@ epoch_call_task(void *arg __unused)
>  		next = CK_STACK_NEXT(cursor);
>  		entry->function(entry);
>  	}
> +
> +	if (__predict_false(draining)) {
> +		epoch_enter(global_epoch);
> +		for (i = 0; i < epoch_count; i++) {
> +			if (__predict_false((epoch = allepochs[i]) == NULL))
> +				continue;
> +			er = epoch_currecord(epoch);
> +			if (atomic_load_int(&er->er_drain_state) ==
> +			    EPOCH_DRAIN_RUNNING)
> +				atomic_store_int(&er->er_drain_state,
> +				    EPOCH_DRAIN_DONE);
> +		}
> +		epoch_exit(global_epoch);
> +	}
>  }
>
>  int
> @@ -769,27 +795,18 @@ in_epoch(epoch_t epoch)
>  }
>
>  static void
> -epoch_drain_cb(struct epoch_context *ctx)
> +epoch_drain_handler(struct ck_epoch *global __unused,
> +    ck_epoch_record_t *cr __unused, void *arg __unused)
>  {
> -	struct epoch *epoch =
> -	    __containerof(ctx, struct epoch_record, er_drain_ctx)->er_parent;
> -
> -	if (atomic_fetchadd_int(&epoch->e_drain_count, -1) == 1) {
> -		mtx_lock(&epoch->e_drain_mtx);
> -		wakeup(epoch);
> -		mtx_unlock(&epoch->e_drain_mtx);
> -	}
> +	maybe_yield();
>  }
>
>  void
>  epoch_drain_callbacks(epoch_t epoch)
>  {
>  	epoch_record_t er;
> -	struct thread *td;
> -	int was_bound;
> -	int old_pinned;
> -	int old_cpu;
> -	int cpu;
> +	int cpu, state;
> +	bool pending;
>
>  	WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
>  	    "epoch_drain_callbacks() may sleep!");
> @@ -802,45 +819,28 @@ epoch_drain_callbacks(epoch_t epoch)
>  		return;
>  #endif
>  	DROP_GIANT();
> -
>  	sx_xlock(&epoch->e_drain_sx);
> -	mtx_lock(&epoch->e_drain_mtx);
>
> -	td = curthread;
> -	thread_lock(td);
> -	old_cpu = PCPU_GET(cpuid);
> -	old_pinned = td->td_pinned;
> -	was_bound = sched_is_bound(td);
> -	sched_unbind(td);
> -	td->td_pinned = 0;
> +	/* Make sure that all pending callbacks are available. */
> +	ck_epoch_synchronize_wait(&epoch->e_epoch, epoch_drain_handler, NULL);
>
> -	CPU_FOREACH(cpu)
> -		epoch->e_drain_count++;
>  	CPU_FOREACH(cpu) {
>  		er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
> -		sched_bind(td, cpu);
> -		epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
> +		atomic_store_int(&er->er_drain_state, EPOCH_DRAIN_START);
> +		GROUPTASK_ENQUEUE(DPCPU_ID_PTR(cpu, epoch_cb_task));
>  	}
>
> -	/* restore CPU binding, if any */
> -	if (was_bound != 0) {
> -		sched_bind(td, old_cpu);
> -	} else {
> -		/* get thread back to initial CPU, if any */
> -		if (old_pinned != 0)
> -			sched_bind(td, old_cpu);
> -		sched_unbind(td);
> -	}
> -	/* restore pinned after bind */
> -	td->td_pinned = old_pinned;
> -
> -	thread_unlock(td);
> -
> -	while (epoch->e_drain_count != 0)
> -		msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
> +	do {
> +		pending = false;
> +		CPU_FOREACH(cpu) {
> +			er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
> +			state = atomic_load_int(&er->er_drain_state);
> +			if (state != EPOCH_DRAIN_DONE)
> +				pending = true;
> +		}
> +		pause("edrain", 1);
> +	} while (pending);
>
> -	mtx_unlock(&epoch->e_drain_mtx);
>  	sx_xunlock(&epoch->e_drain_sx);
> -
>  	PICKUP_GIANT();
>  }
> _______________________________________________
> freebsd-net@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to "freebsd-net-unsubscribe@freebsd.org"

Hi Mark,

I applied your patch to a FreeBSD-current VM and reran the test I was
running originally, and I still see the same behavior. Your patch
doesn't appear to fix the problem.

- Eric