From owner-freebsd-net@freebsd.org Thu Jan 30 03:09:21 2020
Sender: Mark Johnston
Date: Wed, 29 Jan 2020 22:09:11 -0500
From: Mark Johnston
To: Hans Petter Selasky
Cc: Eric Joyner, freebsd-net@freebsd.org
Subject: Re: Issue with epoch_drain_callbacks and unloading iavf(4) [using iflib]
Message-ID: <20200130030911.GA15281@spy>
References: <0e2e97f2-df75-3c6f-9bdd-e8c2ab7bf79e@selasky.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
List-Id: Networking and TCP/IP with FreeBSD

On Thu, Jan 30, 2020 at 02:12:05AM +0100, Hans Petter Selasky wrote:
> On 2020-01-29 22:44, Eric Joyner wrote:
> > On Wed, Jan 29, 2020 at 1:41 PM Hans Petter Selasky wrote:
> > > On 2020-01-29 22:30, Eric Joyner wrote:
> > > > Hi freebsd-net,
> > > >
> > > > We've encountered an issue with unloading the iavf(4) driver on FreeBSD
> > > > 12.1 (and stable). On a VM with two iavf(4) interfaces, if we send heavy
> > > > traffic to iavf1 and try to kldunload the driver, the kldunload process
> > > > hangs on iavf0 until iavf1 stops receiving traffic.
> > > >
> > > > After some debugging, it looks like epoch_drain_callbacks() [via
> > > > if_detach_internal()] tries to switch CPUs to run on one that iavf1 is
> > > > using for RX processing, but since iavf1 is busy, it can't make the
> > > > switch, so cpu_switch() just hangs and nothing happens until iavf1's
> > > > RX thread stops being busy.
> > > >
> > > > I can work around this by inserting a kern_yield(PRI_USER) somewhere in
> > > > one of the iavf txrx functions that iflib calls into (e.g.
> > > > iavf_isc_rxd_available), but that's not a proper fix. Does anyone know
> > > > what to do to prevent this from happening?
> > > >
> > > > Wildly guessing, does maybe epoch_drain_callbacks() need a higher
> > > > priority than the PI_SOFT used in the group taskqueues used in iflib's
> > > > RX processing?
> > >
> > > Hi,
> > >
> > > Which scheduler is this? ULE or BSD?
> > >
> > > EPOCH(9) expects some level of round-robin scheduling on the same
> > > priority level. Setting a higher priority on EPOCH(9) might cause epoch
> > > to start spinning w/o letting the lower-priority thread which holds the
> > > EPOCH() section finish.
> > >
> > > --HPS
> >
> > Hi Hans,
> >
> > kern.sched.name gives me "ULE"
>
> Hi Eric,
>
> epoch_drain_callbacks() depends on epoch_call_task() getting execution
> time; epoch_call_task() is executed from a GTASKQUEUE at PI_SOFT. Also,
> epoch_drain_callbacks() runs at the priority of the calling thread, and
> if that is lower than PI_SOFT and a gtaskqueue is spinning heavily, the
> drain won't make progress.

I think we can fix this so that the epoch_drain_callbacks() caller's
priority does not matter.  Eric, can you try the patch below?  It is a
bit hacky, but the idea is to ensure that all pending callbacks can be
invoked, and then wait for each callback task to run through at least
once.  In your case I think the callback task threads should still be
able to run.

> For a single-CPU system you will be toast in this situation regardless,
> if there is no free time on a CPU for EPOCH().
>
> In general, if epoch_call_task() doesn't get execution time, you will
> have a problem.

That's not the issue here, though there is at least one related problem:
if a thread in an epoch read section is preempted and the CPU is
consumed by a high-priority receive processing thread, for example due
to a DoS, we have no mechanism to boost the priority of the preempted
thread except by calling epoch_synchronize_wait().  In other words,
nothing guarantees that deferred callbacks will eventually get executed
in such a scenario.
diff --git a/sys/kern/subr_epoch.c b/sys/kern/subr_epoch.c
index 0a477e1d6f7b..4ed7e7e11a3e 100644
--- a/sys/kern/subr_epoch.c
+++ b/sys/kern/subr_epoch.c
@@ -74,15 +74,19 @@ typedef struct epoch_record {
 	volatile struct epoch_tdlist er_tdlist;
 	volatile uint32_t er_gen;
 	uint32_t er_cpuid;
+	int er_drain_state;
 } __aligned(EPOCH_ALIGN)	*epoch_record_t;
 
+#define	EPOCH_DRAIN_START	2
+#define	EPOCH_DRAIN_RUNNING	1
+#define	EPOCH_DRAIN_DONE	0
+
 struct epoch {
 	struct ck_epoch e_epoch __aligned(EPOCH_ALIGN);
 	epoch_record_t e_pcpu_record;
 	int	e_idx;
 	int	e_flags;
 	struct sx e_drain_sx;
-	struct mtx e_drain_mtx;
 	volatile int e_drain_count;
 	const char *e_name;
 };
@@ -335,7 +339,6 @@ epoch_alloc(const char *name, int flags)
 	epoch->e_idx = epoch_count;
 	epoch->e_name = name;
 	sx_init(&epoch->e_drain_sx, "epoch-drain-sx");
-	mtx_init(&epoch->e_drain_mtx, "epoch-drain-mtx", NULL, MTX_DEF);
 	allepochs[epoch_count++] = epoch;
 	return (epoch);
 }
@@ -348,7 +351,6 @@ epoch_free(epoch_t epoch)
 	allepochs[epoch->e_idx] = NULL;
 	epoch_wait(global_epoch);
 	uma_zfree_pcpu(pcpu_zone_record, epoch->e_pcpu_record);
-	mtx_destroy(&epoch->e_drain_mtx);
 	sx_destroy(&epoch->e_drain_sx);
 	free(epoch, M_EPOCH);
 }
@@ -699,14 +701,24 @@ epoch_call_task(void *arg __unused)
 	epoch_t epoch;
 	ck_stack_t cb_stack;
 	int i, npending, total;
+	bool draining;
+
+	KASSERT(curthread->td_pinned > 0,
+	    ("%s: callback task thread is not pinned", __func__));
 
 	ck_stack_init(&cb_stack);
 	critical_enter();
 	epoch_enter(global_epoch);
-	for (total = i = 0; i < epoch_count; i++) {
+	for (total = i = 0, draining = false; i < epoch_count; i++) {
 		if (__predict_false((epoch = allepochs[i]) == NULL))
 			continue;
 		er = epoch_currecord(epoch);
+		if (atomic_load_int(&er->er_drain_state) == EPOCH_DRAIN_START) {
+			atomic_store_int(&er->er_drain_state,
+			    EPOCH_DRAIN_RUNNING);
+			draining = true;
+		}
+
 		record = &er->er_record;
 		if ((npending = record->n_pending) == 0)
 			continue;
@@ -728,6 +740,20 @@ epoch_call_task(void *arg __unused)
 		next = CK_STACK_NEXT(cursor);
 		entry->function(entry);
 	}
+
+	if (__predict_false(draining)) {
+		epoch_enter(global_epoch);
+		for (i = 0; i < epoch_count; i++) {
+			if (__predict_false((epoch = allepochs[i]) == NULL))
+				continue;
+			er = epoch_currecord(epoch);
+			if (atomic_load_int(&er->er_drain_state) ==
+			    EPOCH_DRAIN_RUNNING)
+				atomic_store_int(&er->er_drain_state,
+				    EPOCH_DRAIN_DONE);
+		}
+		epoch_exit(global_epoch);
+	}
 }
 
 int
@@ -769,27 +795,18 @@ in_epoch(epoch_t epoch)
 }
 
 static void
-epoch_drain_cb(struct epoch_context *ctx)
+epoch_drain_handler(struct ck_epoch *global __unused,
+    ck_epoch_record_t *cr __unused, void *arg __unused)
 {
-	struct epoch *epoch =
-	    __containerof(ctx, struct epoch_record, er_drain_ctx)->er_parent;
-
-	if (atomic_fetchadd_int(&epoch->e_drain_count, -1) == 1) {
-		mtx_lock(&epoch->e_drain_mtx);
-		wakeup(epoch);
-		mtx_unlock(&epoch->e_drain_mtx);
-	}
+	maybe_yield();
 }
 
 void
 epoch_drain_callbacks(epoch_t epoch)
 {
 	epoch_record_t er;
-	struct thread *td;
-	int was_bound;
-	int old_pinned;
-	int old_cpu;
-	int cpu;
+	int cpu, state;
+	bool pending;
 
 	WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL,
 	    "epoch_drain_callbacks() may sleep!");
@@ -802,45 +819,28 @@ epoch_drain_callbacks(epoch_t epoch)
 		return;
 #endif
 	DROP_GIANT();
-
 	sx_xlock(&epoch->e_drain_sx);
-	mtx_lock(&epoch->e_drain_mtx);
 
-	td = curthread;
-	thread_lock(td);
-	old_cpu = PCPU_GET(cpuid);
-	old_pinned = td->td_pinned;
-	was_bound = sched_is_bound(td);
-	sched_unbind(td);
-	td->td_pinned = 0;
+	/* Make sure that all pending callbacks are available. */
+	ck_epoch_synchronize_wait(&epoch->e_epoch, epoch_drain_handler, NULL);
 
-	CPU_FOREACH(cpu)
-		epoch->e_drain_count++;
 	CPU_FOREACH(cpu) {
 		er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
-		sched_bind(td, cpu);
-		epoch_call(epoch, &epoch_drain_cb, &er->er_drain_ctx);
+		atomic_store_int(&er->er_drain_state, EPOCH_DRAIN_START);
+		GROUPTASK_ENQUEUE(DPCPU_ID_PTR(cpu, epoch_cb_task));
 	}
 
-	/* restore CPU binding, if any */
-	if (was_bound != 0) {
-		sched_bind(td, old_cpu);
-	} else {
-		/* get thread back to initial CPU, if any */
-		if (old_pinned != 0)
-			sched_bind(td, old_cpu);
-		sched_unbind(td);
-	}
-	/* restore pinned after bind */
-	td->td_pinned = old_pinned;
-
-	thread_unlock(td);
-
-	while (epoch->e_drain_count != 0)
-		msleep(epoch, &epoch->e_drain_mtx, PZERO, "EDRAIN", 0);
+	do {
+		pending = false;
+		CPU_FOREACH(cpu) {
+			er = zpcpu_get_cpu(epoch->e_pcpu_record, cpu);
+			state = atomic_load_int(&er->er_drain_state);
+			if (state != EPOCH_DRAIN_DONE)
+				pending = true;
+		}
+		pause("edrain", 1);
+	} while (pending);
 
-	mtx_unlock(&epoch->e_drain_mtx);
 	sx_xunlock(&epoch->e_drain_sx);
-
 	PICKUP_GIANT();
 }