From owner-svn-src-projects@FreeBSD.ORG Tue May 19 14:10:14 2009
Return-Path: 
Delivered-To: svn-src-projects@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id F18CE106564A;
	Tue, 19 May 2009 14:10:14 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from svn.freebsd.org (svn.freebsd.org [IPv6:2001:4f8:fff6::2c])
	by mx1.freebsd.org (Postfix) with ESMTP id DF23D8FC13;
	Tue, 19 May 2009 14:10:14 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from svn.freebsd.org (localhost [127.0.0.1])
	by svn.freebsd.org (8.14.3/8.14.3) with ESMTP id n4JEAE1M099852;
	Tue, 19 May 2009 14:10:14 GMT (envelope-from rwatson@svn.freebsd.org)
Received: (from rwatson@localhost)
	by svn.freebsd.org (8.14.3/8.14.3/Submit) id n4JEAEUo099851;
	Tue, 19 May 2009 14:10:14 GMT (envelope-from rwatson@svn.freebsd.org)
Message-Id: <200905191410.n4JEAEUo099851@svn.freebsd.org>
From: Robert Watson
Date: Tue, 19 May 2009 14:10:14 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-projects@freebsd.org
X-SVN-Group: projects
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: 
Subject: svn commit: r192380 - projects/pnet/sys/net
X-BeenThere: svn-src-projects@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the src " projects" tree"
List-Unsubscribe: ,
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: ,
X-List-Received-Date: Tue, 19 May 2009 14:10:15 -0000

Author: rwatson
Date: Tue May 19 14:10:14 2009
New Revision: 192380
URL: http://svn.freebsd.org/changeset/base/192380

Log:
  Read-locking of netisr2 during dispatch is no longer optional, so update
  comment.  Add an XXXRW because assertions aren't implemented for rmlocks.
  Break out net.isr2.direct into two sysctls: net.isr2.direct_force to
  force direct dispatch to always be used even if work should be processed
  by a workstream, and net.isr2.direct_enable, which allows direct
  dispatch to be used.  Document the three modes of operation we support
  using these flags.

  Add a new direct dispatch mode, "hybrid", in which work destined for the
  current CPU may directly dispatch, but work destined for other CPUs will
  be queued.  Hybrid direct dispatch is only permitted if there is no
  pending work, the netisr isn't already running, and no other thread is
  directly dispatching the workstream, as these could lead to misordering.
  Add a new hybrid direct dispatch counter.

  Fix a bug in the tunable portion of the net.isr2.maxqlimit sysctl and
  document it.  Make it a writable sysctl.  Don't test queue limits during
  registration with an assert -- instead, cap and printf.

  Update stats printout in DDB.

  Default settings remain the same as in the current netisr code: force
  direct dispatch, create a single worker thread, and don't bind it to a
  specific CPU.

Modified:
  projects/pnet/sys/net/netisr2.c

Modified: projects/pnet/sys/net/netisr2.c
==============================================================================
--- projects/pnet/sys/net/netisr2.c	Tue May 19 14:08:21 2009	(r192379)
+++ projects/pnet/sys/net/netisr2.c	Tue May 19 14:10:14 2009	(r192380)
@@ -84,15 +84,13 @@ __FBSDID("$FreeBSD$");
  * acquire a read lock while modifying the set of registered protocols to
  * prevent partially registered or unregistered protocols from being run.
  *
- * We make per-packet use optional so that we can measure the performance
- * impact of providing consistency against run-time registration and
- * deregristration, which is a very uncommon event.
- *
  * The following data structures and fields are protected by this lock:
  *
  * - The np array, including all fields of struct netisr_proto.
  * - The nws array, including all fields of struct netisr_worker.
  * - The nws_array array.
+ *
+ * XXXRW: rmlocks don't support assertions.
  */
 static struct rmlock	netisr_rmlock;
 #define	NETISR_LOCK_INIT()	rm_init(&netisr_rmlock, "netisr", 0)
@@ -108,9 +106,31 @@ static struct rmlock	netisr_rmlock;
 
 SYSCTL_NODE(_net, OID_AUTO, isr2, CTLFLAG_RW, 0, "netisr2");
 
-static int	netisr_direct = 1;	/* Enable direct dispatch. */
-SYSCTL_INT(_net_isr2, OID_AUTO, direct, CTLFLAG_RW,
-    &netisr_direct, 0, "Direct dispatch");
+/*-
+ * Three direct dispatch policies are supported:
+ *
+ * - Always defer: all work is scheduled for a netisr, regardless of context.
+ *
+ * - Always direct: if the executing context allows direct dispatch, always
+ *   direct dispatch.
+ *
+ * - Hybrid: if the executing context allows direct dispatch, and we're
+ *   running on the CPU the work would be done on, then direct dispatch if it
+ *   wouldn't violate ordering constraints on the workstream.
+ *
+ * These policies are captured using two sysctls -- direct_enable allows
+ * direct dispatch, and direct_force forces direct dispatch if enabled.
+ * Notice that changing the global policy could lead to short periods of
+ * disordered processing, but this is considered acceptable as compared to
+ * the complexity of enforcing ordering during policy changes.
+ */
+static int	netisr_direct_force = 1;	/* Always direct dispatch. */
+SYSCTL_INT(_net_isr2, OID_AUTO, direct_force, CTLFLAG_RW,
+    &netisr_direct_force, 0, "Force direct dispatch");
+
+static int	netisr_direct_enable = 1;	/* Enable direct dispatch. */
+SYSCTL_INT(_net_isr2, OID_AUTO, direct_enable, CTLFLAG_RW,
+    &netisr_direct_enable, 0, "Enable direct dispatch");
 
 /*
  * Allow the administrator to limit the number of threads (CPUs) to use for
@@ -130,10 +150,14 @@ TUNABLE_INT("net.isr2.bindthreads", &net
 SYSCTL_INT(_net_isr2, OID_AUTO, bindthreads, CTLFLAG_RD,
     &netisr_bindthreads, 0, "Bind netisr2 threads to CPUs.");
 
-#define	NETISR_MAXQLIMIT	10240
-static int	netisr_maxqlimit = NETISR_MAXQLIMIT;
-TUNABLE_INT("net.isr2.bindthreads", &netisr_bindthreads);
-SYSCTL_INT(_net_isr2, OID_AUTO, maxqlimit, CTLFLAG_RD,
+/*
+ * Limit per-workstream queues to at most net.isr2.maxqlimit, both for
+ * initial configuration and later modification using netisr2_setqlimit().
+ */
+#define	NETISR_DEFAULT_MAXQLIMIT	10240
+static int	netisr_maxqlimit = NETISR_DEFAULT_MAXQLIMIT;
+TUNABLE_INT("net.isr2.maxqlimit", &netisr_maxqlimit);
+SYSCTL_INT(_net_isr2, OID_AUTO, maxqlimit, CTLFLAG_RW,
     &netisr_maxqlimit, 0,
     "Maximum netisr2 per-protocol, per-CPU queue depth.");
 
@@ -181,9 +205,10 @@ struct netisr_work {
 	 * Statistics -- written unlocked, but mostly from curcpu.
 	 */
 	u_int64_t	 nw_dispatched; /* Number of direct dispatches. */
-	u_int64_t	 nw_qdrops;	/* Number of drops. */
-	u_int64_t	 nw_queued;	/* Number of enqueues. */
-	u_int64_t	 nw_handled;	/* Number passed into handler. */
+	u_int64_t	 nw_hybrid_dispatched; /* "" hybrid dispatches. */
+	u_int64_t	 nw_qdrops;	/* "" drops. */
+	u_int64_t	 nw_queued;	/* "" enqueues. */
+	u_int64_t	 nw_handled;	/* "" handled in worker. */
 };
 
 /*
@@ -232,7 +257,8 @@ static u_int	 nws_count;
  * Per-workstream flags.
  */
 #define	NWS_RUNNING	0x00000001	/* Currently running in a thread. */
-#define	NWS_SIGNALED	0x00000002	/* Signal issued. */
+#define	NWS_DISPATCHING	0x00000002	/* Currently being direct-dispatched. */
+#define	NWS_SCHEDULED	0x00000004	/* Signal issued. */
 
 /*
  * Synchronization for each workstream: a mutex protects all mutable fields
@@ -328,9 +354,6 @@ netisr2_register(const struct netisr_han
 	    "%s", name));
 	KASSERT(nhp->nh_qlimit != 0,
 	    ("netisr2_register: nh_qlimit 0 for %s", name));
-	KASSERT(nhp->nh_qlimit <= netisr_maxqlimit,
-	    ("netisr2_register: nh_qlimit (%u) above max %u for %s",
-	    nhp->nh_qlimit, netisr_maxqlimit, name));
 
 	/*
 	 * Initialize global and per-workstream protocol state.
@@ -339,7 +362,13 @@ netisr2_register(const struct netisr_han
 	np[proto].np_handler = nhp->nh_handler;
 	np[proto].np_m2flow = nhp->nh_m2flow;
 	np[proto].np_m2cpu = nhp->nh_m2cpu;
-	np[proto].np_qlimit = nhp->nh_qlimit;
+	if (nhp->nh_qlimit > netisr_maxqlimit) {
+		printf("netisr2_register: %s requested queue limit %u "
+		    "capped to net.isr2.maxqlimit %u\n", name,
+		    nhp->nh_qlimit, netisr_maxqlimit);
+		np[proto].np_qlimit = netisr_maxqlimit;
+	} else
+		np[proto].np_qlimit = nhp->nh_qlimit;
 	np[proto].np_policy = nhp->nh_policy;
 	for (i = 0; i < MAXCPU; i++) {
 		npwp = &nws[i].nws_work[proto];
@@ -670,6 +699,8 @@ netisr2_process_workstream(struct netisr
 
 	KASSERT(nwsp->nws_flags & NWS_RUNNING,
 	    ("netisr2_process_workstream: not running"));
+	KASSERT(!(nwsp->nws_flags & NWS_DISPATCHING),
+	    ("netisr2_process_workstream: dispatching"));
 	if (proto == NETISR_ALLPROT) {
 		for (i = 0; i < NETISR_MAXPROT; i++)
 			netisr2_process_workstream_proto(nwsp, i);
@@ -679,7 +710,9 @@ netisr2_process_workstream(struct netisr
 
 /*
  * SWI handler for netisr2 -- processes prackets in a set of workstreams that
- * it owns.
+ * it owns, woken up by calls to NWS_SIGNAL().  If this workstream is already
+ * being direct dispatched, go back to sleep and wait for the dispatching
+ * thread to wake us up again.
  */
 static void
 swi_net(void *arg)
@@ -691,28 +724,27 @@ swi_net(void *arg)
 
 	NETISR_RLOCK(&tracker);
 	NWS_LOCK(nwsp);
+	KASSERT(!(nwsp->nws_flags & NWS_RUNNING), ("swi_net: running"));
+	if (nwsp->nws_flags & NWS_DISPATCHING)
+		goto out;
 	nwsp->nws_flags |= NWS_RUNNING;
+	nwsp->nws_flags &= ~NWS_SCHEDULED;
 	while (nwsp->nws_pendingwork != 0)
 		netisr2_process_workstream(nwsp, NETISR_ALLPROT);
-	nwsp->nws_flags &= ~(NWS_SIGNALED | NWS_RUNNING);
+	nwsp->nws_flags &= ~NWS_RUNNING;
+out:
 	NWS_UNLOCK(nwsp);
 	NETISR_RUNLOCK(&tracker);
 }
 
 static int
-netisr2_queue_internal(u_int proto, struct mbuf *m, u_int cpuid)
+netisr2_queue_workstream(struct netisr_workstream *nwsp,
+    struct netisr_work *npwp, struct mbuf *m, int *dosignalp)
 {
-	struct netisr_workstream *nwsp;
-	struct netisr_work *npwp;
-	int dosignal, error;
 
-	NETISR_LOCK_ASSERT();
+	NWS_LOCK_ASSERT(nwsp);
 
-	dosignal = 0;
-	error = 0;
-	nwsp = &nws[cpuid];
-	npwp = &nwsp->nws_work[proto];
-	NWS_LOCK(nwsp);
+	*dosignalp = 0;
 	if (npwp->nw_len < npwp->nw_qlimit) {
 		m->m_nextpkt = NULL;
 		if (npwp->nw_head == NULL) {
@@ -726,20 +758,36 @@ netisr2_queue_internal(u_int proto, stru
 		if (npwp->nw_len > npwp->nw_watermark)
 			npwp->nw_watermark = npwp->nw_len;
 		nwsp->nws_pendingwork++;
-		if (!(nwsp->nws_flags & NWS_SIGNALED)) {
-			nwsp->nws_flags |= NWS_SIGNALED;
-			dosignal = 1;	/* Defer until unlocked. */
+		if (!(nwsp->nws_flags & (NWS_SCHEDULED | NWS_RUNNING))) {
+			nwsp->nws_flags |= NWS_SCHEDULED;
+			*dosignalp = 1;	/* Defer until unlocked. */
 		}
-		error = 0;
-	} else
-		error = ENOBUFS;
+		npwp->nw_queued++;
+		return (0);
+	} else {
+		npwp->nw_qdrops++;
+		return (ENOBUFS);
+	}
+}
+
+static int
+netisr2_queue_internal(u_int proto, struct mbuf *m, u_int cpuid)
+{
+	struct netisr_workstream *nwsp;
+	struct netisr_work *npwp;
+	int dosignal, error;
+
+	NETISR_LOCK_ASSERT();
+
+	dosignal = 0;
+	error = 0;
+	nwsp = &nws[cpuid];
+	npwp = &nwsp->nws_work[proto];
+	NWS_LOCK(nwsp);
+	error = netisr2_queue_workstream(nwsp, npwp, m, &dosignal);
 	NWS_UNLOCK(nwsp);
 	if (dosignal)
 		NWS_SIGNAL(nwsp);
-	if (error)
-		npwp->nw_qdrops++;
-	else
-		npwp->nw_queued++;
 	return (error);
 }
 
@@ -781,14 +829,23 @@ netisr_queue(int proto, struct mbuf *m)
 	return (netisr2_queue(proto, m));
 }
 
+/*
+ * Dispatch a packet for netisr2 processing, direct dispatch permitted by
+ * calling context.
+ */
 int
 netisr2_dispatch_src(u_int proto, uintptr_t source, struct mbuf *m)
 {
 	struct rm_priotracker tracker;
 	struct netisr_workstream *nwsp;
 	struct netisr_work *npwp;
+	int dosignal, error;
+	u_int cpuid;
 
-	if (!netisr_direct)
+	/*
+	 * If direct dispatch is entirely disabled, fall back on queueing.
+	 */
+	if (!netisr_direct_enable)
 		return (netisr2_queue_src(proto, source, m));
 
 	KASSERT(proto < NETISR_MAXPROT,
@@ -798,15 +855,92 @@ netisr2_dispatch_src(u_int proto, uintpt
 	    ("netisr2_dispatch_src: invalid proto %u", proto));
 
 	/*
-	 * Borrow current CPU's stats, even if there's no worker.
+	 * If direct dispatch is forced, then unconditionally dispatch
+	 * without a formal CPU selection.  Borrow the current CPU's stats,
+	 * even if there's no worker on it.
+	 */
+	if (netisr_direct_force) {
+		nwsp = &nws[curcpu];
+		npwp = &nwsp->nws_work[proto];
+		npwp->nw_dispatched++;
+		npwp->nw_handled++;
+		np[proto].np_handler(m);
+		NETISR_RUNLOCK(&tracker);
+		return (0);
+	}
+
+	/*
+	 * Otherwise, we execute in a hybrid mode where we will try to direct
+	 * dispatch if we're on the right CPU and the netisr worker isn't
+	 * already running.
	 */
-	nwsp = &nws[curcpu];
+	m = netisr2_selectcpu(&np[proto], source, m, &cpuid);
+	if (m == NULL) {
+		NETISR_RUNLOCK(&tracker);
+		return (ENOBUFS);
+	}
+	sched_pin();
+	if (cpuid != curcpu)
+		goto queue_fallback;
+	nwsp = &nws[cpuid];
 	npwp = &nwsp->nws_work[proto];
-	npwp->nw_dispatched++;
-	npwp->nw_handled++;
+
+	/*-
+	 * We are willing to direct dispatch only if three conditions hold:
+	 *
+	 * (1) The netisr worker isn't already running,
+	 * (2) Another thread isn't already directly dispatching, and
+	 * (3) The netisr hasn't already been woken up.
+	 */
+	NWS_LOCK(nwsp);
+	if (nwsp->nws_flags & (NWS_RUNNING | NWS_DISPATCHING | NWS_SCHEDULED)) {
+		error = netisr2_queue_workstream(nwsp, npwp, m, &dosignal);
+		NWS_UNLOCK(nws);
+		if (dosignal)
+			NWS_SIGNAL(nwsp);
+		sched_unpin();
+		NETISR_RUNLOCK(&tracker);
+		return (error);
+	}
+
+	/*
+	 * The current thread is now effectively the netisr worker, so set
+	 * the dispatching flag to prevent concurrent processing of the
+	 * stream from another thread (even the netisr worker), which could
+	 * otherwise lead to effective misordering of the stream.
+	 */
+	nwsp->nws_flags |= NWS_DISPATCHING;
+	NWS_UNLOCK(nwsp);
 	np[proto].np_handler(m);
+	NWS_LOCK(nwsp);
+	nwsp->nws_flags &= ~NWS_DISPATCHING;
+	npwp->nw_handled++;
+	npwp->nw_hybrid_dispatched++;
+
+	/*
+	 * If other work was enqueued by another thread while we were direct
+	 * dispatching, we need to signal the netisr worker to do that work.
+	 * In the future, we might want to do some of that work in the
+	 * current thread, rather than trigger further context switches.  If
+	 * so, we'll want to establish a reasonable bound on the work done in
+	 * the "borrowed" context.
+	 */
+	if (nwsp->nws_pendingwork != 0) {
+		nwsp->nws_flags |= NWS_SCHEDULED;
+		dosignal = 1;
+	} else
+		dosignal = 0;
+	NWS_UNLOCK(nwsp);
+	if (dosignal)
+		NWS_SIGNAL(nwsp);
 	NETISR_RUNLOCK(&tracker);
 	return (0);
+
+queue_fallback:
+	error = netisr2_queue_internal(proto, m, cpuid);
+	sched_unpin();
+	NETISR_RUNLOCK(&tracker);
+	return (error);
 }
 
 int
@@ -907,8 +1041,9 @@ DB_SHOW_COMMAND(netisr2, db_show_netisr2
 	struct netisr_work *nwp;
 	int cpu, first, proto;
 
-	db_printf("%6s %6s %6s %6s %6s %6s %8s %8s %8s %8s\n", "CPU", "Pend",
-	    "Proto", "Len", "WMark", "Max", "Disp", "Drop", "Queue", "Handle");
+	db_printf("%3s %5s %6s %5s %5s %5s %8s %8s %8s %8s %8s\n", "CPU",
+	    "Pend", "Proto", "Len", "WMark", "Max", "Disp", "HDisp", "Drop",
+	    "Queue", "Handle");
 	for (cpu = 0; cpu < MAXCPU; cpu++) {
 		nwsp = &nws[cpu];
 		if (nwsp->nws_intr_event == NULL)
@@ -919,16 +1054,16 @@ DB_SHOW_COMMAND(netisr2, db_show_netisr2
 				continue;
 			nwp = &nwsp->nws_work[proto];
 			if (first) {
-				db_printf("%6d %6d ", cpu,
+				db_printf("%3d %5d ", cpu,
 				    nwsp->nws_pendingwork);
 				first = 0;
 			} else
-				db_printf("%6s %6s ", "", "");
-			db_printf("%6s %6d %6d %6d %8ju %8ju %8ju %8ju\n",
+				db_printf("%3s %5s ", "", "");
+			db_printf("%6s %5d %5d %5d %8ju %8ju %8ju %8ju %8ju\n",
 			    np[proto].np_name, nwp->nw_len,
 			    nwp->nw_watermark, nwp->nw_qlimit,
-			    nwp->nw_dispatched, nwp->nw_qdrops,
-			    nwp->nw_queued, nwp->nw_handled);
+			    nwp->nw_dispatched, nwp->nw_hybrid_dispatched,
+			    nwp->nw_qdrops, nwp->nw_queued, nwp->nw_handled);
 		}
 	}
 }