From: Andriy Gapon <avg@FreeBSD.org>
Date: Sun, 15 May 2011 10:01:38 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-user@freebsd.org
Subject: svn commit: r221951 - user/avg/xcpu/sys/kern

Author: avg
Date: Sun May 15 10:01:38 2011
New Revision: 221951
URL: http://svn.freebsd.org/changeset/base/221951

Log:
  smp_rendezvous: correctly support concurrent complex rendezvous

  This is a big change.  A problem with the previous approach can be
  demonstrated by this example:
  - processors Ps1 and Ps2 issue complex rendezvous requests R1 and R2
    to processors Pt1 and Pt2
  - because of request timing, Pt1 gets R1 before R2 while Pt2 gets R2
    before R1
  - Pt1 executes the setup action of R1 and waits for Pt2 to do the same
  - Pt2 executes the setup action of R2 and waits for Pt1 to do the same
  - deadlock

  The new approach is to check for incoming events while waiting between
  actions of a complex rendezvous.  In the example above this means that
  Pt1, after executing the setup action of R1, would notice R2 and execute
  R2's setup action as well.  Thus actions of different requests are
  allowed to interleave.  The only important limitation is that target
  CPUs must not leave the rendezvous handling context until all of a
  rendezvous' actions have been executed.

  To implement this approach, all rendezvous sequence points are
  coordinated via the rendezvous' master CPU.  Slave CPUs just execute
  one action and go back to the main event processing loop.  The master
  CPU waits for all targeted slave CPUs to execute the current action
  (and while waiting it also checks for incoming events).  When all the
  slave CPUs have finished the current action, the master CPU signals
  them to execute the next action.
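  To make the interleaving concrete, here is a condensed sketch of the new
  waiting loop.  The function name smp_rendezvous_wait_sketch is made up
  for illustration; the body is a simplified paraphrase of
  smp_rendezvous_wait() and smp_rendezvous_action_pass() from the diff
  below (the cpustop check is omitted), not extra code added by this
  commit.  While the master CPU spins for its own rendezvous to complete,
  it keeps draining actions posted by other masters, which is what lets
  Pt1 pick up R2 while R1 is still in progress:

	static void
	smp_rendezvous_wait_sketch(volatile struct smp_rendezvous_data *rv)
	{
		int count;

		/* Wait for our own rendezvous while servicing incoming ones. */
		while (atomic_load_acq_int(&rv->smp_rv_waiters) < rv->smp_rv_ncpus) {
			/* Run one posted action for each CPU in the senders mask. */
			count = smp_rendezvous_action_pass();
			if (count != 0)
				atomic_add_int(DPCPU_PTR(smp_rv_count), -count);
			else
				cpu_spinwait();
		}
	}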
  Some implementation details:
  - the master CPU increases the work count for slave CPUs by the total
    number of non-no-barrier actions in a rendezvous, so that the slave
    CPUs do not leave the rendezvous handling context until all actions
    are done
  - the master CPU processes incoming requests in between posting its own
    actions
  - the master CPU bumps its own work count before posting an action, so
    that other CPUs know that the master CPU can process incoming events
    without being IPI-ed
  - the master CPU processes all incoming events before leaving
    smp_rendezvous, so that it fully completes any incoming complex
    rendezvous without leaving the context
  (A minimal caller sketch illustrating these semantics follows the diff
  below.)

Modified:
  user/avg/xcpu/sys/kern/subr_smp.c

Modified: user/avg/xcpu/sys/kern/subr_smp.c
==============================================================================
--- user/avg/xcpu/sys/kern/subr_smp.c	Sun May 15 09:59:29 2011	(r221950)
+++ user/avg/xcpu/sys/kern/subr_smp.c	Sun May 15 10:01:38 2011	(r221951)
@@ -114,15 +114,13 @@ SYSCTL_INT(_kern_smp, OID_AUTO, forward_

 /* Variables needed for SMP rendezvous. */
 struct smp_rendezvous_data {
-	void (*smp_rv_setup_func)(void *arg);
 	void (*smp_rv_action_func)(void *arg);
-	void (*smp_rv_teardown_func)(void *arg);
 	void *smp_rv_func_arg;
-	volatile int smp_rv_waiters[3];
+	int smp_rv_waiters;
 	int smp_rv_ncpus;
 };

-static DPCPU_DEFINE(struct smp_rendezvous_data, smp_rv_data);
+volatile static DPCPU_DEFINE(struct smp_rendezvous_data, smp_rv_data);
 static volatile DPCPU_DEFINE(cpumask_t, smp_rv_senders);
 static volatile DPCPU_DEFINE(cpumask_t, smp_rv_count);

@@ -412,100 +410,160 @@ static void
 smp_rendezvous_action_body(int cpu)
 {
 	volatile struct smp_rendezvous_data *rv;
-	void *local_func_arg;
-	void (*local_setup_func)(void*);
-	void (*local_action_func)(void*);
-	void (*local_teardown_func)(void*);
-	int ncpus;
+	void *func_arg;
+	void (*action_func)(void*);

 	rv = DPCPU_ID_PTR(cpu, smp_rv_data);
-	local_func_arg = rv->smp_rv_func_arg;
-	local_setup_func = rv->smp_rv_setup_func;
-	local_action_func = rv->smp_rv_action_func;
-	local_teardown_func = rv->smp_rv_teardown_func;
-	ncpus = rv->smp_rv_ncpus;
+	func_arg = rv->smp_rv_func_arg;
+	action_func = rv->smp_rv_action_func;

-	/* setup function */
-	if (local_setup_func != smp_no_rendevous_barrier) {
-		if (local_setup_func != NULL)
-			local_setup_func(local_func_arg);
-
-		/* spin on entry rendezvous */
-		atomic_add_int(&rv->smp_rv_waiters[0], 1);
-		while (rv->smp_rv_waiters[0] < ncpus)
-			cpu_spinwait();
-	}
+	if (action_func != NULL)
+		action_func(func_arg);

-	/* action function */
-	if (local_action_func != NULL)
-		local_action_func(local_func_arg);
-
-	if (local_teardown_func != smp_no_rendevous_barrier) {
-		/* spin on exit rendezvous */
-		atomic_add_int(&rv->smp_rv_waiters[1], 1);
-		while (rv->smp_rv_waiters[1] < ncpus)
-			cpu_spinwait();
+	atomic_add_int(&rv->smp_rv_waiters, 1);
+}

-		atomic_add_int(&rv->smp_rv_waiters[2], 1);
+static int
+smp_rendezvous_action_pass(void)
+{
+	cpumask_t mask;
+	int count;
+	int cpu;

-		/* teardown function */
-		if (local_teardown_func != NULL)
-			local_teardown_func(local_func_arg);
-	} else
-		atomic_add_int(&rv->smp_rv_waiters[2], 1);
+	count = 0;
+	mask = DPCPU_GET(smp_rv_senders);
+	if (mask == 0)
+		return (count);
+
+	atomic_clear_acq_int(DPCPU_PTR(smp_rv_senders), mask);
+	do {
+		count++;
+		cpu = ffs(mask) - 1;
+		mask &= ~(1 << cpu);
+		smp_rendezvous_action_body(cpu);
+	} while (mask != 0);
+
+	return (count);
 }

 void
 smp_rendezvous_action(void)
 {
-	cpumask_t mask;
 	int pending;
 	int count;
-	int cpu;

 	pending = DPCPU_GET(smp_rv_count);
 	while (pending != 0) {
 		KASSERT(pending > 0, ("negative pending rendezvous count"));
-		mask = DPCPU_GET(smp_rv_senders);
-		if (mask == 0) {
+		count = smp_rendezvous_action_pass();
+		if (count == 0) {
 			cpu_spinwait();
 			continue;
 		}
-
-		atomic_clear_acq_int(DPCPU_PTR(smp_rv_senders), mask);
-		count = 0;
-		do {
-			count++;
-			cpu = ffs(mask) - 1;
-			mask &= ~(1 << cpu);
-			smp_rendezvous_action_body(cpu);
-		} while (mask != 0);
-
 		pending = atomic_fetchadd_int(DPCPU_PTR(smp_rv_count), -count);
 		pending -= count;
 	}
 }

 static void
-smp_rendezvous_wait(void)
+smp_rendezvous_wait(volatile struct smp_rendezvous_data *rv)
 {
-	volatile struct smp_rendezvous_data *rv;
 	int ncpus;
+	int count;

-	rv = DPCPU_PTR(smp_rv_data);
 	ncpus = rv->smp_rv_ncpus;
-	while (atomic_load_acq_int(&rv->smp_rv_waiters[2]) < ncpus) {
+	while (atomic_load_acq_int(&rv->smp_rv_waiters) < ncpus) {
 		/* check for incoming events */
 		if ((stopping_cpus & (1 << curcpu)) != 0)
 			cpustop_handler();
-		else if (DPCPU_GET(smp_rv_senders) != 0)
-			smp_rendezvous_action();
+
+		count = smp_rendezvous_action_pass();
+		if (count != 0)
+			atomic_add_int(DPCPU_PTR(smp_rv_count), -count);
 		else
 			cpu_spinwait();
 	}
 }
+
+static void
+smp_rendezvous_notify(int cpu, int nhold)
+{
+	int send_ipi;
+	int x;
+
+	if (cpu == curcpu)
+		return;
+
+	KASSERT((DPCPU_ID_GET(cpu, smp_rv_senders) & (1 << curcpu)) == 0,
+	    ("curcpu bit is set in target cpu's senders map"));
+
+	/*
+	 * If this is a first action of a rendezvous invocation
+	 * and we are the first to send an event, then send an ipi.
+	 */
+	send_ipi = 0;
+	if (nhold != 0) {
+		x = atomic_fetchadd_int(DPCPU_ID_PTR(cpu, smp_rv_count), nhold);
+		send_ipi = (x == 0);
+		if (!send_ipi)
+			coalesced_ipi_count++;
+	}
+
+	atomic_set_rel_int(DPCPU_ID_PTR(cpu, smp_rv_senders), 1 << curcpu);
+	if (send_ipi)
+		ipi_cpu(cpu, IPI_RENDEZVOUS);
+}
+
+static void
+smp_rendezvous_cpus_oneaction(cpumask_t map,
+    volatile struct smp_rendezvous_data *rv,
+    int nhold,
+    void (*action_func)(void *),
+    void *arg)
+{
+	cpumask_t tmp;
+	int ncpus;
+	int cpu;
+
+	/*
+	 * If nhold != 0, then this is the first call of a more complex
+	 * rendezvous invocation, so we need to setup some common data
+	 * for all calls and possibly send IPIs to target CPUs.
+	 * Otherwise, we just need to set action_func and the incoming
+	 * rendezvous bits.
+	 */
+	if (nhold != 0) {
+		ncpus = 0;
+		map &= all_cpus;
+		tmp = map;
+		while (tmp != 0) {
+			cpu = ffs(tmp) - 1;
+			tmp &= ~(1 << cpu);
+			ncpus++;
+		}
+
+		rv->smp_rv_ncpus = ncpus;
+		rv->smp_rv_func_arg = arg;
+	}
+
+	rv->smp_rv_action_func = action_func;
+	atomic_store_rel_int(&rv->smp_rv_waiters, 0);
+
+	tmp = map;
+	while (tmp != 0) {
+		cpu = ffs(tmp) - 1;
+		tmp &= ~(1 << cpu);
+
+		smp_rendezvous_notify(cpu, nhold);
+	}
+
+	/* Check if the current CPU is in the map */
+	if ((map & (1 << curcpu)) != 0)
+		smp_rendezvous_action_body(curcpu);
+}
+
 /*
  * Execute the action_func on the targeted CPUs.
  *
@@ -542,9 +600,7 @@ smp_rendezvous_cpus(cpumask_t map,
 	void *arg)
 {
 	volatile struct smp_rendezvous_data *rv;
-	cpumask_t tmp;
-	int ncpus;
-	int cpu;
+	int nhold;

 	if (!smp_started) {
 		if (setup_func != NULL)
@@ -556,65 +612,55 @@ smp_rendezvous_cpus(cpumask_t map,
 		return;
 	}

-	map &= all_cpus;
-	tmp = map;
-	ncpus = 0;
-	while (tmp != 0) {
-		cpu = ffs(tmp) - 1;
-		tmp &= ~(1 << cpu);
-		ncpus++;
-	}
+	nhold = 1;
+	if (setup_func != smp_no_rendevous_barrier)
+		nhold++;
+	if (teardown_func != smp_no_rendevous_barrier)
+		nhold++;

 	spinlock_enter();
+	rv = DPCPU_PTR(smp_rv_data);
+
+	/* Let other CPUs know that we are here, no need to IPI us. */
+	atomic_add_int(DPCPU_PTR(smp_rv_count), 1);

 	/*
-	 * First wait for an event previously posted by us to complete (if any),
-	 * this is done in case the event was asynchronous.
+	 * First wait for an event previously posted by us, if any, to complete.
 	 * In the future we could have a queue of outgoing events instead
 	 * of a single item.
 	 */
-	smp_rendezvous_wait();
-
-	/* set static function pointers */
-	rv = DPCPU_PTR(smp_rv_data);
-	rv->smp_rv_ncpus = ncpus;
-	rv->smp_rv_setup_func = setup_func;
-	rv->smp_rv_action_func = action_func;
-	rv->smp_rv_teardown_func = teardown_func;
-	rv->smp_rv_func_arg = arg;
-	rv->smp_rv_waiters[1] = 0;
-	rv->smp_rv_waiters[2] = 0;
-	atomic_store_rel_int(&rv->smp_rv_waiters[0], 0);
-
-	/* signal other CPUs, which will enter the IPI with interrupts off */
-	tmp = map;
-	while (tmp != 0) {
-		cpu = ffs(tmp) - 1;
-		tmp &= ~(1 << cpu);
+	smp_rendezvous_wait(rv);

-		if (cpu == curcpu)
-			continue;
+	if (setup_func != smp_no_rendevous_barrier) {
+		smp_rendezvous_cpus_oneaction(map, rv, nhold, setup_func, arg);
+		smp_rendezvous_wait(rv);
+		nhold = 0;
+	}

-		KASSERT(
-		    (DPCPU_ID_GET(cpu, smp_rv_senders) & (1 << curcpu)) == 0,
-		    ("curcpu bit is set in target cpu's senders map"));
-
-		/* if we are the first to send an event, then send an ipi */
-		if (atomic_fetchadd_int(DPCPU_ID_PTR(cpu, smp_rv_count), 1)
-		    == 0)
-			ipi_cpu(cpu, IPI_RENDEZVOUS);
-		else
-			coalesced_ipi_count++;
+	smp_rendezvous_cpus_oneaction(map, rv, nhold, action_func, arg);

-		atomic_set_rel_int(DPCPU_ID_PTR(cpu, smp_rv_senders),
-		    1 << curcpu);
+	/*
+	 * For now be compatible with historic smp_rendezvous semantics:
+	 * if teardown_func is smp_no_rendevous_barrier, then the master
+	 * CPU waits for target CPUs to complete the main action.
+	 * This means that we do not support completely async semantics
+	 * (where the master CPU "fires and forgets") for the time being.
+	 */
+#ifdef notyet
+	if (teardown_func != smp_no_rendevous_barrier) {
+		smp_rendezvous_wait(rv);
+		smp_rendezvous_cpus_oneaction(map, rv, 0, teardown_func, arg);
 	}
+#else
+	smp_rendezvous_wait(rv);
+	if (teardown_func != smp_no_rendevous_barrier)
+		smp_rendezvous_cpus_oneaction(map, rv, 0, teardown_func, arg);
+#endif
+
+	/* We are done with our work. */
+	atomic_add_int(DPCPU_PTR(smp_rv_count), -1);

-	/* Check if the current CPU is in the map */
-	if ((map & (1 << curcpu)) != 0)
-		smp_rendezvous_action_body(curcpu);
-
-	if (teardown_func == smp_no_rendevous_barrier)
-		smp_rendezvous_wait();
+	/* Process all pending incoming actions. */
+	smp_rendezvous_action();

 	spinlock_exit();
 }
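As referenced above, here is a minimal caller sketch of the API under the
new scheme.  The names example_counter, example_action and
example_rendezvous are made up for illustration; only smp_rendezvous_cpus(),
smp_no_rendevous_barrier and all_cpus come from the tree.  A full
setup/action/teardown rendezvous is now executed as up to three one-action
passes, while passing smp_no_rendevous_barrier for both setup_func and
teardown_func keeps nhold at 1 and gives a single-action rendezvous:

	static u_int example_counter;

	static void
	example_action(void *arg)
	{

		/* Runs on each targeted CPU in rendezvous (IPI) context. */
		atomic_add_int(&example_counter, 1);
	}

	static void
	example_rendezvous(void)
	{

		/* Single-action rendezvous: no setup or teardown barrier. */
		smp_rendezvous_cpus(all_cpus, smp_no_rendevous_barrier,
		    example_action, smp_no_rendevous_barrier, NULL);
	}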