From owner-freebsd-current@FreeBSD.ORG Mon May 16 15:52:12 2011
From: John Baldwin <jhb@freebsd.org>
To: Andriy Gapon
Cc: Max Laier, FreeBSD current <freebsd-current@freebsd.org>, neel@freebsd.org, Peter Grehan
Date: Mon, 16 May 2011 11:52:10 -0400
Subject: Re: proposed smp_rendezvous change
Message-Id: <201105161152.10458.jhb@freebsd.org>
In-Reply-To: <4DD0011D.40000@FreeBSD.org>
References: <4DCD357D.6000109@FreeBSD.org> <4DCFEE33.5090808@FreeBSD.org> <4DD0011D.40000@FreeBSD.org>
List-Id: Discussions about the use of FreeBSD-current

On Sunday, May 15, 2011 12:36:45 pm Andriy Gapon wrote:
> on 15/05/2011 18:16 John Baldwin said the following:
> > On 5/15/11 10:53 AM, Andriy Gapon wrote:
> >> on 15/05/2011 10:12 Andriy Gapon said the following:
> >>> on 14/05/2011 18:25 John Baldwin said the following:
> >>>> Hmmm, so this is not actually sufficient.  NetApp ran into a very
> >>>> similar race with virtual CPUs in BHyVe.  In their case, because
> >>>> virtual CPUs are threads that can be preempted, they have a chance
> >>>> at a longer race.
> >>>>
> >>>> The problem that they see is that even though the values have been
> >>>> updated, the next CPU to start a rendezvous can clear
> >>>> smp_rv_waiters[2] to zero before one of the other CPUs notices that
> >>>> it has finished.
> >>>
> >>> As a follow-up to my previous question: have you noticed that in my
> >>> patch no slave CPU actually waits/spins on smp_rv_waiters[2]?  It's
> >>> always only the master CPU (and under smp_ipi_mtx).
> >>
> >> Here's a cleaner version of my approach to the fix.
> >> This one does not remove the initial wait on smp_rv_waiters[0] in
> >> smp_rendezvous_action() and thus does not renumber all the
> >> smp_rv_waiters[] members, so it should hopefully be clearer.
> >>
> >> Index: sys/kern/subr_smp.c
> >> ===================================================================
> >> --- sys/kern/subr_smp.c	(revision 221943)
> >> +++ sys/kern/subr_smp.c	(working copy)
> >> @@ -110,7 +110,7 @@ static void (*volatile smp_rv_setup_func)(void *ar
> >>  static void (*volatile smp_rv_action_func)(void *arg);
> >>  static void (*volatile smp_rv_teardown_func)(void *arg);
> >>  static void *volatile smp_rv_func_arg;
> >> -static volatile int smp_rv_waiters[3];
> >> +static volatile int smp_rv_waiters[4];
> >>
> >>  /*
> >>   * Shared mutex to restrict busywaits between smp_rendezvous() and
> >> @@ -338,11 +338,15 @@ smp_rendezvous_action(void)
> >>
> >>  	/* spin on exit rendezvous */
> >>  	atomic_add_int(&smp_rv_waiters[2], 1);
> >> -	if (local_teardown_func == smp_no_rendevous_barrier)
> >> +	if (local_teardown_func == smp_no_rendevous_barrier) {
> >> +		atomic_add_int(&smp_rv_waiters[3], 1);
> >>  		return;
> >> +	}
> >>  	while (smp_rv_waiters[2] < smp_rv_ncpus)
> >>  		cpu_spinwait();
> >>
> >> +	atomic_add_int(&smp_rv_waiters[3], 1);
> >> +
> >>  	/* teardown function */
> >>  	if (local_teardown_func != NULL)
> >>  		local_teardown_func(local_func_arg);
> >> @@ -377,6 +381,9 @@ smp_rendezvous_cpus(cpumask_t map,
> >>  	/* obtain rendezvous lock */
> >>  	mtx_lock_spin(&smp_ipi_mtx);
> >>
> >> +	while (smp_rv_waiters[3] < smp_rv_ncpus)
> >> +		cpu_spinwait();
> >> +
> >>  	/* set static function pointers */
> >>  	smp_rv_ncpus = ncpus;
> >>  	smp_rv_setup_func = setup_func;
> >> @@ -385,6 +392,7 @@ smp_rendezvous_cpus(cpumask_t map,
> >>  	smp_rv_func_arg = arg;
> >>  	smp_rv_waiters[1] = 0;
> >>  	smp_rv_waiters[2] = 0;
> >> +	smp_rv_waiters[3] = 0;
> >>  	atomic_store_rel_int(&smp_rv_waiters[0], 0);
> >>
> >>  	/* signal other processors, which will enter the IPI with interrupts off */
> >
> > Ahh, so the bump is after the change.  I do think this will still be ok
> > and I probably just didn't explain it well to Neel.  I wonder, though,
> > whether the bump shouldn't happen until after the call of the local
> > teardown function?
>
> That should not be necessary according to my understanding, unless we have
> a reason to care about potentially disappearing function text.  The
> teardown function pointer and argument are locally cached, and I don't see
> any problems with a new rendezvous being set up and run while teardowns of
> the previous one are still running.
>
> On the other hand, you might be onto something here.  If the argument is
> allocated in temporary storage (e.g. on the stack) and the teardown
> function actually uses the argument, then it's not safe to deallocate the
> argument until all of the teardowns are done.  In that case the master CPU
> should wait not only until all actions are done, but also until all
> teardowns are done.
> The current code doesn't do that.  And I am not sure whether we actually
> want to add this safety net or just document the behavior and put the
> burden on smp_rendezvous API users.
As we chatted about on IRC, I think we should err on the side of making this
API less complex (so that all of the actions for a given rendezvous finish
before the next rendezvous begins), as this stuff is already fairly complex.
We should only add more complexity if we really need it.

> BTW, I don't think that we ever use smp_rendezvous() in its full glory,
> i.e. with a non-trivial setup, action, and teardown (except perhaps for
> jkim's stress test).  OpenSolaris seems to be doing just fine with the
> following three types of simpler CPU cross-calls (as they call them):
>
> 1. async - a master CPU requests an action to be executed on the target
> CPUs but doesn't wait for anything
> - we have no explicit equivalent for this
>
> 2. call - a master CPU requests an action to be executed on the target
> CPUs and waits until all the target CPUs have finished executing it
> - this would be equivalent to smp_rendezvous_cpus(no_barrier, action,
> no_barrier)
>
> 3. synch - a master CPU requests an action to be executed on the target
> CPUs and waits until all the target CPUs have finished executing it and
> the slave CPUs are also waiting for their siblings to be done
> - this would be equivalent to smp_rendezvous_cpus(no_barrier, action,
> NULL)
>
> It seems that our current code doesn't use/need more than that either.
> But I really like the capability that smp_rendezvous_cpus() potentially
> provides.

I would be fine with stripping down our rendezvous API a bit.  They could
all still use a single IPI and handler, but perhaps present a few wrapper
APIs that provide simpler semantics like this.  Back to my statement above:
we've had this uber-generic rendezvous API for a while, but if we don't
really need it to be quite so complex, I'd be happy to dumb the API down to
what we really need so that the implementation and use can be simpler.

-- 
John Baldwin