From owner-freebsd-current@FreeBSD.ORG Sun May 15 04:33:53 2011
From: Max Laier <max@love2party.net>
To: freebsd-current@freebsd.org
Cc: Andriy Gapon, Peter Grehan
Date: Sun, 15 May 2011 00:33:40 -0400
Subject: Re: proposed smp_rendezvous change
Message-Id: <201105150033.40788.max@love2party.net>
In-Reply-To: <4DCE9EF0.3050803@FreeBSD.org>
References: <4DCD357D.6000109@FreeBSD.org> <4DCE9EF0.3050803@FreeBSD.org>

On Saturday 14 May 2011 11:25:36 John Baldwin wrote:
> On 5/13/11 9:43 AM, Andriy Gapon wrote:
> > This is a change in the vein of what I've been doing in the xcpu
> > branch and it's supposed to fix the issue exposed by the recent
> > commit that (probably unintentionally) stress-tests smp_rendezvous
> > in TSC code.
> >
> > Non-essential changes:
> > - ditch the initial, and in my opinion useless, pre-setup rendezvous
> >   in smp_rendezvous_action()
>
> As long as IPIs ensure all data is up to date (I think this is
> certainly true on x86) that is fine.  Presumably sending an IPI has an
> implicit store barrier on all other platforms as well?

Note that if the barrier is required on any platform, then we have to
move all the initializations after the _acq; otherwise it is indeed
pointless.
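To make that concrete, the top of smp_rendezvous_action() would then
have to look roughly like the sketch below.  This is a hand-written
illustration, not a patch; I am only reusing the subr_smp.c names to
show the ordering I mean:

	void *local_func_arg;
	void (*local_setup_func)(void *);
	void (*local_action_func)(void *);
	void (*local_teardown_func)(void *);

	/*
	 * Sketch only: if we depend on the entry _acq, rather than on an
	 * implicit store barrier in the IPI delivery, to make the
	 * initiator's stores visible, then the caching of the smp_rv_*
	 * values has to happen after that _acq.  Done before it, as in
	 * the current code, the acquire orders nothing we care about.
	 */
	atomic_add_acq_int(&smp_rv_waiters[0], 1);

	local_func_arg = smp_rv_func_arg;
	local_setup_func = smp_rv_setup_func;
	local_action_func = smp_rv_action_func;
	local_teardown_func = smp_rv_teardown_func;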
> > Essential changes (the fix):
> > - re-use the freed smp_rv_waiters[2] to indicate that a slave/target
> >   is really fully done with the rendezvous (i.e. it is not going to
> >   access any members of the smp_rv_* pseudo-structure)
> > - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to
> >   not re-use the smp_rv_* pseudo-structure too early
>
> Hmmm, so this is not actually sufficient.  NetApp ran into a very
> similar race with virtual CPUs in BHyVe.  In their case, because
> virtual CPUs are threads that can be preempted, they have a chance at
> a longer race.
>
> The problem that they see is that even though the values have been
> updated, the next CPU to start a rendezvous can clear
> smp_rv_waiters[2] to zero before one of the other CPUs notices that it
> has finished.
>
> A more concrete example:
>
> Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the
> initiator.
>
> All three CPUs run the rendezvous down to the point of increasing
> smp_rv_waiters[2].  CPU 1 is the last one to rendezvous for some
> reason, so it notices smp_rv_waiters[2] being correct first and exits
> the rendezvous.  It immediately does a new rendezvous.  Even with your
> change, it will see that smp_rv_waiters[2] is correct and will proceed
> to clear it before starting the next rendezvous.
>
> Meanwhile, one of the other CPUs never sees the final update by CPU 1
> to smp_rv_waiters[2]; instead it sees the value go from 2 to 0 (e.g.
> if CPU 1's two writes were coalesced, or, in the case of the
> hypervisor, if CPU 2 or 3's backing thread is preempted and both
> writes have posted before the thread backing CPU 2 or 3 gets to run
> again).
>
> At this point that CPU (call it CPU 2) will spin forever.  When CPU 1
> sends a second rendezvous IPI it will be held in the local APIC of
> CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will
> never bump smp_rv_waiters[2] and the entire system will deadlock.
>
> NetApp's solution is to add a monotonically increasing generation
> count to the rendezvous data set, which is cached in the rendezvous
> handler, and to exit the last synchronization point if either
> smp_rv_waiters[2] is high enough or the generation count has changed.
>
> I think this would also handle the case that the TSC changes have
> provoked.  I'm not sure if this would be sufficient for the error case
> Max Laier has encountered.

It seems to address the issue, although it is a bit difficult to
understand why this is correct.  I'd like to see some comments in the
code instead of just the commit log, but otherwise ... please go ahead
with this.  I'll open a new thread about the second issue we have been
seeing wrt rmlocks, which is related, but different.

> NetApp's patch:
>
> Extra protection for consecutive smp_rendezvous() calls.
>
> We need this because interrupts are not really disabled when
> executing smp_rendezvous_action() when running inside a
> virtual machine.
>
> In particular, it is possible for the last cpu to increment
> smp_rv_waiters[2] so that the exit rendezvous condition is
> satisfied and then get interrupted by a hardware interrupt.
>
> When the execution of the interrupted vcpu continues, it is
> possible that one of the other vcpus that did *not* get
> interrupted exited the old smp_rendezvous() and started a
> new smp_rendezvous().  In this case 'smp_rv_waiters[2]' is
> again reset to 0.  This would mean that the vcpu that got
> interrupted will spin forever on the exit rendezvous.
>
> We protect against this by spinning on the exit rendezvous only
> if the generation of the smp_rendezvous() matches what we
> started out with before incrementing 'smp_rv_waiters[2]'.
>
> Differences ...
>
> ==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ====
>
> @@ -127,6 +127,7 @@
>  static void (*volatile smp_rv_teardown_func)(void *arg);
>  static void * volatile smp_rv_func_arg;
>  static volatile int smp_rv_waiters[3];
> +static volatile int smp_rv_generation;
>
>  /*
>   * Shared mutex to restrict busywaits between smp_rendezvous() and
> @@ -418,6 +419,7 @@
>  void
>  smp_rendezvous_action(void)
>  {
> +	int generation;
>  	void* local_func_arg = smp_rv_func_arg;
>  	void (*local_setup_func)(void*) = smp_rv_setup_func;
>  	void (*local_action_func)(void*) = smp_rv_action_func;
> @@ -457,11 +459,14 @@
>  	if (local_action_func != NULL)
>  		local_action_func(local_func_arg);
>
> +	generation = atomic_load_acq_int(&smp_rv_generation);
> +
>  	/* spin on exit rendezvous */
>  	atomic_add_int(&smp_rv_waiters[2], 1);
>  	if (local_teardown_func == smp_no_rendevous_barrier)
>  		return;
> -	while (smp_rv_waiters[2] < smp_rv_ncpus)
> +	while (smp_rv_waiters[2] < smp_rv_ncpus &&
> +	    generation == smp_rv_generation)
>  		cpu_spinwait();
>
>  	/*
> @@ -565,6 +570,8 @@
>  	/* obtain rendezvous lock */
>  	mtx_lock_spin(&smp_ipi_mtx);
>
> +	atomic_add_acq_int(&smp_rv_generation, 1);
> +
>  	/* set static function pointers */
>  	smp_rv_ncpus = ncpus;
>  	smp_rv_setup_func = setup_func;
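For anyone who wants to convince themselves that the generation check
closes the window described above, here is a stripped-down user-space
model of the exit rendezvous.  This is my own sketch using C11 atomics;
rv_exit_count, rv_generation and rv_ncpus are made-up names standing in
for smp_rv_waiters[2], smp_rv_generation and smp_rv_ncpus, and it is
not the kernel code:

#include <stdatomic.h>

static atomic_int rv_exit_count;	/* role of smp_rv_waiters[2] */
static atomic_int rv_generation;	/* role of smp_rv_generation */
static int rv_ncpus = 3;		/* role of smp_rv_ncpus */

/* What every CPU does at the exit rendezvous. */
void
rv_exit_wait(void)
{
	/* Snapshot the generation before announcing completion. */
	int gen = atomic_load_explicit(&rv_generation, memory_order_acquire);

	atomic_fetch_add_explicit(&rv_exit_count, 1, memory_order_acq_rel);

	/*
	 * Without the generation test, a CPU that never observes the
	 * final increment (because the initiator has already reset
	 * rv_exit_count to 0 for the next rendezvous) spins here
	 * forever.  With it, the next rendezvous bumps rv_generation,
	 * the comparison fails and the stale waiter falls through.
	 */
	while (atomic_load_explicit(&rv_exit_count, memory_order_acquire) <
	    rv_ncpus &&
	    gen == atomic_load_explicit(&rv_generation, memory_order_acquire))
		;	/* cpu_spinwait() in the kernel */
}

/* What the initiator does before publishing the next rendezvous. */
void
rv_start_next(void)
{
	atomic_fetch_add_explicit(&rv_generation, 1, memory_order_acq_rel);
	atomic_store_explicit(&rv_exit_count, 0, memory_order_release);
}

The point of using a monotonically increasing counter rather than a
flag is that a late waiter can distinguish "the count reached ncpus"
from "the count was reset by a newer rendezvous"; in the patch above
the bump happens right after smp_ipi_mtx is taken, so it is in place
before the next rendezvous is published.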