From owner-freebsd-current@FreeBSD.ORG Sat May 14 15:25:38 2011 Return-Path: Delivered-To: freebsd-current@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 50364106566C; Sat, 14 May 2011 15:25:38 +0000 (UTC) (envelope-from jhb@FreeBSD.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 2779F8FC15; Sat, 14 May 2011 15:25:38 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id B659046B55; Sat, 14 May 2011 11:25:37 -0400 (EDT) Received: from John-Baldwins-Macbook-Pro.local (unknown [192.75.139.253]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 153828A04F; Sat, 14 May 2011 11:25:37 -0400 (EDT) Message-ID: <4DCE9EF0.3050803@FreeBSD.org> Date: Sat, 14 May 2011 11:25:36 -0400 From: John Baldwin User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.17) Gecko/20110414 Thunderbird/3.1.10 MIME-Version: 1.0 To: Andriy Gapon References: <4DCD357D.6000109@FreeBSD.org> In-Reply-To: <4DCD357D.6000109@FreeBSD.org> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Sat, 14 May 2011 11:25:37 -0400 (EDT) Cc: FreeBSD current , Peter Grehan Subject: Re: proposed smp_rendezvous change X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sat, 14 May 2011 15:25:38 -0000 On 5/13/11 9:43 AM, Andriy Gapon wrote: > > This is a change in vein of what I've been doing in the xcpu branch and it's > supposed to fix the issue by the recent commit that (probably unintentionally) > stress-tests smp_rendezvous in TSC code. > > Non-essential changes: > - ditch initial, and in my opinion useless, pre-setup rendezvous in > smp_rendezvous_action() As long as IPIs ensure all data is up to date (I think this is certainly true on x86) that is fine. Presumably sending an IPI has an implicit store barrier on all other platforms as well? > Essential changes (the fix): > - re-use freed smp_rv_waiters[2] to indicate that a slave/target is really fully > done with rendezvous (i.e. it's not going to access any members of smp_rv_* > pseudo-structure) > - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not re-use > the smp_rv_* pseudo-structure too early Hmmm, so this is not actually sufficient. NetApp ran into a very similar race with virtual CPUs in BHyVe. In their case because virtual CPUs are threads that can be preempted, they have a chance at a longer race. The problem that they see is that even though the values have been updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] to zero before one of the other CPUs notices that it has finished. A more concrete example: Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator. All three CPUs run the rendezvous down to the point of increasing smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason so he notices smp_rv_waiters[2] being correct first and exits the rendezvous. It immediately does a new rendezvous. Even with your change, it will see that smp_rv_waiters[2] is correct and will proceed to clear it before starting the next rendezvous. Meanwhile one of the other CPUs never sees the final update by CPU 1 to smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 or 3's backing thread is preempted and both writes have posted before the thread backing CPU 2 or 3 gets to run again). At this point that CPU (call it CPU 2) will spin forever. When CPU 1 sends a second rendezvous IPI it will be held in the local APIC of CPU 2 because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump smp_rv_waiters[2] and the entire system will deadlock. NetApp's solution is to add a monotonically increasing generation count to the rendezvous data set, which is cached in the rendezvous handler and to exit the last synchronization point if either smp_rv_waiters[2] is high enough, or the generation count has changed. I think this would also handle the case the TSC changes have provoked. I'm not sure if this would be sufficient for the error case Max Laier has encountered. NetApp's patch: Extra protection for consecutive smp_rendezvous() calls. We need this because interrupts are not really disabled when executing the smp_rendezvous_action() when running inside a virtual machine. In particular it is possible for the last cpu to increment smp_rv_waiters[2] so that the exit rendezvous condition is satisfied and then get interrupted by a hardware interrupt. When the execution of the interrupted vcpu continues it is possible that one of the other vcpus that did *not* get interrupted exited the old smp_rendezvous() and started a new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is again reset to 0. This would mean that the vcpu that got interrupted will spin forever on the exit rendezvous. We protect this by spinning on the exit rendezvous only if the generation of the smp_rendezvous() matches what we started out with before incrementing 'smp_rv_waiters[2]'. Differences ... ==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ==== @@ -127,6 +127,7 @@ static void (*volatile smp_rv_teardown_func)(void *arg); static void * volatile smp_rv_func_arg; static volatile int smp_rv_waiters[3]; +static volatile int smp_rv_generation; /* * Shared mutex to restrict busywaits between smp_rendezvous() and @@ -418,6 +419,7 @@ void smp_rendezvous_action(void) { + int generation; void* local_func_arg = smp_rv_func_arg; void (*local_setup_func)(void*) = smp_rv_setup_func; void (*local_action_func)(void*) = smp_rv_action_func; @@ -457,11 +459,14 @@ if (local_action_func != NULL) local_action_func(local_func_arg); + generation = atomic_load_acq_int(&smp_rv_generation); + /* spin on exit rendezvous */ atomic_add_int(&smp_rv_waiters[2], 1); if (local_teardown_func == smp_no_rendevous_barrier) return; - while (smp_rv_waiters[2] < smp_rv_ncpus) + while (smp_rv_waiters[2] < smp_rv_ncpus && + generation == smp_rv_generation) cpu_spinwait(); /* @@ -565,6 +570,8 @@ /* obtain rendezvous lock */ mtx_lock_spin(&smp_ipi_mtx); + atomic_add_acq_int(&smp_rv_generation, 1); + /* set static function pointers */ smp_rv_ncpus = ncpus; smp_rv_setup_func = setup_func; -- John Baldwin