Date: Sun, 15 May 2011 11:13:48 +0300 From: Andriy Gapon <avg@FreeBSD.org> To: Max Laier <max@love2party.net> Cc: freebsd-current@FreeBSD.org, Peter Grehan <grehan@FreeBSD.org>, John Baldwin <jhb@FreeBSD.org> Subject: Re: proposed smp_rendezvous change Message-ID: <4DCF8B3C.2060008@FreeBSD.org> In-Reply-To: <201105150033.40788.max@love2party.net> References: <4DCD357D.6000109@FreeBSD.org> <4DCE9EF0.3050803@FreeBSD.org> <201105150033.40788.max@love2party.net>
next in thread | previous in thread | raw e-mail | index | archive | help
on 15/05/2011 07:33 Max Laier said the following: > On Saturday 14 May 2011 11:25:36 John Baldwin wrote: >> On 5/13/11 9:43 AM, Andriy Gapon wrote: >>> This is a change in vein of what I've been doing in the xcpu branch and >>> it's supposed to fix the issue by the recent commit that (probably >>> unintentionally) stress-tests smp_rendezvous in TSC code. >>> >>> Non-essential changes: >>> - ditch initial, and in my opinion useless, pre-setup rendezvous in >>> smp_rendezvous_action() >> >> As long as IPIs ensure all data is up to date (I think this is certainly >> true on x86) that is fine. Presumably sending an IPI has an implicit >> store barrier on all other platforms as well? > > Note that if the barrier is required on any platform, then we have to move all > the initializations after the _acq, otherwise it is pointless indeed. Yeah, I still would like to get an explanation why that synchronization is needed. I can understand "better safe than sorry" mentality, I often stick to it myself. But for the benefit of improving objective knowledge it would be useful to understand why that spin-waiting could it useful, especially given your observation about where the assignment are made. >>> Essential changes (the fix): >>> - re-use freed smp_rv_waiters[2] to indicate that a slave/target is >>> really fully done with rendezvous (i.e. it's not going to access any >>> members of smp_rv_* pseudo-structure) >>> - spin on smp_rv_waiters[2] upon _entry_ to smp_rendezvous_cpus() to not >>> re-use the smp_rv_* pseudo-structure too early >> >> Hmmm, so this is not actually sufficient. NetApp ran into a very >> similar race with virtual CPUs in BHyVe. In their case because virtual >> CPUs are threads that can be preempted, they have a chance at a longer >> race. >> >> The problem that they see is that even though the values have been >> updated, the next CPU to start a rendezvous can clear smp_rv_waiters[2] >> to zero before one of the other CPUs notices that it has finished. >> >> A more concrete example: >> >> Suppose you have 3 CPUs doing a rendezvous where CPU 1 is the initiator. >> >> All three CPUs run the rendezvous down to the point of increasing >> smp_rv_waiters[2]. CPU 1 is the last one to rendezvous for some reason >> so he notices smp_rv_waiters[2] being correct first and exits the >> rendezvous. It immediately does a new rendezvous. Even with your >> change, it will see that smp_rv_waiters[2] is correct and will proceed >> to clear it before starting the next rendezvous. >> >> Meanwhile one of the other CPUs never sees the final update by CPU 1 to >> smp_rv_waiters[2], instead it sees the value go from 2 to 0 (e.g. if CPU >> 1's two writes were coalesced, or in the case of the hypervisor, CPU 2 >> or 3's backing thread is preempted and both writes have posted before >> the thread backing CPU 2 or 3 gets to run again). >> >> At this point that CPU (call it CPU 2) will spin forever. When CPU 1 >> sends a second rendezvous IPI it will be held in the local APIC of CPU 2 >> because CPU 2 hasn't EOI'd the first IPI, and so CPU 2 will never bump >> smp_rv_waiters[2] and the entire system will deadlock. >> >> NetApp's solution is to add a monotonically increasing generation count >> to the rendezvous data set, which is cached in the rendezvous handler >> and to exit the last synchronization point if either smp_rv_waiters[2] >> is high enough, or the generation count has changed. >> >> I think this would also handle the case the TSC changes have provoked. >> I'm not sure if this would be sufficient for the error case Max Laier >> has encountered. > > It seems to address the issue, albeit a bit difficult to understand why this > is correct. I'd like to see some comments in the code instead of just the > commit log, but otherwise ... please go ahead with this. I think it works approximately the same as what I suggested. Only my patch forces the next master CPU to wait until all slave CPUs are fully done with the previous rendezvous (and ware of that fact), while this patch provides a "side-channel" to tell late slave CPUs that their rendezvous was completed when there is a new rendezvous started by a new master CPU. >> NetApp's patch: >> >> Extra protection for consecutive smp_rendezvous() calls. >> >> We need this because interrupts are not really disabled when >> executing the smp_rendezvous_action() when running inside a >> virtual machine. >> >> In particular it is possible for the last cpu to increment >> smp_rv_waiters[2] so that the exit rendezvous condition is >> satisfied and then get interrupted by a hardware interrupt. >> >> When the execution of the interrupted vcpu continues it is >> possible that one of the other vcpus that did *not* get >> interrupted exited the old smp_rendezvous() and started a >> new smp_rendezvous(). In this case 'smp_rv_waiters[2]' is >> again reset to 0. This would mean that the vcpu that got >> interrupted will spin forever on the exit rendezvous. >> >> We protect this by spinning on the exit rendezvous only if >> the generation of the smp_rendezvous() matches what we >> started out with before incrementing 'smp_rv_waiters[2]'. >> >> Differences ... >> >> ==== //private/xxxx/sys/kern/subr_smp.c#3 (text) ==== >> >> @@ -127,6 +127,7 @@ >> static void (*volatile smp_rv_teardown_func)(void *arg); >> static void * volatile smp_rv_func_arg; >> static volatile int smp_rv_waiters[3]; >> +static volatile int smp_rv_generation; >> >> /* >> * Shared mutex to restrict busywaits between smp_rendezvous() and >> @@ -418,6 +419,7 @@ >> void >> smp_rendezvous_action(void) >> { >> + int generation; >> void* local_func_arg = smp_rv_func_arg; >> void (*local_setup_func)(void*) = smp_rv_setup_func; >> void (*local_action_func)(void*) = smp_rv_action_func; >> @@ -457,11 +459,14 @@ >> if (local_action_func != NULL) >> local_action_func(local_func_arg); >> >> + generation = atomic_load_acq_int(&smp_rv_generation); >> + >> /* spin on exit rendezvous */ >> atomic_add_int(&smp_rv_waiters[2], 1); >> if (local_teardown_func == smp_no_rendevous_barrier) >> return; >> - while (smp_rv_waiters[2] < smp_rv_ncpus) >> + while (smp_rv_waiters[2] < smp_rv_ncpus && >> + generation == smp_rv_generation) >> cpu_spinwait(); >> >> /* >> @@ -565,6 +570,8 @@ >> /* obtain rendezvous lock */ >> mtx_lock_spin(&smp_ipi_mtx); >> >> + atomic_add_acq_int(&smp_rv_generation, 1); >> + >> /* set static function pointers */ >> smp_rv_ncpus = ncpus; >> smp_rv_setup_func = setup_func; -- Andriy Gapon
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4DCF8B3C.2060008>