Date: Wed, 10 Jun 2015 01:53:04 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kib@freebsd.org>
Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r284178 - head/sys/kern
Message-ID: <20150609233811.L2610@besplex.bde.org>
In-Reply-To: <201506091149.t59BnulP034734@svn.freebsd.org>
References: <201506091149.t59BnulP034734@svn.freebsd.org>
On Tue, 9 Jun 2015, Konstantin Belousov wrote:

> Log:
>   When updating/accessing the timehands, barriers are needed to ensure
>   that:
>   - th_generation update is visible after the parameters update is
>     visible;
>   - the read of parameters is not reordered before initial read of
>     th_generation.

This adds pessimizations that were intentionally left out.

>   On UP kernels, compiler barriers are enough. For SMP machines, CPU
>   barriers must be used too, as was confirmed by submitter by testing on
>   the Freescale T4240 platform with 24 PowerPC processors.

Some SMP machines need something. Hopefully not most, and never as
heavyweight as this.

> Modified: head/sys/kern/kern_tc.c
> ==============================================================================
> --- head/sys/kern/kern_tc.c        Tue Jun  9 11:41:37 2015        (r284177)
> +++ head/sys/kern/kern_tc.c        Tue Jun  9 11:49:56 2015        (r284178)
> @@ -34,6 +34,7 @@ __FBSDID("$FreeBSD$");
>  #include <sys/timetc.h>
>  #include <sys/timex.h>
>  #include <sys/vdso.h>
> +#include <machine/atomic.h>

Style bug. <machine/atomic.h> is standard pollution in <sys/param.h>,
and this is relied on in some of the headers already included, e.g., in
<sys/mutex.h>.

Timecounters have much larger races than this. E.g., kern_ntptime.c is
supposed to be locked by splhigh(), but is actually locked by Giant in
its less critical parts and by nothing except the null splhigh() in its
most critical parts. It is called from fast interrupt handlers with
whatever the locking is. Ntptime knows nothing of this locking, so it
doesn't work. The most usual call is ntp_update_second() from the
hardclock fast interrupt handler. There are also calls from arbitrary
drivers with pps support. These call into kern_tc.c with whatever
locking the driver has. These have broken locking using the generation
count to protect the timecounter parts (see below), and null locking to
protect the ntptime parts (mainly hardpps()).

> @@ -189,6 +190,33 @@ tc_delta(struct timehands *th)
>              tc->tc_counter_mask);
>  }
>  
> +static u_int
> +tc_getgen(struct timehands *th)
> +{
> +
> +#ifdef SMP
> +        return (atomic_load_acq_int(&th->th_generation));
> +#else
> +        u_int gen;
> +
> +        gen = th->th_generation;
> +        __compiler_membar();
> +        return (gen);
> +#endif
> +}

This is an enormous pessimization unless it is automatically inlined.
Since it is called more than once, only clang and gcc -O3 automatically
inline it. A simple load operation from a pointer becomes a function
call and sometimes more.

> +
> +static void
> +tc_setgen(struct timehands *th, u_int newgen)
> +{
> +
> +#ifdef SMP
> +        atomic_store_rel_int(&th->th_generation, newgen);
> +#else
> +        __compiler_membar();
> +        th->th_generation = newgen;
> +#endif
> +}

Same pessimization for simple stores.

These pessimizations are especially good since the functions are called
in loops (though usually just twice). Calling them in loops is an
excuse for clang to inline them. But clang's inlining makes little
difference except to break debugging in most places in the kernel. I
avoid it and other misoptimizations using gcc -O1
-fno-inline-functions-called-once.

> @@ -1651,13 +1679,13 @@ pps_capture(struct pps_state *pps)
>  
>          KASSERT(pps != NULL, ("NULL pps pointer in pps_capture"));
>          th = timehands;
> -        pps->capgen = th->th_generation;
> +        pps->capgen = tc_getgen(th);

This is part of the broken locking for pps. The generation count is
fragile and only works if you check it before committing to a change
and retry if it changed. But the pps API doesn't support retrying.
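For contrast, the ordinary timecounter readers do follow that protocol.
This is roughly what binuptime() in kern_tc.c looks like after this
commit (a sketch from memory, with the FFCLOCK variant omitted; it
leans on the struct timehands, tc_getgen() and tc_delta() internals
quoted above):

void
binuptime(struct bintime *bt)
{
        struct timehands *th;
        u_int gen;

        do {
                /* Capture the current timehands and its generation. */
                th = timehands;
                gen = tc_getgen(th);
                /* Read the data that the generation protects. */
                *bt = th->th_offset;
                bintime_addx(bt, th->th_scale * tc_delta(th));
                /* Retry if a windup invalidated or recycled this slot. */
        } while (gen == 0 || gen != tc_getgen(th));
}

The check-and-retry loop makes a stale or recycled timehands harmless,
at the cost of rereading. pps_capture() can only do the first half of
this; the final check is deferred to pps_event(), and a failed check
there can only bail out, presumably because the captured pulse cannot
be replayed.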
>          pps->capth = th;
>  #ifdef FFCLOCK
>          pps->capffth = fftimehands;
>  #endif
>          pps->capcount = th->th_counter->tc_get_timecount(th->th_counter);
> -        if (pps->capgen != th->th_generation)
> +        if (pps->capgen != tc_getgen(th))
>                  pps->capgen = 0;

This check is in the middle of pps activity. Looping to retry would be
easy enough, but isn't done. Anyway, the race window is much larger
later.

>  }
>  
> @@ -1677,7 +1705,7 @@ pps_event(struct pps_state *pps, int eve
>  
>          KASSERT(pps != NULL, ("NULL pps pointer in pps_event"));
>          /* If the timecounter was wound up underneath us, bail out. */
> -        if (pps->capgen == 0 || pps->capgen != pps->capth->th_generation)
> +        if (pps->capgen == 0 || pps->capgen != tc_getgen(pps->capth))
>                  return;

Normal use of pps is pps_capture() followed by pps_event(). The above
is broken locking for the latter...

>  
>          /* Things would be easier with arrays. */
> @@ -1727,7 +1755,7 @@ pps_event(struct pps_state *pps, int eve
>          bintime2timespec(&bt, &ts);
>  
>          /* If the timecounter was wound up underneath us, bail out. */
> -        if (pps->capgen != pps->capth->th_generation)
> +        if (pps->capgen != tc_getgen(pps->capth))
>                  return;

We also check in the middle, and give up instead of retrying. Not too
bad (I think ntp can handle rare missed pps's).

>  
>          *pcount = pps->capcount;

But then we continue without further locking. Perhaps we did enough to
stabilize the timecounter while accessing it, but we don't do anything
extra to stabilize *pps.

pps stuff is basically locked by the generation count. This locking
(except for the ntptime parts) is more complicated and less broken than
might first appear. Note the comment:

        /* If the timecounter was wound up underneath us, bail out. */

I think this is correct, and it takes a full timecounter windup for the
generation count to change. That is, cycling through all 10 timehands.
With HZ = 100, that is 100 msec, and with HZ = 1000 it is 10 msec,
which is still a long time. It takes weird scheduling or stopping in a
debugger (which is a type of weird scheduling) for the timehands to
change more than once under the thread that has "captured" a generation
using the generation count. For pps_capture(), the thread doing the
capturing should be at least a slow interrupt handler, so it shouldn't
be delayed for long. Hmm, actually it is easy to think of a not very
weird configuration where it is delayed for that long:
- a slow UP system
- misconfigured with HZ = 1000 so the windup cycle takes only 10 msec
- pps capture called from a slow interrupt handler (like mine for the
  RTC)
- then the pps interrupt handler can be preempted by other interrupts,
  and it is easy for them to sometimes take more than 10 msec to
  complete.

Reading of the timecounter uses essentially a passive form of capture.
Since it is usually done in syscalls, the caller can easily be
preempted for 1 full timecounter windup time. Readers handle this by
retrying.

The large interval of 10 timehands cycles is supposed to make atomic
ops unnecessary, and even memory ordering unnecessary in practice. It
is only necessary for changes to the generation to not become visible
before changes to the data, and not too many generations after changes
to the data (even 8 or 9 generations late would work, but reduces the
effective number of generations to 2 or 1). The latter happens
automatically with locking on return from hardclock() if not before.
Perhaps this is intentional. But tc_windup() makes no attempt to make
the data visible before the generation count changes.
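To make that missing windup-side ordering concrete before walking
through what tc_windup() actually does: a minimal sketch, not the code
in kern_tc.c (the real function also recomputes th_scale/th_offset,
applies the ntp adjustment, handles timecounter changes, and the
10-slot ring is linked at init), showing the publication order that
would pair with the acquire-style reads this commit adds, using only
atomic(9) primitives that already exist.

#include <sys/param.h>          /* <machine/atomic.h> comes in via this,
                                   as noted above */

struct timehands {
        /* th_counter, th_offset, th_scale, ... trimmed for the sketch */
        u_int                   th_generation;
        struct timehands        *th_next;
};

static struct timehands ths[10];                /* ring; linking omitted */
static struct timehands *volatile timehands = &ths[0];

static void
tc_windup_sketch(void)
{
        struct timehands *th;
        u_int ogen;

        th = timehands->th_next;        /* oldest slot, normally unused */
        ogen = th->th_generation;
        th->th_generation = 0;          /* mark the slot as being rewritten */

        /* ... copy the current timehands and update the data ... */

        /*
         * Publish the new generation only after the data stores above;
         * the release store orders all earlier stores before it.
         * Never publish generation 0, which means "in flux".
         */
        if (++ogen == 0)
                ogen = 1;
        atomic_store_rel_int(&th->th_generation, ogen);

        /*
         * Go live with the new timehands. The pointer store must not
         * become visible before the data either, so it also needs
         * release semantics.
         */
        atomic_store_rel_ptr((volatile uintptr_t *)&timehands,
            (uintptr_t)th);
}

The th->th_generation = 0 store has the same problem in the other
direction: it would need to be ordered before the data stores for an
in-flight reader's final check to be reliable. The sequence the
committed tc_windup() actually performs, with the missing ordering
marked, is walked through below.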
tc_windup() basically does:

        th = oldest timehands                // normally nothing using this
        th->th_generation = 0                // missing ordering
        start hacking on th                  // missing ordering

This loses if something is using th, AND the memory ordering is not
program order, AND something keeps seeing the changed parts of th
before seeing the th_generation change.

        ...
        finish hacking on th
        th->th_generation = next generation  // missing ordering

This loses in the opposite way if the change to the generation becomes
visible before the data, AND something is using the oldest th, etc.

        timehands = th                       // missing ordering

This loses for the new current generation, if the change to timehands
becomes visible before the changes to *th (except if the visible change
to the generation is to 0 but not to the new generation). This race is
more likely to be lost than the others, since it doesn't take the
unlikely setup with the oldest timehands still active. If there is a
race window at all (due to out of order memory visibility for critical
parts of the changed timehands), then any thread that starts and
finishes reading the timecounter while its memory is disordered will
lose the race. This race window is just fairly short, since tc_windup()
soon returns and the interrupt handler does some atomic ops which sync
the order on most or all arches. Without that, the order would probably
be synced by unrelated memory activity before 10 timehands cycles, but
the current timehands is supposed to be synced already.

The change in this commit is sort of backwards. It pessimizes all
timecounter reads to work around the missing ordering in timecounter
windups. It should be arranged that even if a timecounter read sees a
stale timehands, the timehands remains valid for long enough. Surely
this is possible. Timecounter windups of course sync everything. Then
when a timecounter read checks the generation count, I think that on
some arches the sync in the windup means that the reader sees the up to
date value. Otherwise, the sync just has to occur within 10 cycles of
timehands updates. At worst, this can be done by an ordered read of all
timecounter memory once every 9 cycles of timehands updates. Except,
there would be problems if this read were delayed. So tc_windup()
should tell other CPUs to sync, and wait for them.

The other gross bugs in timecounter locking are mainly calls to
tc_windup() from clock_settime(CLOCK_REALTIME) and sysctl
kern.timecounter.hardware. These race with calls from hardclock().
hardclock()'s locking defends against interference from hardclock()
only (except in UP when hardclock() is a fast interrupt handler). It
isn't clear what happens, but at best tc_windup() will cycle through
the timehands at an unsupported rate. This is not a large problem since
it takes foot-shooting to activate. Ntp prefers to do micro-adjustments
and almost never calls clock_settime(). clock_settime() is rate-limited
to once per second at securelevel >= 2, but we now know that
securelevel was a mistake and there are better methods, so this
configuration is probably rarer now than ever.

Bruce