Date: Wed, 6 Mar 2019 19:20:03 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Mark Millard <marklmi@yahoo.com>, freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
    FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID: <20190306172003.GD2492@kib.kiev.ua>
In-Reply-To: <20190305031010.I4610@besplex.bde.org>
References: <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org>
    <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org>
    <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org>
    <20190303161635.GJ68879@kib.kiev.ua> <20190304043416.V5640@besplex.bde.org>
    <20190304114150.GM68879@kib.kiev.ua> <20190305031010.I4610@besplex.bde.org>
On Tue, Mar 05, 2019 at 05:17:14AM +1100, Bruce Evans wrote:
> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
>
> > On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
> >> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> >>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>>>>>
> >>>>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>> * ...
> >>>> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> >>>> step.  i386 used to be faster here -- the first masking step of discarding
> >>>> %edx doesn't take any code.  amd64 has to mask out the top bits in %rax.
> >>>> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> >>>> has to do a not so slow shr.
> >>> i386 cannot discard %edx after RDTSC since some bits from %edx come into
> >>> the timecounter value.
> >>
> >> These bits are part of the tsc-low pessimization.  The shift count should
> >> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
> >> sometimes.
> >>
> >> When tsc-low was new, the shift count was often larger (as much as 8),
> >> and it is still changeable by a read-only tunable, but now it is 1 in
> >> almost all cases.  The code only limits the timecounter frequency
> >> to UINT_MAX, except the tunable defaults to 1 so average CPUs running
> >> at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
> >> this UINT_MAX doesn't match the code.  The comment says int, but the
> >> code says UINT.
> >>
> >> All that a shift count of 1 does is waste time to lose 1 bit of accuracy.
> >> This much accuracy is noise for most purposes.
> >>
> >> The tunable is fairly undocumented.  Its description is "Shift to apply
> >> for the maximum TSC frequency".  Of course, it has no effect on the TSC
> >> frequency.  It only affects the TSC timecounter frequency.
> > I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> > Otherwise, I think, some multi-socket machines would start showing the
> > detectable backward-counting bintime().  At frequencies of 4GHz and
> > above (Intel has 5GHz part numbers) I do not think that stability of
> > the 100MHz crystal and on-board traces is enough to avoid that.
>
> I think it is just a kludge that reduced the problem before it was fixed
> properly using fences.
>
> Cross-socket latency is over 100 cycles according to jhb's tscskew
> benchmark: on Haswell 4x2:
>
> CPU | TSC skew (min/avg/max/stddev)
> ----+------------------------------
>   0 |     0     0     0    0.000
>   1 |    24    49    84   14.353
>   2 |   164   243   308   47.811
>   3 |   164   238   312   47.242
>   4 |   168   242   332   49.593
>   5 |   168   243   324   48.722
>   6 |   172   242   320   52.596
>   7 |   172   240   316   53.014
>
> freefall is similar.  Latency is apparently measured relative to CPU 0.
> It is much lower to CPU 1 since that is on the same core.
>
> I played with this program a lot 3 and a half years ago, but forgot
> most of what I learned :-(.  I tried different fencing in it.  This
> seems to make little difference when the program is rerun.  With the
> default TESTS = 1024, the min skew sometimes goes negative on freefall,
> but with TESTS = 1024000 that doesn't happen.  This is the opposite
> of what I would expect.  freefall has load average about 1.
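As a point of reference for the fencing and tsc-low discussion above, this is
roughly what the read being talked about looks like as a standalone userspace
sketch (illustration only, not the kernel's tsc_get_timecount_low(); the shift
of 1 is just the default of the tunable mentioned above):

#include <stdint.h>
#include <x86intrin.h>

static unsigned int tsc_shift = 1;	/* default of the tsc-low shift tunable */

static uint32_t
tsc_low_read(void)
{

	_mm_lfence();			/* serializes rdtsc on this CPU only */
	return ((uint32_t)(__rdtsc() >> tsc_shift));
}

The lfence only orders rdtsc against earlier loads on the executing CPU;
ordering between CPUs still has to come from the generation-count protocol
discussed below.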
>
> Removing the only fencing in it reduces average latency by 10-20 cycles
> and minimum latency by over 100 cycles, except on freefall it is
> reduced from 33 to 6.  On Haswell it is 24 with fencing and I didn't
> test it with no fencing.
>
> I think tscskew doesn't really measure tsc skew.  What it measures is
> the time taken for a locking protocol, using the TSCs on different
> CPUs to make the start and end timestamps.  If the TSCs have a lot of
> skew or jitter, then this will show up indirectly as inconsistent and
> possibly negative differences.
>
> A shift of just 1 can't hide latencies of hundreds of cycles on single-
> socket machines.  Even a shift of 8 only works sometimes, by reducing
> the chance of observing the TSC going backwards by a factor of 256.
> E.g., assume for simplicity that all instructions and IPCs take 0-1
> cycles, and that unfenced rdtsc's differ by at most +-5 cycles (with
> the 11 values between -5 and 5 uniformly distributed).  Then with a
> shift of 0 and no fences, a CPU that updates the timehands is ahead of
> another CPU that spins reading the timehands about 5/11 of the time.
> With a shift of 8, the CPUs are close enough when the first one reads
> at least 5 above and at least 5 below a 256-boundary.  The chance of
> seeing a negative difference is reduced by at least a factor of 10/256.
>
> > I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> > Otherwise, I think, some multi-socket machines would start showing the
> > detectable backward-counting bintime().  At frequencies of 4GHz and
> > above (Intel has 5GHz part numbers) I do not think that stability of
> > the 100MHz crystal and on-board traces is enough to avoid that.
>
> Why would losing just 1 bit fix that?
>
> Fences for rdtsc of course only serialize it for the CPU that runs it.
> The locking (ordering) protocol (for the generation count) orders the
> CPUs too.  It takes longer than we would like, much more than the 1-
> cycle error that might be hidden by ignoring the low bit.  Surely the
> ordering protocol must work across sockets?  It then gives ordering of
> rdtsc's.
>
> TSC-low was added in 2011.  That was long before the ordering was fixed.
> You added fences in 2012 and memory ordering for the generation count in
> 2016.  Fences slowed everything down by 10-20+ cycles and probably hide
> bugs in the memory ordering better than TSC-low.  Memory ordering plus
> fences slow down the cross-core case by more than 100 cycles according
> to tscskew.  That is enough to hide large hardware bugs.
>
> > We can try to set the tsc-low shift count to 0 (but keep lfence) and see
> > what is going on in HEAD, but I am afraid that the HEAD users population
> > is not representative enough to catch the issue with certainty.
> > More, it is unclear to me how to diagnose the cause, e.g. I would expect
> > the sleeps to hang on timeouts, as was reported from the very beginning
> > of this thread.  How would we root-cause it?
>
> Negative time differences cause lots of overflows and so break the timecounter.
> The fix under discussion actually gives larger overflows in the positive
> direction.  E.g., a delta of -1 first overflows to 0xffffffff.  The fix
> prevents overflow on multiplication by that.  When the timecounter
> frequency is small, say 1 MHz, 0xffffffff means 4294 seconds, so the
> timecounter advances by that.
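To make the overflow arithmetic in the last paragraph concrete, a tiny
standalone sketch (hypothetical values, not kernel code): the delta is an
unsigned 32-bit quantity, so a counter that steps back by 1 shows up as
0xffffffff, which at a 1 MHz timecounter frequency corresponds to about
4294 seconds:

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint32_t prev = 1000, cur = 999;	/* counter appears to step back by 1 */
	uint32_t delta = cur - prev;		/* wraps to 0xffffffff */
	uint64_t freq = 1000000;		/* 1 MHz timecounter */

	/* The bogus advance applied to the clock, in whole seconds. */
	printf("delta %#x -> about %ju s\n", delta, (uintmax_t)(delta / freq));
	return (0);
}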
>
> >>> amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
> >>> since the whole shrdl calculation occurs in 32bit registers, and the result
> >>> is in %rax where top word is cleared by shrdl instruction automatically.
> >>> But the clearing is not required since result is unsigned int anyway.
> >>>
> >>> Disassembly of tsc_get_timecount_low() is very clear:
> >>>    0xffffffff806767e4 <+4>:     mov    0x30(%rdi),%ecx
> >>>    0xffffffff806767e7 <+7>:     rdtsc
> >>>    0xffffffff806767e9 <+9>:     shrd   %cl,%edx,%eax
> >>> ...
> >>>    0xffffffff806767ed <+13>:    retq
> >>> (I removed frame manipulations).
>
> I checked that all compilers still produce horrible code for the better
> source code 'return (rdtsc() << (intptr_t)tc->tc_priv);'.  64-bit shifts
> are apparently pessimal for compatibility.  The above is written mostly
> in asm to avoid 2-5 extra instructions.
>
> >>>> ...
> >>>> Similarly in bintime().
> >>> I merged two functions, finally.  Having to copy the same code is too
> >>> annoying for this change.
>
> I strongly dislike the merge.
>
> >>> So I verified that:
> >>> - there is no 64bit multiplication in the generated code, for i386 both
> >>>   for clang 7.0 and gcc 8.3;
> >>> - that everything is inlined, the only call from bintime/binuptime is
> >>>   the indirect call to get the timecounter value.
> >>
> >> I will have to fix it for compilers that I use.
> > Ok, I will add __inline.
>
> That will make it fast enough, but still hard to read.
>
> >>> +	*bt = *bts;
> >>> +	scale = th->th_scale;
> >>> +	delta = tc_delta(th);
> >>> +#ifdef _LP64
> >>> +	if (__predict_false(th->th_large_delta <= delta)) {
> >>> +		/* Avoid overflow for scale * delta. */
> >>> +		bintime_helper(bt, scale, delta);
> >>> +		bintime_addx(bt, (scale & 0xffffffff) * delta);
> >>> +	} else {
> >>> +		bintime_addx(bt, scale * delta);
> >>> +	}
> >>> +#else
> >>> +	/*
> >>> +	 * Use bintime_helper() unconditionally, since the fast
> >>> +	 * path in the above method is not so fast here, since
> >>> +	 * the 64 x 32 -> 64 bit multiplication is usually not
> >>> +	 * available in hardware and emulating it using 2
> >>> +	 * 32 x 32 -> 64 bit multiplications uses code much
> >>> +	 * like that in bintime_helper().
> >>> +	 */
> >>> +	bintime_helper(bt, scale, delta);
> >>> +	bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> >>> +#endif
> >>
> >> Check that this method is really better.  Without this, the complicated
> >> part is about half as large and duplicating it is smaller than this
> >> version.
> > Better in what sense?  I am fine with the C code, and asm code looks
> > good.
>
> Better in terms of actually running significantly faster.  I fear the
> 32-bit method is actually slightly slower for the fast path.
>
> >>> -	do {
> >>> -		th = timehands;
> >>> -		gen = atomic_load_acq_int(&th->th_generation);
> >>> -		*bt = th->th_bintime;
> >>> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> >>> -		atomic_thread_fence_acq();
> >>> -	} while (gen == 0 || gen != th->th_generation);
> >>
> >> Duplicating this loop is much better than obfuscating it using inline
> >> functions.  This loop was almost duplicated (except for the delta
> >> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
> >> 8 fflock ones).  Now it is only duplicated 16 times.
> > How did you count the 16?  I can see only 4 instances in the unpatched
> > kern_tc.c, and 3 in patched, but it is 3 and not 1 only because I do not
> > touch ffclock until the patch is finalized.
> > After that, it would be
> > 1 instance for kernel and 1 for userspace.
>
> Grep for the end condition in this loop.  There are actually 20 of these.
> I'm counting the loops and not the previously-simple scaling operation in
> it.  The scaling is indeed only done for 4 cases.  I prefer the 20
> duplications (except I only want about 6 of the functions).  Duplication
> works even better for only 4 cases.
Ok, I merged these as well.  Now there are only four loops left in the
kernel.  I do not think that merging them is beneficial, since they have
sufficiently different bodies.  I disagree with your characterization of
it as obfuscation; IMO it improves the maintainability of the code by
reducing the number of places which need careful inspection of the
lock-less algorithm.

>
> This should be written as a function call to 1 new function to replace
> the line with the overflowing multiplication.  The line is always the
> same, so the new function call can look like bintime_xxx(bt, th).
Again, please provide at least a pseudocode of your preference.  The
current patch has become too large already; I want to test/commit what I
already have, and I will need to split it for the commit.

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..7114a0e5219 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
 	struct timecounter	*th_counter;
 	int64_t			th_adjustment;
 	uint64_t		th_scale;
+	uint64_t		th_large_delta;
 	u_int			th_offset_count;
 	struct bintime		th_offset;
 	struct bintime		th_bintime;
@@ -200,22 +201,77 @@ tc_delta(struct timehands *th)
  * the comment in <sys/time.h> for a description of these 12 functions.
  */
 
-#ifdef FFCLOCK
-void
-fbclock_binuptime(struct bintime *bt)
+static __inline void
+bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
+{
+	uint64_t x;
+
+	x = (scale >> 32) * delta;
+	bt->sec += x >> 32;
+	bintime_addx(bt, x << 32);
+}
+
+static __inline void
+binnouptime(struct bintime *bt, u_int off)
 {
 	struct timehands *th;
-	unsigned int gen;
+	struct bintime *bts;
+	uint64_t scale;
+	u_int delta, gen;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		bts = (struct bintime *)(vm_offset_t)th + off;
+		*bt = *bts;
+		scale = th->th_scale;
+		delta = tc_delta(th);
+#ifdef _LP64
+		if (__predict_false(th->th_large_delta <= delta)) {
+			/* Avoid overflow for scale * delta. */
+			bintime_helper(bt, scale, delta);
+			bintime_addx(bt, (scale & 0xffffffff) * delta);
+		} else {
+			bintime_addx(bt, scale * delta);
+		}
+#else
+		/*
+		 * Use bintime_helper() unconditionally, since the fast
+		 * path in the above method is not so fast here, since
+		 * the 64 x 32 -> 64 bit multiplication is usually not
+		 * available in hardware and emulating it using 2
+		 * 32 x 32 -> 64 bit multiplications uses code much
+		 * like that in bintime_helper().
+		 */
+		bintime_helper(bt, scale, delta);
+		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 }
 
+static __inline void
+getbinnouptime(void *out, size_t out_size, u_int off)
+{
+	struct timehands *th;
+	u_int gen;
+
+	do {
+		th = timehands;
+		gen = atomic_load_acq_int(&th->th_generation);
+		memcpy(out, (char *)th + off, out_size);
+		atomic_thread_fence_acq();
+	} while (gen == 0 || gen != th->th_generation);
+}
+
+#ifdef FFCLOCK
+void
+fbclock_binuptime(struct bintime *bt)
+{
+
+	binnouptime(bt, __offsetof(struct timehands, th_offset));
+}
+
 void
 fbclock_nanouptime(struct timespec *tsp)
 {
@@ -237,16 +293,8 @@ fbclock_microuptime(struct timeval *tvp)
 void
 fbclock_bintime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	binnouptime(bt, __offsetof(struct timehands, th_bintime));
 }
 
 void
@@ -270,100 +318,61 @@ fbclock_microtime(struct timeval *tvp)
 void
 fbclock_getbinuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_offset));
 }
 
 void
 fbclock_getnanouptime(struct timespec *tsp)
 {
-	struct timehands *th;
-	unsigned int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timespec(&th->th_offset, tsp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timespec(&bt, tsp);
 }
 
 void
 fbclock_getmicrouptime(struct timeval *tvp)
 {
-	struct timehands *th;
-	unsigned int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timeval(&th->th_offset, tvp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timeval(&bt, tvp);
 }
 
 void
 fbclock_getbintime(struct bintime *bt)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_bintime));
 }
 
 void
 fbclock_getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+	    th_nanotime));
 }
 
 void
 fbclock_getmicrotime(struct timeval *tvp)
 {
-	struct timehands *th;
-	unsigned int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tvp = th->th_microtime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tvp, sizeof(*tvp), __offsetof(struct timehands,
+	    th_microtime));
 }
 #else /* !FFCLOCK */
+
 void
 binuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	binnouptime(bt, __offsetof(struct timehands, th_offset));
 }
 
 void
@@ -387,16 +396,8 @@ microuptime(struct timeval *tvp)
 void
 bintime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	binnouptime(bt, __offsetof(struct timehands, th_bintime));
 }
 
 void
@@ -420,85 +421,53 @@ microtime(struct timeval *tvp)
 void
 getbinuptime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_offset;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_offset));
 }
 
 void
 getnanouptime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timespec(&th->th_offset, tsp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timespec(&bt, tsp);
 }
 
 void
 getmicrouptime(struct timeval *tvp)
 {
-	struct timehands *th;
-	u_int gen;
+	struct bintime bt;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		bintime2timeval(&th->th_offset, tvp);
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(&bt, sizeof(bt), __offsetof(struct timehands,
+	    th_offset));
+	bintime2timeval(&bt, tvp);
 }
 
 void
 getbintime(struct bintime *bt)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*bt = th->th_bintime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(bt, sizeof(*bt), __offsetof(struct timehands,
+	    th_bintime));
 }
 
 void
 getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+	    th_nanotime));
 }
 
 void
 getmicrotime(struct timeval *tvp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tvp = th->th_microtime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tvp, sizeof(*tvp), __offsetof(struct timehands,
+	    th_microtime));
 }
 #endif /* FFCLOCK */
 
@@ -514,15 +483,9 @@ getboottime(struct timeval *boottime)
 void
 getboottimebin(struct bintime *boottimebin)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*boottimebin = th->th_boottime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(boottimebin, sizeof(*boottimebin),
+	    __offsetof(struct timehands, th_boottime));
 }
 
 #ifdef FFCLOCK
@@ -1038,15 +1001,9 @@ getmicrotime(struct timeval *tvp)
 void
 dtrace_getnanotime(struct timespec *tsp)
 {
-	struct timehands *th;
-	u_int gen;
 
-	do {
-		th = timehands;
-		gen = atomic_load_acq_int(&th->th_generation);
-		*tsp = th->th_nanotime;
-		atomic_thread_fence_acq();
-	} while (gen == 0 || gen != th->th_generation);
+	getbinnouptime(tsp, sizeof(*tsp), __offsetof(struct timehands,
+	    th_nanotime));
 }
 
 /*
@@ -1464,6 +1421,7 @@ tc_windup(struct bintime *new_boottimebin)
 	scale += (th->th_adjustment / 1024) * 2199;
 	scale /= th->th_counter->tc_frequency;
 	th->th_scale = scale * 2;
+	th->th_large_delta = ((uint64_t)1 << 63) / scale;
 
 	/*
 	 * Now that the struct timehands is again consistent, set the new
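For readers following the patch, a standalone sketch of the arithmetic that
bintime_helper() and th_large_delta implement above (stand-in struct and
names, userspace only, so it is illustrative rather than the committed code):
the 64 x 32 bit product scale * delta is split so that each partial product
fits in 64 bits, and the whole seconds carried out of the high half are added
to bt->sec directly.

#include <stdint.h>

struct mybintime {			/* stand-in for struct bintime */
	int64_t	sec;
	uint64_t frac;
};

static void
my_addx(struct mybintime *bt, uint64_t x)
{
	uint64_t old;

	old = bt->frac;
	bt->frac += x;
	if (old > bt->frac)		/* carry out of the 64-bit fraction */
		bt->sec++;
}

/*
 * large_delta plays the role of th_large_delta above: the smallest delta
 * for which scale * delta no longer fits in 64 bits.
 */
static void
scale_and_add(struct mybintime *bt, uint64_t scale, uint32_t delta,
    uint64_t large_delta)
{
	uint64_t x;

	if (delta >= large_delta) {
		x = (scale >> 32) * delta;	/* high partial product */
		bt->sec += x >> 32;		/* whole seconds */
		my_addx(bt, x << 32);		/* rest of the high product */
		my_addx(bt, (scale & 0xffffffff) * delta); /* low product */
	} else {
		my_addx(bt, scale * delta);	/* fast path, no overflow */
	}
}

The fast path stays a single 64 x 64 -> 64 bit multiply, which is what the
_LP64 branch in the patch preserves; the slow path only runs when the
timehands have not been wound for an unusually long time.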