Date:        Mon, 4 Mar 2019 13:41:50 +0200
From:        Konstantin Belousov <kostikbel@gmail.com>
To:          Bruce Evans <brde@optusnet.com.au>
Cc:          Mark Millard <marklmi@yahoo.com>,
             freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
             FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:     Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID:  <20190304114150.GM68879@kib.kiev.ua>
In-Reply-To: <20190304043416.V5640@besplex.bde.org>
References:  <20190301194217.GB68879@kib.kiev.ua> <20190302071425.G5025@besplex.bde.org>
             <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org>
             <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org>
             <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org>
             <20190303161635.GJ68879@kib.kiev.ua> <20190304043416.V5640@besplex.bde.org>
On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>
> > On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> >> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>>>
> >>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> > * ...
> >>>> Yes, that was its point.  It is a bit annoying to have a hardware
> >>>> timecounter like the TSC that doesn't wrap naturally, but then make it
> >>>> wrap by masking high bits.
> >>>>
> >>>> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
> >>>> discard high bits at the register level, then another step to apply the
> >>>> mask to discard the high bits again.
> >>> rdtsc-low is implemented in the natural way: after RDTSC, no register
> >>> combining into a 64-bit value is done; instead shrd operates on %edx:%eax
> >>> to get the final result into %eax.  I am not sure what you refer to.
> >>
> >> I was referring mostly to the masking step '& tc->tc_counter_mask' and
> >> the lack of register combining in rdtsc().
> >>
> >> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> >> step.  i386 used to be faster here -- the first masking step of discarding
> >> %edx doesn't take any code.  amd64 has to mask out the top bits in %rax.
> >> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> >> has to do a not so slow shr.
> > i386 cannot discard %edx after RDTSC since some bits from %edx come into
> > the timecounter value.
>
> These bits are part of the tsc-low pessimization.  The shift count should
> always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
> sometimes.
>
> When tsc-low was new, the shift count was often larger (as much as 8),
> and it is still changeable by a read-only tunable, but now it is 1 in
> almost all cases.  The code only limits the timecounter frequency
> to UINT_MAX, except the tunable defaults to 1, so average CPUs running
> at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
> this UINT_MAX doesn't match the code.  The comment says int, but the
> code says UINT.
>
> All that a shift count of 1 does is waste time to lose 1 bit of accuracy.
> This much accuracy is noise for most purposes.
>
> The tunable is fairly undocumented.  Its description is "Shift to apply
> for the maximum TSC frequency".  Of course, it has no effect on the TSC
> frequency.  It only affects the TSC timecounter frequency.

I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
Otherwise, I think, some multi-socket machines would start showing
detectable backward-counting bintime().  At frequencies of 4GHz and above
(Intel has 5GHz part numbers) I do not think that the stability of the
100MHz crystal and the on-board traces is enough to avoid that.

We can try to set the tsc-low shift count to 0 (but keep lfence) and see
what is going on in HEAD, but I am afraid that the HEAD user population
is not representative enough to catch the issue with certainty.  Moreover,
it is unclear to me how to diagnose the cause; e.g. I would expect the
sleeps to hang on timeouts, as was reported from the very beginning of
this thread.  How would we root-cause it ?
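(As an aside, the masking step discussed above is the '& tc->tc_counter_mask'
applied to every timecounter read.  A minimal sketch of that helper, close to
what kern_tc.c has; the comment is added here for illustration:

	static __inline u_int
	tc_delta(struct timehands *th)
	{
		struct timecounter *tc;

		tc = th->th_counter;
		/*
		 * Ticks elapsed since tc_windup() sampled the counter
		 * into th_offset_count, wrapped to the counter width
		 * by tc_counter_mask.
		 */
		return ((tc->tc_get_timecount(tc) - th->th_offset_count) &
		    tc->tc_counter_mask);
	}

For a full-width counter such as the plain TSC the mask is all ones, which is
why the extra masking step is pure overhead there.)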
> The cputicker normally uses the TSC without even an lfence.  This use
> only has to be monotonic per-CPU, so this is OK.  Also, any bugs hidden
> by discarding low bits shouldn't show up per-CPU.  However, keeping
> the cputicker below 4G actually has some efficiency advantages.  For
> timecounters, there are no multiplications or divisions by the frequency
> in the fast path, but cputicker use isn't so optimized and it does a
> slow 64-bit division in cputick2usec().  Keeping cpu_tick_frequency
> below UINT_MAX allows dividing by it in integer arithmetic in some cases.
> This optimization is not done.
>
> > amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
> > since the whole shrdl calculation occurs in 32bit registers, and the result
> > is in %rax where the top word is cleared by the shrdl instruction
> > automatically.  But the clearing is not required since the result is
> > unsigned int anyway.
> >
> > Disassembly of tsc_get_timecount_low() is very clear:
> >    0xffffffff806767e4 <+4>:     mov    0x30(%rdi),%ecx
> >    0xffffffff806767e7 <+7>:     rdtsc
> >    0xffffffff806767e9 <+9>:     shrd   %cl,%edx,%eax
> > ...
> >    0xffffffff806767ed <+13>:    retq
> > (I removed frame manipulations).
>
> It would without the shift pessimization, since the function returns uint32_t
> but rdtsc() gives uint64_t.  Removing the top bits is not needed since
> tc_delta() removes them again, but the API doesn't allow expressing this.
>
> Without the shift pessimization, we just do rdtsc() in all cases and don't
> need this function call.  I think this is about 5-10 cycles faster after
> some parallelism.
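(For reference, the C source behind that disassembly looks roughly like the
sketch below.  This is only a sketch of x86's tsc.c: the exact asm constraint
list may differ, and the explicit %edx clobber is an assumption added here;
the shift count is the tsc-low shift stashed in tc_priv:

	static u_int
	tsc_get_timecount_low(struct timecounter *tc)
	{
		uint32_t rv;

		/*
		 * RDTSC leaves the counter in %edx:%eax; shrd combines the
		 * two halves and shifts right by the configured tsc-low
		 * shift count, loaded into %cl from tc->tc_priv.
		 */
		__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
		    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
		return (rv);
	}

With a shift count of 0 this whole function collapses to a plain rdtsc()
followed by truncation to 32 bits, which is the faster path discussed above.)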
> >>>> I prefer my way of writing this in 3 lines.  Modifying 'scale' for
> >>>> the next step is especially ugly and pessimal when the next step is
> >>>> in the caller and this function is not inlined.
> >>> Can you show exactly what you want ?
> >>
> >> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
> >> and don't pass 'scale' indirectly to bintime_helper() and don't modify
> >> it there.
> >>
> >> Oops, there is a problem.  'scale' must be reduced iff bintime_helper()
> >> was used.  Duplicate some source code so as to not need a fall-through
> >> to the fast path.  See below.
> > Yes, this is the reason why it is passed by pointer (C has no references).
>
> The indirection is slow no matter how it is spelled, unless it is inlined
> away.
>
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 2656fb4d22f..6c41ab22288 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> @@ -72,6 +71,7 @@ struct timehands {
> >>>  	struct timecounter	*th_counter;
> >>>  	int64_t			th_adjustment;
> >>>  	uint64_t		th_scale;
> >>> +	uint64_t		th_large_delta;
> >>>  	u_int			th_offset_count;
> >>>  	struct bintime		th_offset;
> >>>  	struct bintime		th_bintime;
> >>> @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
> >>>  	} while (gen == 0 || gen != th->th_generation);
> >>>  }
> >>>  #else /* !FFCLOCK */
> >>> +
> >>> +static void
> >>
> >> Add __inline.  This is in the fast path for 32-bit systems.
> > Compilers do not need this hand-holding, and I prefer to avoid __inline
> > unless really necessary.  I checked with both clang 7.0 and gcc 8.3
> > that auto-inlining did occur.
>
> But they do.  I don't use either of these compilers, and turn off inlining
> as much as possible anyway using -fno-inline -fno-inline-functions-called-
> once (this is very broken in clang -- -fno-inline turns off inlining of
> even functions declared as __inline (like curthread), and clang doesn't
> support -fno-inline-functions-called-once).
>
> >> ...
> >> Similarly in bintime().
> > I merged two functions, finally.  Having to copy the same code is too
> > annoying for this change.
> >
> > So I verified that:
> > - there is no 64bit multiplication in the generated code, for i386 both
> >   for clang 7.0 and gcc 8.3;
> > - that everything is inlined, the only call from bintime/binuptime is
> >   the indirect call to get the timecounter value.
>
> I will have to fix it for compilers that I use.

Ok, I will add __inline.

> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..0fd39e25058 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> + ...
> > +static void
> > +binnouptime(struct bintime *bt, u_int off)
> >  {
> >  	struct timehands *th;
> > -	u_int gen;
> > +	struct bintime *bts;
> > +	uint64_t scale;
> > +	u_int delta, gen;
> >
> >  	do {
> >  		th = timehands;
> >  		gen = atomic_load_acq_int(&th->th_generation);
> > -		*bt = th->th_offset;
> > -		bintime_addx(bt, th->th_scale * tc_delta(th));
> > +		bts = (struct bintime *)(vm_offset_t)th + off;
>
> I don't like the merging.  It obscures the code with conversions like this.
>
> > +		*bt = *bts;
> > +		scale = th->th_scale;
> > +		delta = tc_delta(th);
> > +#ifdef _LP64
> > +		if (__predict_false(th->th_large_delta <= delta)) {
> > +			/* Avoid overflow for scale * delta. */
> > +			bintime_helper(bt, scale, delta);
> > +			bintime_addx(bt, (scale & 0xffffffff) * delta);
> > +		} else {
> > +			bintime_addx(bt, scale * delta);
> > +		}
> > +#else
> > +		/*
> > +		 * Use bintime_helper() unconditionally, since the fast
> > +		 * path in the above method is not so fast here, since
> > +		 * the 64 x 32 -> 64 bit multiplication is usually not
> > +		 * available in hardware and emulating it using 2
> > +		 * 32 x 32 -> 64 bit multiplications uses code much
> > +		 * like that in bintime_helper().
> > +		 */
> > +		bintime_helper(bt, scale, delta);
> > +		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> > +#endif
>
> Check that this method is really better.  Without this, the complicated
> part is about half as large and duplicating it is smaller than this
> version.

Better in what sense ?  I am fine with the C code, and the asm code looks
good.

> > @@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
> >  void
> >  bintime(struct bintime *bt)
> >  {
> > -	struct timehands *th;
> > -	u_int gen;
> >
> > -	do {
> > -		th = timehands;
> > -		gen = atomic_load_acq_int(&th->th_generation);
> > -		*bt = th->th_bintime;
> > -		bintime_addx(bt, th->th_scale * tc_delta(th));
> > -		atomic_thread_fence_acq();
> > -	} while (gen == 0 || gen != th->th_generation);
>
> Duplicating this loop is much better than obfuscating it using inline
> functions.  This loop was almost duplicated (except for the delta
> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
> 8 fflock ones).  Now it is only duplicated 16 times.

How did you count the 16 ?  I can see only 4 instances in the unpatched
kern_tc.c, and 3 in the patched one, but it is 3 and not 1 only because I
do not touch ffclock until the patch is finalized.  After that, it would
be 1 instance for the kernel and 1 for userspace.

> > +	binnouptime(bt, __offsetof(struct timehands, th_bintime));
> >  }
> >
> >  void
>
> Bruce
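(For completeness: the body of bintime_helper() is not shown in this message.
Reconstructed from the calls in the patch above, the overflow-avoiding split
looks roughly like the sketch below; the actual helper earlier in the thread
may differ in detail:

	static __inline void
	bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
	{
		uint64_t x;

		/*
		 * scale * delta does not fit in 64 bits when delta is large.
		 * Split scale into 32-bit halves: the high half times delta
		 * carries into the seconds field, the remainder stays in the
		 * 64-bit fraction.  The caller adds the low half separately
		 * via bintime_addx(bt, (scale & 0xffffffff) * delta).
		 */
		x = (scale >> 32) * delta;
		bt->sec += x >> 32;
		bintime_addx(bt, x << 32);
	}

Here th_large_delta is presumably set in tc_windup() to roughly the largest
delta for which scale * delta still fits in 64 bits, i.e. on the order of
2^64 / th_scale, so the helper only runs when a timehands update has been
delayed for a long time.)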