Date: Sun, 3 Mar 2019 18:16:35 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: Mark Millard, freebsd-hackers Hackers, FreeBSD PowerPC ML
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID: <20190303161635.GJ68879@kib.kiev.ua>
In-Reply-To: <20190303223100.B3572@besplex.bde.org>
References: <20190301112717.GW2420@kib.kiev.ua> <20190302043936.A4444@besplex.bde.org> <20190301194217.GB68879@kib.kiev.ua> <20190302071425.G5025@besplex.bde.org> <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org> <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org> <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org>

On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>
> > On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> >> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>
> >>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >* ...
> >>>> I don't like changing this at all.  binuptime() was carefully written
> >>>> to not need so much 64-bit arithmetic.
> >>>>
> >>>> If this pessimization is allowed, then it can also handle 64-bit
> >>>> deltas.  Using the better kernel method:
> >>>>
> >>>>         if (__predict_false(delta >= th->th_large_delta)) {
> >>>>                 bt->sec += (scale >> 32) * (delta >> 32);
> >>>>                 x = (scale >> 32) * (delta & 0xffffffff);
> >>>>                 bt->sec += x >> 32;
> >>>>                 bintime_addx(bt, x << 32);
> >>>>                 x = (scale & 0xffffffff) * (delta >> 32);
> >>>>                 bt->sec += x >> 32;
> >>>>                 bintime_addx(bt, x << 32);
> >>>>                 bintime_addx(bt, (scale & 0xffffffff) *
> >>>>                     (delta & 0xffffffff));
> >>>>         } else
> >>>>                 bintime_addx(bt, scale * (delta & 0xffffffff));
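(For readers following along: the quoted slow path is the usual decomposition
of a 64x64->128 bit multiplication into four 32x32->64 partial products.  A
minimal standalone sketch of the same decomposition is below; it is my
illustration, not kernel code, the test values are arbitrary, and the fraction
carries that bintime_addx() would absorb in the kernel are propagated by hand.)

#include <assert.h>
#include <stdint.h>

static void
mul64x64(uint64_t scale, uint64_t delta, uint64_t *sec, uint64_t *frac)
{
        uint64_t x;

        *sec = (scale >> 32) * (delta >> 32);           /* whole seconds */
        *frac = (scale & 0xffffffff) * (delta & 0xffffffff);
        x = (scale >> 32) * (delta & 0xffffffff);       /* first cross term */
        *sec += x >> 32;
        *frac += x << 32;
        *sec += *frac < (x << 32);              /* carry out of the fraction */
        x = (scale & 0xffffffff) * (delta >> 32);       /* second cross term */
        *sec += x >> 32;
        x <<= 32;
        *frac += x;
        *sec += *frac < x;
}

int
main(void)
{
        uint64_t scale = 18446744072ULL;        /* ~2^64 / 1 GHz, for example */
        uint64_t delta = 0x123456789ULL;        /* wider than 32 bits on purpose */
        __uint128_t p = (__uint128_t)scale * delta;
        uint64_t sec, frac;

        mul64x64(scale, delta, &sec, &frac);
        assert(sec == (uint64_t)(p >> 64) && frac == (uint64_t)p);
        return (0);
}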
> >>> This only makes sense if delta is extended to uint64_t, which requires
> >>> the pass over timecounters.
> >>
> >> Yes, that was its point.  It is a bit annoying to have a hardware
> >> timecounter like the TSC that doesn't wrap naturally, but then make it
> >> wrap by masking high bits.
> >>
> >> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
> >> discard the high bits at the register level, then another step to apply
> >> the mask to discard the high bits again.
> > rdtsc-low is implemented in the natural way: after RDTSC, no register
> > combining into a 64bit value is done; instead shrd operates on %edx:%eax
> > to get the final result into %eax.  I am not sure what you refer to.
>
> I was referring mostly to the masking step '& tc->tc_counter_mask' and
> the lack of register combining in rdtsc().
>
> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
> step.  i386 used to be faster here -- the first masking step of discarding
> %edx doesn't take any code.  amd64 has to mask out the top bits in %rax.
> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
> has to do a not so slow shr.
i386 cannot discard %edx after RDTSC since some bits from %edx come into
the timecounter value.

amd64 cannot either, but amd64 does not need to mask out the top bits in
%rax, since the whole shrdl calculation occurs in 32bit registers, and the
result is in %rax where the top word is cleared by the shrdl instruction
automatically.  But the clearing is not required since the result is
unsigned int anyway.

Disassembly of tsc_get_timecount_low() is very clear:
   0xffffffff806767e4 <+4>:     mov    0x30(%rdi),%ecx
   0xffffffff806767e7 <+7>:     rdtsc
   0xffffffff806767e9 <+9>:     shrd   %cl,%edx,%eax
...
   0xffffffff806767ed <+13>:    retq
(I removed frame manipulations).

> Then the '& tc->tc_counter_mask' step has no effect.
This is true.

> All this is wrapped in many layers of function calls which are quite slow
> but this lets the other operations run in parallel on some CPUs.
>
> >>>>         /* 32-bit arches did the next multiplication implicitly. */
> >>>>         x = (scale >> 32) * delta;
> >>>>         /*
> >>>>          * And they did the following shifts and most of the adds
> >>>>          * implicitly too.  Except shifting x left by 32 lost the
> >>>>          * seconds part that the next line handles.  The next line
> >>>>          * is the only extra cost for them.
> >>>>          */
> >>>>         bt->sec += x >> 32;
> >>>>         bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
> >>>
> >>> Ok, what about the following.
> >>
> >> I'm not sure that I really want this, even if the pessimization is done.
> >> But it avoids using fls*(), so is especially good for 32-bit systems and
> >> OK for 64-bit systems too, especially in userland where fls*() is in the
> >> fast path.
> > For userland I looked at the generated code, and BSR usage seems to be
> > good enough, for default compilation settings with clang.
>
> I use gcc-4.2.1, and it doesn't do this optimization.
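(For concreteness, the fls*()-based overflow guard being discussed could be
written along these lines; this is only my sketch of the idea, not the actual
userland patch, and its cost depends entirely on fls*() compiling to BSR.)

#include <stdint.h>
#include <strings.h>    /* flsll() on FreeBSD */

/*
 * Conservative check: scale < 2^flsll(scale) and delta < 2^flsll(delta),
 * so the 64-bit product cannot overflow when the bit lengths sum to at
 * most 64.
 */
static int
scale_mul_may_overflow(uint64_t scale, uint32_t delta)
{

        return (flsll(scale) + flsll(delta) > 64);
}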
> I already reported this in connection with fixing calcru1().  calcru1()
> is unnecessarily several times slower on i386 than on amd64 even after
> avoiding using flsll() on it.  The main slowness is in converting 'usec'
> to tv_sec and tv_usec, due to the bad design and implementation of the
> __udivdi3 and __umoddi3 libcalls.  The bad design is having to make 2
> libcalls to get the quotient and remainder.  The bad implementation is
> the portable C version in libkern.  libgcc provides a better
> implementation, but this is not available in the kernel.
>
> >>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> >>> index 2656fb4d22f..2e28f872229 100644
> >>> --- a/sys/kern/kern_tc.c
> >>> +++ b/sys/kern/kern_tc.c
> >>> ...
> >>> @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> >>>          } while (gen == 0 || gen != th->th_generation);
> >>>  }
> >>>  #else /* !FFCLOCK */
> >>> +
> >>> +static void
> >>> +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> >>> +{
> >>> +        uint64_t x;
> >>> +
> >>> +        x = (*scale >> 32) * delta;
> >>> +        *scale &= 0xffffffff;
> >>> +        bt->sec += x >> 32;
> >>> +        bintime_addx(bt, x << 32);
> >>> +}
> >>
> >> It is probably best to not inline the slow path, but clang tends to
> >> inline everything anyway.
> > It does not matter if it inlines it, as long as it is moved out of the
> > linear sequence for the fast path.
> >>
> >> I prefer my way of writing this in 3 lines.  Modifying 'scale' for
> >> the next step is especially ugly and pessimal when the next step is
> >> in the caller and this function is not inlined.
> > Can you show exactly what you want?
>
> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
> and don't pass 'scale' indirectly to bintime_helper() and don't modify
> it there.
>
> Oops, there is a problem.  'scale' must be reduced iff bintime_helper()
> was used.  Duplicate some source code so as to not need a fall-through
> to the fast path.  See below.
Yes, this is the reason why it is passed by pointer (C has no references).

> >>>  void
> >>>  binuptime(struct bintime *bt)
> >>>  {
> >>>          struct timehands *th;
> >>> -        u_int gen;
> >>> +        uint64_t scale;
> >>> +        u_int delta, gen;
> >>>
> >>>          do {
> >>>                  th = timehands;
> >>>                  gen = atomic_load_acq_int(&th->th_generation);
> >>>                  *bt = th->th_offset;
> >>> -                bintime_addx(bt, th->th_scale * tc_delta(th));
> >>> +                scale = th->th_scale;
> >>> +                delta = tc_delta(th);
> >>> +#ifdef _LP64
> >>> +                /* Avoid overflow for scale * delta. */
> >>> +                if (__predict_false(th->th_large_delta <= delta))
> >>> +                        bintime_helper(bt, &scale, delta);
> >>> +                bintime_addx(bt, scale * delta);
> >>> +#else
> >>> +                /*
> >>> +                 * Also avoid (uint64_t, uint32_t) -> uint64_t
> >>> +                 * multiplication on 32bit arches.
> >>> +                 */
> >>
> >> "Also avoid overflow for ..."
> >>
> >>> +                bintime_helper(bt, &scale, delta);
> >>> +                bintime_addx(bt, (u_int)scale * delta);
> >>
> >> The cast should be to uint32_t, but better write it as & 0xffffffff as
> >> elsewhere.
>
> This is actually very broken.  The cast gives a 32 x 32 -> 32 bit
> multiplication, but all 64 bits of the result are needed.
Yes, fixed in the updated version.

> >> bintime_helper() already reduced 'scale' to 32 bits.  The cast might be
> >> needed to tell the compiler this, especially when the function is not
> >> inlined.  Better not do it in the function.  The function doesn't even
> >> use the reduced value.
> > I used cast to use 32x32 multiplication.  I am not sure that all (or any)
> > compilers are smart enough to deduce that they can use 32 bit mul.
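(A quick way to settle that question is to compile a trivial pair of functions
with -O2 -m32 and compare the assembly; the function names below are mine,
chosen just for the experiment.)

#include <stdint.h>

/* 64 x 32 -> 64: on i386 this takes two 32x32 multiplies plus an add. */
uint64_t
mul_64x32(uint64_t scale, uint32_t delta)
{

        return (scale * delta);
}

/* 32 x 32 -> 64: the casts let the compiler emit a single mull. */
uint64_t
mul_32x32(uint64_t scale, uint32_t delta)
{

        return ((uint64_t)(uint32_t)scale * delta);
}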
> Writing the reduction to 32 bits using a mask instead of a cast
> automatically avoids the bug, but might not give the optimization.
>
> They do do this optimization, but might need the cast as well as the mask.
> At worst, '(uint64_t)(uint32_t)(scale & 0xffffffff)', where the mask is
> now redundant but the cast back to 64 bits is needed if the cast to 32
> bits is used.
>
> You already depended on them not needing the cast for the expression
> '(*scale >> 32) * delta'.  Here delta is 32 bits and the other operand
> must remain 64 bits so that after default promotions the multiplication
> is 64 x 64 -> 64 bits, but the compiler should optimize this to
> 32 x 32 -> 64 bits.  (*scale >> 32) would need to be cast to 32 bits
> and then back to 64 bits if the compiler can't do this automatically.
>
> I checked what some compilers do.  Both gcc-3.3.3 and gcc-4.2.1
> optimize only (uint64_t)x * y (where x and y have type uint32_t), so they
> need to be helped by casts if x and y have a larger type even if
> their values obviously fit in 32 bits.  So the expressions should be
> written as:
>
>         (uint64_t)(uint32_t)(scale >> 32) * delta;
>
> and
>
>         (uint64_t)(uint32_t)scale * delta;
>
> The 2 casts are always needed, but the '& 0xffffffff' operation doesn't
> need to be explicit because the cast does.
This is what I do now.

> >> This needs lots of testing of course.
> >
> > Current kernel-only part of the change is below, see the question about
> > your preference for binuptime_helper().
> >
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..6c41ab22288 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > @@ -72,6 +71,7 @@ struct timehands {
> >          struct timecounter      *th_counter;
> >          int64_t                 th_adjustment;
> >          uint64_t                th_scale;
> > +        uint64_t                th_large_delta;
> >          u_int                   th_offset_count;
> >          struct bintime          th_offset;
> >          struct bintime          th_bintime;
> > @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
> >          } while (gen == 0 || gen != th->th_generation);
> >  }
> >  #else /* !FFCLOCK */
> > +
> > +static void
>
> Add __inline.  This is in the fast path for 32-bit systems.
Compilers do not need this hand-holding, and I prefer to avoid __inline
unless really necessary.  I checked with both clang 7.0 and gcc 8.3 that
autoinlining does occur.

> > +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> > +{
> > +        uint64_t x;
> > +
> > +        x = (*scale >> 32) * delta;
> > +        *scale &= 0xffffffff;
>
> Remove the '*' on scale, cast (scale >> 32) to
> (uint64_t)(uint32_t)(scale >> 32), and remove the change to *scale.
>
> > +        bt->sec += x >> 32;
> > +        bintime_addx(bt, x << 32);
> > +}
> > +
> >  void
> >  binuptime(struct bintime *bt)
> >  {
> >          struct timehands *th;
> > -        u_int gen;
> > +        uint64_t scale;
> > +        u_int delta, gen;
> >
> >          do {
> >                  th = timehands;
> >                  gen = atomic_load_acq_int(&th->th_generation);
> >                  *bt = th->th_offset;
> > -                bintime_addx(bt, th->th_scale * tc_delta(th));
> > +                scale = th->th_scale;
> > +                delta = tc_delta(th);
> > +#ifdef _LP64
> > +                /* Avoid overflow for scale * delta. */
> > +                if (__predict_false(th->th_large_delta <= delta))
> > +                        bintime_helper(bt, &scale, delta);
> > +                bintime_addx(bt, scale * delta);
>
> Change to:
>
>         if (__predict_false(th->th_large_delta <= delta)) {
>                 bintime_helper(bt, scale, delta);
>                 bintime_addx(bt, (scale & 0xffffffff) * delta);
>         } else
>                 bintime_addx(bt, scale * delta);
I do not like it, but ok.
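(To convince myself that the split path really computes the same thing as the
full product, a small userland harness like the one below can mirror the two
contributions; this is my test, not part of the patch, and it uses __uint128_t
as the reference on a 64-bit host.)

#include <assert.h>
#include <stdint.h>

int
main(void)
{
        uint64_t scale = 18446744072ULL;        /* example: ~2^64 / 1 GHz */
        uint32_t delta = 3000000000U;   /* ~3 s of ticks: scale * delta > 2^64 */
        uint64_t sec = 0, frac = 0, x;
        __uint128_t ref = (__uint128_t)scale * delta;

        /* What bintime_helper() adds: high word of scale times delta. */
        x = (scale >> 32) * delta;
        sec += x >> 32;
        frac += x << 32;

        /*
         * What the following bintime_addx() adds: low word of scale times
         * delta, including the carry that bintime_addx() would propagate
         * from the fraction into the seconds.
         */
        x = (scale & 0xffffffff) * delta;
        sec += frac + x < frac;
        frac += x;

        assert(sec == (uint64_t)(ref >> 64) && frac == (uint64_t)ref);
        return (0);
}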
> > +#else
> > +                /*
> > +                 * Avoid both overflow as above and
> > +                 * (uint64_t, uint32_t) -> uint64_t
> > +                 * multiplication on 32bit arches.
> > +                 */
>
> This is a bit unclear.  Better emphasize avoidance of the 64 x 32 -> 64 bit
> multiplication.  Something like:
>
>                 /*
>                  * Use bintime_helper() unconditionally, since the fast
>                  * path in the above method is not so fast here, since
>                  * the 64 x 32 -> 64 bit multiplication is usually not
>                  * available in hardware and emulating it using 2
>                  * 32 x 32 -> 64 bit multiplications uses code much
>                  * like that in bintime_helper().
>                  */
>
> > +                bintime_helper(bt, &scale, delta);
> > +                bintime_addx(bt, (uint32_t)scale * delta);
> > +#endif
>
> Remove '&' as usual, and fix this by casting the reduced scale back to
> 64 bits.
>
> Similarly in bintime().
I merged two functions, finally.  Having to copy the same code is too
annoying for this change.

So I verified that:
- there is no 64bit multiplication in the generated code, for i386, both
  with clang 7.0 and gcc 8.3;
- everything is inlined; the only call from bintime/binuptime is the
  indirect call to get the timecounter value.

> Similarly in libc -- don't use the slow flsll() method in the 32-bit
> case where it is especially slow.  Don't use it in the 64-bit case either,
> since this would need to be changed when th_large_delta is added to the
> API.
>
> Now I don't like my method in the kernel.  It is unnecessarily
> complicated to have a special case, and not faster either.

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..0fd39e25058 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +72,7 @@ struct timehands {
         struct timecounter      *th_counter;
         int64_t                 th_adjustment;
         uint64_t                th_scale;
+        uint64_t                th_large_delta;
         u_int                   th_offset_count;
         struct bintime          th_offset;
         struct bintime          th_bintime;
@@ -351,21 +352,63 @@ fbclock_getmicrotime(struct timeval *tvp)
         } while (gen == 0 || gen != th->th_generation);
 }
 #else /* !FFCLOCK */
-void
-binuptime(struct bintime *bt)
+
+static void
+bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
+{
+        uint64_t x;
+
+        x = (scale >> 32) * delta;
+        bt->sec += x >> 32;
+        bintime_addx(bt, x << 32);
+}
+
+static void
+binnouptime(struct bintime *bt, u_int off)
 {
         struct timehands *th;
-        u_int gen;
+        struct bintime *bts;
+        uint64_t scale;
+        u_int delta, gen;
 
         do {
                 th = timehands;
                 gen = atomic_load_acq_int(&th->th_generation);
-                *bt = th->th_offset;
-                bintime_addx(bt, th->th_scale * tc_delta(th));
+                bts = (struct bintime *)((vm_offset_t)th + off);
+                *bt = *bts;
+                scale = th->th_scale;
+                delta = tc_delta(th);
+#ifdef _LP64
+                if (__predict_false(th->th_large_delta <= delta)) {
+                        /* Avoid overflow for scale * delta. */
+                        bintime_helper(bt, scale, delta);
+                        bintime_addx(bt, (scale & 0xffffffff) * delta);
+                } else {
+                        bintime_addx(bt, scale * delta);
+                }
+#else
+                /*
+                 * Use bintime_helper() unconditionally, since the fast
+                 * path in the above method is not so fast here, since
+                 * the 64 x 32 -> 64 bit multiplication is usually not
+                 * available in hardware and emulating it using 2
+                 * 32 x 32 -> 64 bit multiplications uses code much
+                 * like that in bintime_helper().
+                 */
+                bintime_helper(bt, scale, delta);
+                bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
+#endif
                 atomic_thread_fence_acq();
         } while (gen == 0 || gen != th->th_generation);
 }
 
+void
+binuptime(struct bintime *bt)
+{
+
+        binnouptime(bt, __offsetof(struct timehands, th_offset));
+}
+
 void
 nanouptime(struct timespec *tsp)
 {
@@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
 void
 bintime(struct bintime *bt)
 {
-        struct timehands *th;
-        u_int gen;
 
-        do {
-                th = timehands;
-                gen = atomic_load_acq_int(&th->th_generation);
-                *bt = th->th_bintime;
-                bintime_addx(bt, th->th_scale * tc_delta(th));
-                atomic_thread_fence_acq();
-        } while (gen == 0 || gen != th->th_generation);
+        binnouptime(bt, __offsetof(struct timehands, th_bintime));
 }
 
 void
@@ -1464,6 +1499,7 @@ tc_windup(struct bintime *new_boottimebin)
         scale += (th->th_adjustment / 1024) * 2199;
         scale /= th->th_counter->tc_frequency;
         th->th_scale = scale * 2;
+        th->th_large_delta = ((uint64_t)1 << 63) / scale;
 
         /*
          * Now that the struct timehands is again consistent, set the new
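(A quick userland sanity check of the new th_large_delta value; the arithmetic
and the 1 GHz frequency are mine, purely for illustration.  With scale being
about 2^64 / tc_frequency, the threshold works out to roughly half a second
worth of ticks, so the slow path only triggers when the timehands have gone
unusually stale, as in the stuck-timecounter case that started this thread.)

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t freq = 1000000000;     /* assume a 1 GHz timecounter */
        /* Mirror tc_windup(): scale ~= 2^64 / freq, built as 2 * (2^63 / freq). */
        uint64_t scale = ((uint64_t)1 << 63) / freq * 2;
        uint64_t large_delta = ((uint64_t)1 << 63) / scale;

        /*
         * scale * delta first overflows 64 bits near delta ~= freq (about
         * one second of ticks); the threshold is roughly half of that.
         */
        printf("scale %ju, large_delta %ju ticks (~%.2f s)\n",
            (uintmax_t)scale, (uintmax_t)large_delta,
            (double)large_delta / freq);
        return (0);
}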