Date: Sun, 3 Mar 2019 13:19:31 +0200
From: Konstantin Belousov <kostikbel@gmail.com>
To: Bruce Evans
Cc: Mark Millard, freebsd-hackers Hackers, FreeBSD PowerPC ML
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]

On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>
> > On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
> >> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> >>> ...
> >>> So I am able to reproduce it with some surprising ease on HPET running
> >>> on Haswell.
> >>
> >> So what is the cause of it?  Maybe the tickless code doesn't generate
> >> fake clock ticks right.  Or it is just a library bug.  The kernel has
> >> to be slightly real-time to satisfy the requirement of 1 update per
> >> second.  Applications are further from being real-time.  But isn't it
> >> enough for the kernel to ensure that the timehands cycle more than
> >> once per second?
> > No, I entered ddb as you suggested.
>
> But using ddb is not normal.  It is convenient that this fixes HPET and
> ACPI timecounters after using ddb, but this method doesn't help for
> timecounters that wrap fast.  TSC-low at 2GHz wraps in 2 seconds, and
> i8254 wraps in a few milliseconds.
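Those wrap times follow directly from mask width and frequency: a counter
masked to M bits wraps after 2^M / freq seconds.  A throwaway userland
sketch of the arithmetic (the 2GHz figure is quoted above; the HPET rate
is a typical value, assumed here only for illustration):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Wrap time of a counter masked to 32 bits: 2^32 / frequency. */
	double mask_plus_1 = 4294967296.0;

	printf("TSC-low @ 2GHz:   %.2f s\n", mask_plus_1 / 2e9);
	printf("HPET @ 14.318MHz: %.2f s\n", mask_plus_1 / 14318180.0);
	return (0);
}

The i8254 is worse still: its counter is only 16 bits wide, and as used
by the clock code it rolls over in milliseconds rather than seconds.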
> >> I don't like changing this at all.  binuptime() was carefully written
> >> to not need so much 64-bit arithmetic.
> >>
> >> If this pessimization is allowed, then it can also handle 64-bit
> >> deltas.  Using the better kernel method:
> >>
> >> 	if (__predict_false(delta >= th->th_large_delta)) {
> >> 		bt->sec += (scale >> 32) * (delta >> 32);
> >> 		x = (scale >> 32) * (delta & 0xffffffff);
> >> 		bt->sec += x >> 32;
> >> 		bintime_addx(bt, x << 32);
> >> 		x = (scale & 0xffffffff) * (delta >> 32);
> >> 		bt->sec += x >> 32;
> >> 		bintime_addx(bt, x << 32);
> >> 		bintime_addx(bt, (scale & 0xffffffff) *
> >> 		    (delta & 0xffffffff));
> >> 	} else
> >> 		bintime_addx(bt, scale * (delta & 0xffffffff));
> > This only makes sense if delta is extended to uint64_t, which requires
> > the pass over timecounters.
>
> Yes, that was its point.  It is a bit annoying to have a hardware
> timecounter like the TSC that doesn't wrap naturally, but then make it
> wrap by masking high bits.
>
> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
> discard high bits at the register level, then another step to apply the
> mask to discard the high bits again.

rdtsc-low is implemented in the natural way: after RDTSC, no combining of
the registers into a 64-bit value is done; instead, SHRD operates on
%edx:%eax to put the final result into %eax.  I am not sure what you are
referring to.

> >> I just noticed that there is a 64 x 32 -> 64 bit multiplication in the
> >> current method.  This can be changed to do explicit 32 x 32 -> 64 bit
> >> multiplications and fix the overflow problem at small extra cost on
> >> 32-bit arches:
> >>
> >> 	/* 32-bit arches did the next multiplication implicitly. */
> >> 	x = (scale >> 32) * delta;
> >> 	/*
> >> 	 * And they did the following shifts and most of the adds
> >> 	 * implicitly too.  Except shifting x left by 32 lost the
> >> 	 * seconds part that the next line handles.  The next line
> >> 	 * is the only extra cost for them.
> >> 	 */
> >> 	bt->sec += x >> 32;
> >> 	bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
> >
> > Ok, what about the following.
>
> I'm not sure that I really want this, even if the pessimization is done.
> But it avoids using fls*(), so is especially good for 32-bit systems and
> OK for 64-bit systems too, especially in userland where fls*() is in the
> fast path.

For userland I looked at the generated code, and BSR usage seems to be
good enough for the default compilation settings with clang.

> > diff --git a/lib/libc/sys/__vdso_gettimeofday.c b/lib/libc/sys/__vdso_gettimeofday.c
> > index 3749e0473af..cfe3d96d001 100644
> > --- a/lib/libc/sys/__vdso_gettimeofday.c
> > +++ b/lib/libc/sys/__vdso_gettimeofday.c
> > @@ -32,6 +32,8 @@ __FBSDID("$FreeBSD$");
> >  #include
> >  #include
> >  #include
> > +#include <limits.h>
>
> Not needed with 0xffffffff instead of UINT_MAX.
>
> The userland part is otherwise little changed.

Yes, see above.  If the ABI of the shared page is changed at some point in
the future, I will export th_large_delta as well.
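To illustrate the rdtsc-low read described earlier, here is a sketch
reconstructed from that description rather than the verbatim kernel
source (the function name is invented, and it is x86-only):

#include <stdint.h>

/*
 * RDTSC leaves the 64-bit counter in %edx:%eax.  SHRD shifts the
 * %edx:%eax pair right by %cl bits, so the shifted counter's low
 * 32 bits land directly in %eax with no 64-bit combining step;
 * the truncation to 32 bits happens at the register level.
 */
static inline uint32_t
rdtsc_low(int shift)
{
	uint32_t rv;

	__asm__ __volatile__("rdtsc; shrd %%cl, %%edx, %0"
	    : "=a" (rv) : "c" (shift) : "edx");
	return (rv);
}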
> > diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> > index 2656fb4d22f..2e28f872229 100644
> > --- a/sys/kern/kern_tc.c
> > +++ b/sys/kern/kern_tc.c
> > ...
> > @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
> >  	} while (gen == 0 || gen != th->th_generation);
> >  }
> >  #else /* !FFCLOCK */
> > +
> > +static void
> > +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> > +{
> > +	uint64_t x;
> > +
> > +	x = (*scale >> 32) * delta;
> > +	*scale &= 0xffffffff;
> > +	bt->sec += x >> 32;
> > +	bintime_addx(bt, x << 32);
> > +}
>
> It is probably best to not inline the slow path, but clang tends to
> inline everything anyway.

It does not matter if it is inlined, as long as it is moved out of the
linear sequence of the fast path.

> I prefer my way of writing this in 3 lines.  Modifying 'scale' for
> the next step is especially ugly and pessimal when the next step is
> in the caller and this function is not inlined.

Can you show exactly what you want?

> > +
> >  void
> >  binuptime(struct bintime *bt)
> >  {
> >  	struct timehands *th;
> > -	u_int gen;
> > +	uint64_t scale;
> > +	u_int delta, gen;
> >
> >  	do {
> >  		th = timehands;
> >  		gen = atomic_load_acq_int(&th->th_generation);
> >  		*bt = th->th_offset;
> > -		bintime_addx(bt, th->th_scale * tc_delta(th));
> > +		scale = th->th_scale;
> > +		delta = tc_delta(th);
> > +#ifdef _LP64
> > +		/* Avoid overflow for scale * delta. */
> > +		if (__predict_false(th->th_large_delta <= delta))
> > +			bintime_helper(bt, &scale, delta);
> > +		bintime_addx(bt, scale * delta);
> > +#else
> > +		/*
> > +		 * Also avoid (uint64_t, uint32_t) -> uint64_t
> > +		 * multiplication on 32bit arches.
> > +		 */
>
> "Also avoid overflow for ..."
>
> > +		bintime_helper(bt, &scale, delta);
> > +		bintime_addx(bt, (u_int)scale * delta);
>
> The cast should be to uint32_t, but better write it as & 0xffffffff as
> elsewhere.
>
> bintime_helper() already reduced 'scale' to 32 bits.  The cast might be
> needed to tell the compiler this, especially when the function is not
> inlined.  Better not do it in the function.  The function doesn't even
> use the reduced value.

I used the cast to get a 32 x 32 multiplication.  I am not sure that all
(or any) compilers are smart enough to deduce that they can use a 32-bit
multiply there.

> bintime_helper() is in the fast path in this case, so should be inlined.
>
> > +#endif
> >  		atomic_thread_fence_acq();
> >  	} while (gen == 0 || gen != th->th_generation);
> >  }
>
> This needs lots of testing of course.
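That testing can at least begin in userland, by checking the helper's
split against a full 128-bit reference product over random inputs.  A
minimal sketch (the names are invented here, unsigned __int128 serves
only as the test oracle, and the explicit carry check stands in for what
bintime_addx() does internally):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Mirror of the split in bintime_helper() plus the final addition. */
static void
split_mul(uint64_t scale, uint32_t delta, uint64_t *sec, uint64_t *frac)
{
	uint64_t x;

	x = (scale >> 32) * delta;	/* high word of scale, 32x32->64 */
	scale &= 0xffffffff;
	*sec = x >> 32;			/* whole seconds */
	*frac = x << 32;		/* fraction contribution */
	x = scale * delta;		/* low word of scale, 32x32->64 */
	if (*frac + x < *frac)		/* carry bintime_addx() would catch */
		(*sec)++;
	*frac += x;
}

int
main(void)
{
	unsigned __int128 p;
	uint64_t frac, scale, sec;
	uint32_t delta;
	int i;

	for (i = 0; i < 10000000; i++) {
		scale = (uint64_t)arc4random() << 32 | arc4random();
		delta = arc4random();
		split_mul(scale, delta, &sec, &frac);
		p = (unsigned __int128)scale * delta;
		assert(sec == (uint64_t)(p >> 64));
		assert(frac == (uint64_t)p);
	}
	printf("ok\n");
	return (0);
}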
The current kernel-only part of the change is below; see the question
above about your preference for bintime_helper().

diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
index 2656fb4d22f..6c41ab22288 100644
--- a/sys/kern/kern_tc.c
+++ b/sys/kern/kern_tc.c
@@ -72,6 +71,7 @@ struct timehands {
 	struct timecounter	*th_counter;
 	int64_t			th_adjustment;
 	uint64_t		th_scale;
+	uint64_t		th_large_delta;
 	u_int			th_offset_count;
 	struct bintime		th_offset;
 	struct bintime		th_bintime;
@@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
 	} while (gen == 0 || gen != th->th_generation);
 }
 #else /* !FFCLOCK */
+
+static void
+bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
+{
+	uint64_t x;
+
+	x = (*scale >> 32) * delta;
+	*scale &= 0xffffffff;
+	bt->sec += x >> 32;
+	bintime_addx(bt, x << 32);
+}
+
 void
 binuptime(struct bintime *bt)
 {
 	struct timehands *th;
-	u_int gen;
+	uint64_t scale;
+	u_int delta, gen;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
 		*bt = th->th_offset;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		scale = th->th_scale;
+		delta = tc_delta(th);
+#ifdef _LP64
+		/* Avoid overflow for scale * delta. */
+		if (__predict_false(th->th_large_delta <= delta))
+			bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, scale * delta);
+#else
+		/*
+		 * Avoid both overflow as above and
+		 * (uint64_t, uint32_t) -> uint64_t
+		 * multiplication on 32bit arches.
+		 */
+		bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, (uint32_t)scale * delta);
+#endif
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 }
@@ -388,13 +416,29 @@ void
 bintime(struct bintime *bt)
 {
 	struct timehands *th;
-	u_int gen;
+	uint64_t scale;
+	u_int delta, gen;
 
 	do {
 		th = timehands;
 		gen = atomic_load_acq_int(&th->th_generation);
 		*bt = th->th_bintime;
-		bintime_addx(bt, th->th_scale * tc_delta(th));
+		scale = th->th_scale;
+		delta = tc_delta(th);
+#ifdef _LP64
+		/* Avoid overflow for scale * delta. */
+		if (__predict_false(th->th_large_delta <= delta))
+			bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, scale * delta);
+#else
+		/*
+		 * Avoid both overflow as above and
+		 * (uint64_t, uint32_t) -> uint64_t
+		 * multiplication on 32bit arches.
+		 */
+		bintime_helper(bt, &scale, delta);
+		bintime_addx(bt, (uint32_t)scale * delta);
+#endif
 		atomic_thread_fence_acq();
 	} while (gen == 0 || gen != th->th_generation);
 }
@@ -1464,6 +1508,7 @@ tc_windup(struct bintime *new_boottimebin)
 	scale += (th->th_adjustment / 1024) * 2199;
 	scale /= th->th_counter->tc_frequency;
 	th->th_scale = scale * 2;
+	th->th_large_delta = ((uint64_t)1 << 63) / scale;
 
 	/*
 	 * Now that the struct timehands is again consistent, set the new
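For a feel for the threshold's magnitude: th_scale is roughly
2^64 / tc_frequency, so th_large_delta comes out to roughly one second's
worth of ticks, and the helper runs only when close to a second has
passed since the last windup (as happens after sitting in ddb).  A
throwaway computation (the frequencies are examples, not from the patch):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* Example frequencies only; none of these come from the patch. */
	uint64_t freq[] = { 14318180, 1000000000, 2000000000 };
	uint64_t large_delta, scale;
	size_t i;

	for (i = 0; i < sizeof(freq) / sizeof(freq[0]); i++) {
		/* As in tc_windup(), before the doubling. */
		scale = ((uint64_t)1 << 63) / freq[i];
		large_delta = ((uint64_t)1 << 63) / scale;
		printf("%10ju Hz: th_large_delta = %ju ticks (~%.3f s)\n",
		    (uintmax_t)freq[i], (uintmax_t)large_delta,
		    (double)large_delta / freq[i]);
	}
	return (0);
}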