Date: Mon, 4 Mar 2019 05:29:48 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Konstantin Belousov <kostikbel@gmail.com> Cc: Bruce Evans <brde@optusnet.com.au>, Mark Millard <marklmi@yahoo.com>, freebsd-hackers Hackers <freebsd-hackers@freebsd.org>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org> Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed] Message-ID: <20190304043416.V5640@besplex.bde.org> In-Reply-To: <20190303161635.GJ68879@kib.kiev.ua> References: <20190301112717.GW2420@kib.kiev.ua> <20190302043936.A4444@besplex.bde.org> <20190301194217.GB68879@kib.kiev.ua> <20190302071425.G5025@besplex.bde.org> <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org> <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org> <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org> <20190303161635.GJ68879@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, 3 Mar 2019, Konstantin Belousov wrote:
> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>>>>
>>>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
>>>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
> * ...
>>>> Yes, that was its point.  It is a bit annoying to have a hardware
>>>> timecounter like the TSC that doesn't wrap naturally, but then make it
>>>> wrap by masking high bits.
>>>>
>>>> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
>>>> discard high bids at the register level, then another step to apply the
>>>> nask to discard th high bits again.
>>> rdtsc-low is implemented in the natural way, after RDTSC, no register
>>> combining into 64bit value is done, instead shrd operates on %edx:%eax
>>> to get the final result into %eax.  I am not sure what you refer to.
>>
>> I was referring mostly to the masking step '& tc->tc_counter_mask' and
>> the lack of register combining in rdtsc().
>>
>> However, shrd in rdtsc-low (tsc_get_timecount_low()) does a slow combining
>> step.  i386 used to be faster here -- the first masking step of discarding
>> %edx doesn't take any code.  amd64 has to mask out the top bits in %rax.
>> Now for the tsc-low pessimization, i386 has to do a slow shrd, and amd64
>> has to do a not so slow shr.
> i386 cannot discard %edx after RDTSC since some bits from %edx come into
> the timecounter value.
These bits are part of the tsc-low pessimization.  The shift count should
always be 1, giving a TSC frequency of > INT32_MAX (usually) and > UINT32_MAX
sometimes.
When tsc-low was new, the shift count was often larger (as much as 8),
and it is still changeable by a read-only tunable, but now it is 1 in
almost all cases.  The code only limits the timecounter frequency
to UINT_MAX, except the tunable defaults to 1 so average CPUs running
at nearly 4 GHz are usually limited to about 2 GHz.  The comment about
this UINT_MAX doesn't match the code.  The comment says int, but the
code says UINT.
All that a shoft count of 1 does is waste time to lose 1 bit of accuracy.
This much accuracy is noise for most purposes.
The tunable is fairly undocumented.  Its description is "Shift to apply
for the maximum TSC frequency".  Of course, it has no effect on the TSC
frequency.  It only affects the TSC timecounter frequency.
The cputicker normally uses the TSC without even an lfence.  This use
only has to be monotonic per-CPU, so this is OK.  Also, any bugs hidden
by discarding low bits shouldn't show up per-CPU.  However, keeping
the cputicker below 4G actually has some efficiency advantages.  For
timecounters, there are no multiplications or divisions by the frequency
in the fast path, but cputicker use isn't so optimized and it does a
slow 64-bit division in cputick2usec().  Keeping cpu_tick_freqency
below UINT_MAX allows dividing by it in integer arithmetic in some cases,
This optimization is not done.
> amd64 cannot either, but amd64 does not need to mask out top bits in %rax,
> since the whole shrdl calculation occurs in 32bit registers, and the result
> is in %rax where top word is cleared by shrdl instruction automatically.
> But the clearing is not required since result is unsigned int anyway.
>
> Dissassemble of tsc_get_timecount_low() is very clear:
>   0xffffffff806767e4 <+4>:     mov    0x30(%rdi),%ecx
>   0xffffffff806767e7 <+7>:     rdtsc
>   0xffffffff806767e9 <+9>:     shrd   %cl,%edx,%eax
> ...
>   0xffffffff806767ed <+13>:    retq
> (I removed frame manipulations).
It would without the shift pessimization, since the function returns uint32_t
but rdtsc() gives uint64_t.  Removing the top bits is not needed since
tc_delta() removes them again, but the API doesn't allow expressing this.
Without the shift pessimization, we just do rdtsc() in all cases and don't
need this function call.  I think this is about 5-10 cycles faster after
some parallelism.
>>>> I prefer my way of writing this in 3 lines.  Modifying 'scale' for
>>>> the next step is especially ugly and pessimal when the next step is
>>>> in the caller and this function is not inlined.
>>> Can you show exactly what do you want ?
>>
>> Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
>> and don't pass 'scale' indirectly to bintime_helper() and don't modify
>> it there.
>>
>> Oops, there is a problem.  'scale' must be reduced iff bintime_helper()
>> was used.  Duplicate some source code so as to not need a fall-through
>> to the fast path.  See below.
> Yes, this is the reason why it is passed by pointer (C has no references).
The indirection is slow no matter how it is spelled, unless it is inlined
away.
>>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
>>> index 2656fb4d22f..6c41ab22288 100644
>>> --- a/sys/kern/kern_tc.c
>>> +++ b/sys/kern/kern_tc.c
>>> @@ -72,6 +71,7 @@ struct timehands {
>>> 	struct timecounter	*th_counter;
>>> 	int64_t			th_adjustment;
>>> 	uint64_t		th_scale;
>>> +	uint64_t		th_large_delta;
>>> 	u_int	 		th_offset_count;
>>> 	struct bintime		th_offset;
>>> 	struct bintime		th_bintime;
>>> @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
>>> 	} while (gen == 0 || gen != th->th_generation);
>>> }
>>> #else /* !FFCLOCK */
>>> +
>>> +static void
>>
>> Add __inline.  This is in the fast path for 32-bit systems.
> Compilers do not need this hand-holding, and I prefer to avoid __inline
> unless really necessary.  I checked with both clang 7.0 and gcc 8.3
> that autoinlining did occured.
But they do.  I don't use either of these compilers, and turn of inlining
as much as possible anyway using -fno-inline -fno-inline-functions-called-
once (this is very broken in clang -- -fno-inline turns off inlining of
even functions declared as __inline (like curthread), and clang doesn't
support -fno-inline -fno-inline-functions-called-once.
>> ...
>> Similarly in bintime().
> I merged two functions, finally.  Having to copy the same code is too
> annoying for this change.
>
> So I verified that:
> - there is no 64bit multiplication in the generated code, for i386 both
>  for clang 7.0 and gcc 8.3;
> - that everything is inlined, the only call from bintime/binuptime is
>  the indirect call to get the timecounter value.
I will have to fix it for compilers that I use.
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..0fd39e25058 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
+ ...
> +static void
> +binnouptime(struct bintime *bt, u_int off)
> {
> 	struct timehands *th;
> -	u_int gen;
> +	struct bintime *bts;
> +	uint64_t scale;
> +	u_int delta, gen;
>
> 	do {
> 		th = timehands;
> 		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_offset;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> +		bts = (struct bintime *)(vm_offset_t)th + off;
I don't like the merging.  It obscures the code with conversions like this.
> +		*bt = *bts;
> +		scale = th->th_scale;
> +		delta = tc_delta(th);
> +#ifdef _LP64
> +		if (__predict_false(th->th_large_delta <= delta)) {
> +			/* Avoid overflow for scale * delta. */
> +			bintime_helper(bt, scale, delta);
> +			bintime_addx(bt, (scale & 0xffffffff) * delta);
> +		} else {
> +			bintime_addx(bt, scale * delta);
> +		}
> +#else
> +		/*
> +		 * Use bintime_helper() unconditionally, since the fast
> +		 * path in the above method is not so fast here, since
> +		 * the 64 x 32 -> 64 bit multiplication is usually not
> +		 * available in hardware and emulating it using 2
> +		 * 32 x 32 -> 64 bit multiplications uses code much
> +		 * like that in bintime_helper().
> +		 */
> +		bintime_helper(bt, scale, delta);
> +		bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
> +#endif
Check that this method is really better.  Without this, the complicated
part is about half as large and duplicating it is smaller than this
version.
> @@ -387,16 +430,8 @@ microuptime(struct timeval *tvp)
> void
> bintime(struct bintime *bt)
> {
> -	struct timehands *th;
> -	u_int gen;
>
> -	do {
> -		th = timehands;
> -		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_bintime;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));
> -		atomic_thread_fence_acq();
> -	} while (gen == 0 || gen != th->th_generation);
Duplicating this loop is much better than obfuscating it using inline
functions.  This loop was almost duplicated (except for the delta
calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
8 fflock ones).  Now it is only duplicated 16 times.
> +	binnouptime(bt, __offsetof(struct timehands, th_bintime));
> }
>
> void
Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190304043416.V5640>
