From: Bruce Evans <brde@optusnet.com.au>
Date: Mon, 4 Mar 2019 00:32:12 +1100 (EST)
To: Konstantin Belousov
Cc: Mark Millard, freebsd-hackers Hackers, FreeBSD PowerPC ML
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale *
 tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
In-Reply-To: <20190303111931.GI68879@kib.kiev.ua>
Message-ID: <20190303223100.B3572@besplex.bde.org>

On Sun, 3 Mar 2019, Konstantin Belousov wrote:

> On Sun, Mar 03, 2019 at 04:43:20AM +1100, Bruce Evans wrote:
>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Sun, Mar 03, 2019 at 12:03:18AM +1100, Bruce Evans wrote:
>>>> On Sat, 2 Mar 2019, Konstantin Belousov wrote:
>* ...
>>>> I don't like changing this at all.  binuptime() was carefully written
>>>> to not need so much 64-bit arithmetic.
>>>>
>>>> If this pessimization is allowed, then it can also handle 64-bit
>>>> deltas.  Using the better kernel method:
>>>>
>>>>         if (__predict_false(delta >= th->th_large_delta)) {
>>>>                 bt->sec += (scale >> 32) * (delta >> 32);
>>>>                 x = (scale >> 32) * (delta & 0xffffffff);
>>>>                 bt->sec += x >> 32;
>>>>                 bintime_addx(bt, x << 32);
>>>>                 x = (scale & 0xffffffff) * (delta >> 32);
>>>>                 bt->sec += x >> 32;
>>>>                 bintime_addx(bt, x << 32);
>>>>                 bintime_addx(bt, (scale & 0xffffffff) *
>>>>                     (delta & 0xffffffff));
>>>>         } else
>>>>                 bintime_addx(bt, scale * (delta & 0xffffffff));
>>> This only makes sense if delta is extended to uint64_t, which requires
>>> the pass over timecounters.
>>
>> Yes, that was its point.  It is a bit annoying to have a hardware
>> timecounter like the TSC that doesn't wrap naturally, but then make it
>> wrap by masking high bits.
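
The 32-bit split in the snippet quoted above can be checked in isolation.
The following is a standalone sketch, not kernel code: struct bintime and
bintime_addx() are simplified stand-ins for the kernel's, the test values
are arbitrary, and the reference product assumes a compiler with
unsigned __int128 (64-bit gcc or clang).

    /*
     * Check that the 32-bit split accumulates the same 128-bit product
     * scale * delta as a reference computed with unsigned __int128.
     */
    #include <stdint.h>
    #include <stdio.h>

    struct bintime {
            uint64_t sec;           /* the kernel uses time_t here */
            uint64_t frac;
    };

    static void
    bintime_addx(struct bintime *bt, uint64_t x)
    {
            uint64_t u;

            u = bt->frac;
            bt->frac += x;
            if (u > bt->frac)       /* carry out of the fraction */
                    bt->sec++;
    }

    static void
    add_product_split(struct bintime *bt, uint64_t scale, uint64_t delta)
    {
            uint64_t x;

            /* Same decomposition as the quoted slow path. */
            bt->sec += (scale >> 32) * (delta >> 32);
            x = (scale >> 32) * (delta & 0xffffffff);
            bt->sec += x >> 32;
            bintime_addx(bt, x << 32);
            x = (scale & 0xffffffff) * (delta >> 32);
            bt->sec += x >> 32;
            bintime_addx(bt, x << 32);
            bintime_addx(bt, (scale & 0xffffffff) * (delta & 0xffffffff));
    }

    int
    main(void)
    {
            struct bintime bt = { 0, 0 };
            uint64_t scale = 0x161e4f765fd8adabULL; /* arbitrary */
            uint64_t delta = 0x123456789ULL;        /* wider than 32 bits */
            unsigned __int128 p = (unsigned __int128)scale * delta;

            add_product_split(&bt, scale, delta);
            printf("split: sec %ju frac %ju\n",
                (uintmax_t)bt.sec, (uintmax_t)bt.frac);
            printf("ref:   sec %ju frac %ju\n",
                (uintmax_t)(p >> 64), (uintmax_t)p);
            return (0);
    }

Both lines should print the same pair, which is just the high and low
64 bits of the 128-bit product.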
>>
>> The masking step is also a bit wasteful.  For the TSC, it is 1 step to
>> discard the high bits at the register level, then another step to apply
>> the mask to discard the high bits again.
> rdtsc-low is implemented in the natural way, after RDTSC, no register
> combining into a 64-bit value is done, instead shrd operates on %edx:%eax
> to get the final result into %eax.  I am not sure what you refer to.

I was referring mostly to the masking step '& tc->tc_counter_mask' and the
lack of register combining in rdtsc().  However, shrd in rdtsc-low
(tsc_get_timecount_low()) does a slow combining step.  i386 used to be
faster here -- the first masking step of discarding %edx doesn't take any
code.  amd64 has to mask out the top bits in %rax.  Now, for the tsc-low
pessimization, i386 has to do a slow shrd, and amd64 has to do a not so
slow shr.  Then the '& tc->tc_counter_mask' step has no effect.

All this is wrapped in many layers of function calls which are quite slow,
but this lets the other operations run in parallel on some CPUs.

>>>>         /* 32-bit arches did the next multiplication implicitly. */
>>>>         x = (scale >> 32) * delta;
>>>>         /*
>>>>          * And they did the following shifts and most of the adds
>>>>          * implicitly too.  Except shifting x left by 32 lost the
>>>>          * seconds part that the next line handles.  The next line
>>>>          * is the only extra cost for them.
>>>>          */
>>>>         bt->sec += x >> 32;
>>>>         bintime_addx(bt, (x << 32) + (scale & 0xffffffff) * delta);
>>>
>>> Ok, what about the following.
>>
>> I'm not sure that I really want this, even if the pessimization is done.
>> But it avoids using fls*(), so it is especially good for 32-bit systems
>> and OK for 64-bit systems too, especially in userland where fls*() is in
>> the fast path.
> For userland I looked at the generated code, and BSR usage seems to be
> good enough, for default compilation settings with clang.

I use gcc-4.2.1, and it doesn't do this optimization.  I already reported
this in connection with fixing calcru1().  calcru1() is unnecessarily
several times slower on i386 than on amd64, even after avoiding the use of
flsll() in it.  The main slowness is in converting 'usec' to tv_sec and
tv_usec, due to the bad design and implementation of the __udivdi3 and
__umoddi3 libcalls.  The bad design is having to make 2 libcalls to get
the quotient and remainder.  The bad implementation is the portable C
version in libkern.  libgcc provides a better implementation, but it is
not available in the kernel.
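
To illustrate the 2-libcall point, here is a minimal sketch (not calcru1()
itself) of the microseconds conversion.  With older compilers such as the
gcc-4.2.1 mentioned above, each 64-bit '/' and '%' on a 32-bit target
becomes a separate libcall; deriving the remainder from the quotient by
multiply-subtract pays for only one of them.  Newer compilers may expand
division by a constant inline, so the effect varies.

    #include <stdint.h>

    /* Two libcalls on i386 with old gcc: __udivdi3 and __umoddi3. */
    void
    usec_to_sec_usec_two_calls(uint64_t usec, uint64_t *sec, uint32_t *us)
    {
            *sec = usec / 1000000;
            *us = usec % 1000000;
    }

    /* One division; the remainder is recovered by multiply-subtract. */
    void
    usec_to_sec_usec_one_call(uint64_t usec, uint64_t *sec, uint32_t *us)
    {
            uint64_t q;

            q = usec / 1000000;
            *sec = q;
            *us = (uint32_t)(usec - q * 1000000);
    }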
>>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
>>> index 2656fb4d22f..2e28f872229 100644
>>> --- a/sys/kern/kern_tc.c
>>> +++ b/sys/kern/kern_tc.c
>>> ...
>>> @@ -351,17 +352,44 @@ fbclock_getmicrotime(struct timeval *tvp)
>>>          } while (gen == 0 || gen != th->th_generation);
>>>  }
>>>  #else /* !FFCLOCK */
>>> +
>>> +static void
>>> +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
>>> +{
>>> +        uint64_t x;
>>> +
>>> +        x = (*scale >> 32) * delta;
>>> +        *scale &= 0xffffffff;
>>> +        bt->sec += x >> 32;
>>> +        bintime_addx(bt, x << 32);
>>> +}
>>
>> It is probably best to not inline the slow path, but clang tends to
>> inline everything anyway.
> It does not matter if it inlines it, as long as it is moved out of the
> linear sequence for the fast path.
>>
>> I prefer my way of writing this in 3 lines.  Modifying 'scale' for the
>> next step is especially ugly and pessimal when the next step is in the
>> caller and this function is not inlined.
> Can you show exactly what you want?

Just write 'scale & 0xffffffff' for the low bits of 'scale' in callers,
and don't pass 'scale' indirectly to bintime_helper() and don't modify it
there.

Oops, there is a problem.  'scale' must be reduced iff bintime_helper()
was used.  Duplicate some source code so as to not need a fall-through to
the fast path.  See below.

>>>  void
>>>  binuptime(struct bintime *bt)
>>>  {
>>>          struct timehands *th;
>>> -        u_int gen;
>>> +        uint64_t scale;
>>> +        u_int delta, gen;
>>>
>>>          do {
>>>                  th = timehands;
>>>                  gen = atomic_load_acq_int(&th->th_generation);
>>>                  *bt = th->th_offset;
>>> -                bintime_addx(bt, th->th_scale * tc_delta(th));
>>> +                scale = th->th_scale;
>>> +                delta = tc_delta(th);
>>> +#ifdef _LP64
>>> +                /* Avoid overflow for scale * delta. */
>>> +                if (__predict_false(th->th_large_delta <= delta))
>>> +                        bintime_helper(bt, &scale, delta);
>>> +                bintime_addx(bt, scale * delta);
>>> +#else
>>> +                /*
>>> +                 * Also avoid (uint64_t, uint32_t) -> uint64_t
>>> +                 * multiplication on 32bit arches.
>>> +                 */
>>
>> "Also avoid overflow for ..."
>>
>>> +                bintime_helper(bt, &scale, delta);
>>> +                bintime_addx(bt, (u_int)scale * delta);
>>
>> The cast should be to uint32_t, but better write it as & 0xffffffff as
>> elsewhere.

This is actually very broken.  The cast gives a 32 x 32 -> 32 bit
multiplication, but all 64 bits of the result are needed.

>>
>> bintime_helper() already reduced 'scale' to 32 bits.  The cast might be
>> needed to tell the compiler this, especially when the function is not
>> inlined.  Better not do it in the function.  The function doesn't even
>> use the reduced value.
> I used cast to use 32x32 multiplication.  I am not sure that all (or any)
> compilers are smart enough to deduce that they can use 32 bit mul.

Writing the reduction to 32 bits using a mask instead of a cast
automatically avoids the bug, but might not give the optimization.
Compilers do perform this optimization, but might need the cast as well as
the mask.  At worst, write '(uint64_t)(uint32_t)(scale & 0xffffffff)',
where the mask is now redundant but the cast back to 64 bits is needed if
the cast to 32 bits is used.

You already depended on compilers not needing the cast for the expression
'(*scale >> 32) * delta'.  Here delta is 32 bits and the other operand
must remain 64 bits, so that after the default promotions the
multiplication is 64 x 64 -> 64 bits, but the compiler should optimize
this to 32 x 32 -> 64 bits.  (*scale >> 32) would need to be cast to
32 bits and then back to 64 bits if the compiler can't do this
automatically.

I checked what some compilers do.  Both gcc-3.3.3 and gcc-4.2.1 optimize
only (uint64_t)x * y (where x and y have type uint32_t), so they need to
be helped by casts if x and y have a larger type, even if their values
obviously fit in 32 bits.  So the expressions should be written as:

        (uint64_t)(uint32_t)(scale >> 32) * delta;

and

        (uint64_t)(uint32_t)scale * delta;

The 2 casts are always needed, but the '& 0xffffffff' operation doesn't
need to be explicit because the cast does the masking.
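
For checking what a given compiler actually emits (e.g., with cc -S on
i386), here is a compilable sketch of the spellings discussed above.  The
comments reflect the gcc-3.3.3/gcc-4.2.1 behaviour reported here and may
not apply to other compilers.

    #include <stdint.h>

    /* Masked only: the value fits in 32 bits, but the operand type stays
     * 64 bits, so old gcc still emits a full 64 x 64 -> 64 bit multiply. */
    uint64_t
    mul_masked(uint64_t scale, uint32_t delta)
    {
            return ((scale & 0xffffffff) * delta);
    }

    /* Cast to 32 bits and back: a candidate for a single mull on i386. */
    uint64_t
    mul_cast_low(uint64_t scale, uint32_t delta)
    {
            return ((uint64_t)(uint32_t)scale * delta);
    }

    /* The same treatment for the high word of 'scale'. */
    uint64_t
    mul_cast_high(uint64_t scale, uint32_t delta)
    {
            return ((uint64_t)(uint32_t)(scale >> 32) * delta);
    }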
>> This needs lots of testing of course.
>
> Current kernel-only part of the change is below, see the question about
> your preference for bintime_helper().
>
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..6c41ab22288 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> @@ -72,6 +71,7 @@ struct timehands {
>          struct timecounter      *th_counter;
>          int64_t                 th_adjustment;
>          uint64_t                th_scale;
> +        uint64_t                th_large_delta;
>          u_int                   th_offset_count;
>          struct bintime          th_offset;
>          struct bintime          th_bintime;
> @@ -351,17 +351,45 @@ fbclock_getmicrotime(struct timeval *tvp)
>          } while (gen == 0 || gen != th->th_generation);
>  }
>  #else /* !FFCLOCK */
> +
> +static void

Add __inline.  This is in the fast path for 32-bit systems.

> +bintime_helper(struct bintime *bt, uint64_t *scale, u_int delta)
> +{
> +        uint64_t x;
> +
> +        x = (*scale >> 32) * delta;
> +        *scale &= 0xffffffff;

Remove the '*' on scale, cast (scale >> 32) to
(uint64_t)(uint32_t)(scale >> 32), and remove the change to *scale.

> +        bt->sec += x >> 32;
> +        bintime_addx(bt, x << 32);
> +}
> +
>  void
>  binuptime(struct bintime *bt)
>  {
>          struct timehands *th;
> -        u_int gen;
> +        uint64_t scale;
> +        u_int delta, gen;
>
>          do {
>                  th = timehands;
>                  gen = atomic_load_acq_int(&th->th_generation);
>                  *bt = th->th_offset;
> -                bintime_addx(bt, th->th_scale * tc_delta(th));
> +                scale = th->th_scale;
> +                delta = tc_delta(th);
> +#ifdef _LP64
> +                /* Avoid overflow for scale * delta. */
> +                if (__predict_false(th->th_large_delta <= delta))
> +                        bintime_helper(bt, &scale, delta);
> +                bintime_addx(bt, scale * delta);

Change to:

        if (__predict_false(th->th_large_delta <= delta)) {
                bintime_helper(bt, scale, delta);
                bintime_addx(bt, (scale & 0xffffffff) * delta);
        } else
                bintime_addx(bt, scale * delta);

> +#else
> +                /*
> +                 * Avoid both overflow as above and
> +                 * (uint64_t, uint32_t) -> uint64_t
> +                 * multiplication on 32bit arches.
> +                 */

This is a bit unclear.  Better emphasize avoidance of the 64 x 32 -> 64
bit multiplication.  Something like:

                /*
                 * Use bintime_helper() unconditionally, since the fast
                 * path in the above method is not so fast here, since
                 * the 64 x 32 -> 64 bit multiplication is usually not
                 * available in hardware and emulating it using 2
                 * 32 x 32 -> 64 bit multiplications uses code much
                 * like that in bintime_helper().
                 */

> +                bintime_helper(bt, &scale, delta);
> +                bintime_addx(bt, (uint32_t)scale * delta);
> +#endif

Remove the '&' as usual, and fix this by casting the reduced scale back
to 64 bits.

Similarly in bintime().

Similarly in libc -- don't use the slow flsll() method in the 32-bit case
where it is especially slow.  Don't use it in the 64-bit case either,
since it would need to be changed when th_large_delta is added to the API.

Now I don't like my method in the kernel.  It is unnecessarily complicated
to have a special case, and not faster either.

Bruce
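
For reference, here is a consolidated sketch of binuptime() with the
review comments above folded in.  It is a sketch only, not committed
FreeBSD code, and it is a kern_tc.c fragment rather than a standalone
program (timehands, tc_delta() and bintime_addx() are the kernel's);
th_large_delta and bintime_helper() come from the quoted patch, with the
suggested changes applied.

    /* Contribution of the high 32 bits of scale to bt. */
    static __inline void
    bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
    {
            uint64_t x;

            x = (uint64_t)(uint32_t)(scale >> 32) * delta;
            bt->sec += x >> 32;
            bintime_addx(bt, x << 32);
    }

    void
    binuptime(struct bintime *bt)
    {
            struct timehands *th;
            uint64_t scale;
            u_int delta, gen;

            do {
                    th = timehands;
                    gen = atomic_load_acq_int(&th->th_generation);
                    *bt = th->th_offset;
                    scale = th->th_scale;
                    delta = tc_delta(th);
    #ifdef _LP64
                    if (__predict_false(th->th_large_delta <= delta)) {
                            /* Slow path: scale * delta overflows 64 bits. */
                            bintime_helper(bt, scale, delta);
                            bintime_addx(bt, (scale & 0xffffffff) * delta);
                    } else
                            bintime_addx(bt, scale * delta);
    #else
                    /*
                     * Use bintime_helper() unconditionally: the
                     * 64 x 32 -> 64 bit multiplication of the fast path
                     * is usually emulated by 2 32 x 32 -> 64 bit
                     * multiplications here anyway.
                     */
                    bintime_helper(bt, scale, delta);
                    bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
    #endif
            } while (gen == 0 || gen != th->th_generation);
    }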