From owner-freebsd-ppc@freebsd.org  Sat Mar  9 07:00:28 2019
Date: Sat, 9 Mar 2019 18:00:14 +1100 (EST)
From: Bruce Evans
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov
cc: Bruce Evans, Mark Millard, freebsd-hackers Hackers, FreeBSD PowerPC ML
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
In-Reply-To: <20190307222220.GK2492@kib.kiev.ua>
Message-ID: <20190309144844.K1166@besplex.bde.org>
References: <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org>
 <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org>
 <20190303161635.GJ68879@kib.kiev.ua> <20190304043416.V5640@besplex.bde.org>
 <20190304114150.GM68879@kib.kiev.ua> <20190305031010.I4610@besplex.bde.org>
 <20190306172003.GD2492@kib.kiev.ua> <20190308001005.M2756@besplex.bde.org>
 <20190307222220.GK2492@kib.kiev.ua>
List-Id: Porting FreeBSD to the PowerPC

On Fri, 8 Mar 2019, Konstantin Belousov wrote:

> On Fri, Mar 08, 2019 at 01:31:30AM +1100, Bruce Evans wrote:
>> On Wed, 6 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Tue, Mar 05, 2019 at 05:17:14AM +1100, Bruce Evans wrote:
>>>> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
>>>>
>>>>> On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
>>>>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>>>>>>
>>>>>>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
>>> * ...
>>>> I strongly dislike the merge.  I more strongly dislike the more complete
>>>> merge.  The central APIs have even more parameters and reduced type
>>>> safety to describe objects as (offset, size) pairs.
> * ...
>>>>>>> +#else
>>>>>>> +        /*
>>>>>>> +         * Use bintime_helper() unconditionally, since the fast
>>>>>>> +         * path in the above method is not so fast here, since
>>>>>>> +         * the 64 x 32 -> 64 bit multiplication is usually not
>>>>>>> +         * available in hardware and emulating it using 2
>>>>>>> +         * 32 x 32 -> 64 bit multiplications uses code much
>>>>>>> +         * like that in bintime_helper().
>>>>>>> +         */
>>>>>>> +        bintime_helper(bt, scale, delta);
>>>>>>> +        bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
>>>>>>> +#endif
>>>>>>
>>>>>> Check that this method is really better.  Without this, the complicated
>>>>>> part is about half as large and duplicating it is smaller than this
>>>>>> version.
>>>>>
>>>>> Better in what sense?  I am fine with the C code, and the asm code looks
>>>>> good.
>>>>
>>>> Better in terms of actually running significantly faster.  I fear the
>>>> 32-bit method is actually slightly slower for the fast path.
>>
>> I checked that it is just worse.  Significantly slower and more complicated.
>>
>> I wrote and ran a lot of timing benchmarks of various versions.  All
>> times in cycles on Haswell @4.08 GHz.  On i386 except where noted:
>> ...
>> The above tests were done with the final version.  The version which tested
>> alternatives used switch (method) and takes about 20 cycles longer for the
>> fastest version, presumably by defeating parallelism.  Times for various
>> methods:
>>
>> - with clang -Os, about 54 cycles for the old method that allowed overflow,
>>   and the same for the version with the check of the overflow threshold
>>   (but with the threshold never reached), and 59 cycles for the branch-free
>>   method.  100-116 cycles with gcc-4.2.1 -Os, with the branch-free
>>   method taking 5-10 cycles longer.
>>
>> - on amd64, only a couple of cycles faster (49-50 cycles in best cases),
>>   and gcc-4.2.1 only taking a couple of cycles longer.  The branch-free
>>   method still takes about 59 cycles, so it is relatively worse.
>>
>> In userland, using the syscall for clock_gettime(), the
>> extra 5-10 cycles for the branch-free method is relatively insignificant.
>> It is about 2 nanoseconds.  Other pessimizations are more significant.
>> Times for this syscall:
>> - amd64 now: 224 nsec (with gcc-4.2.1 -Os)
>> - i386 4+4 nopae: 500-580 nsec (depending on clang/gcc and -O2/-Os);
>>   even getpid(2) takes 280 nsec.  Add at least 140 more nsec for pae.
>> - i386 3+1: 224 nsec (with gcc-4.2.1 -Os)
>> - i386 FreeBSD-5 UP: 193 nsec (with gcc-3.3.3 -O).
>> - i386 4+4 nopae, old library version of clock_gettime() compiled by
>>   clang: 29 nsec.
>>
>> In some tests, the version with the branch was even a cycle or two faster.
>> In the tests, the branch was always perfectly predicted, so it costs nothing
>> except possibly by changing scheduling in an accidentally good way.  The
>> tests were too small to measure the cost of using branch prediction
>> resources.  I've never noticed a case where 1 more branch causes thrashing.
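
For concreteness, here is a standalone sketch of the overflow-avoiding split
that bintime_helper() in the quoted patch performs (the *_sketch names are
mine, not the kernel's; struct bintime_sketch just mirrors struct bintime):

#include <stdint.h>

struct bintime_sketch {
        int64_t         sec;    /* seconds */
        uint64_t        frac;   /* fraction of a second, units of 2^-64 s */
};

static void
bintime_addx_sketch(struct bintime_sketch *bt, uint64_t x)
{
        uint64_t u = bt->frac;

        bt->frac += x;
        if (u > bt->frac)       /* carry out of the fraction into seconds */
                bt->sec++;
}

static void
bintime_helper_sketch(struct bintime_sketch *bt, uint64_t scale, uint32_t delta)
{
        uint64_t x;

        /* High half of scale times delta; x is in units of 2^-32 s. */
        x = (scale >> 32) * delta;
        bt->sec += x >> 32;                     /* whole seconds */
        bintime_addx_sketch(bt, x << 32);       /* remaining fraction */
        /* Low half of scale times delta; already in units of 2^-64 s. */
        bintime_addx_sketch(bt, (scale & 0xffffffff) * delta);
}

(This is the point of the quoted #else comment: on 32-bit arches a
64 x 32 -> 64 multiplication must be emulated with much the same two
32 x 32 -> 64 multiplications anyway.)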

> About testing such tight loops.  There is a known phenomenon where Intel
> CPUs give absurd times when code in the loop has unsuitable alignment.
> The manifestation of the phenomenon is very surprising and hardly
> controllable.  It is due to the way the CPU front-end prefetches blocks
> of bytes for instruction decoding and the jmp locations in the blocks.
>
> The only source explaining it is https://www.youtube.com/watch?v=IX16gcX4vDQ
> a talk by an Intel engineer.

I know a little about such tests since I have written thousands and
interpreted millions of them (mostly automatically).  There are a lot of
other side effects of caching resources that usually make more difference
than alignment.  The most mysterious one that I noticed was apparently due
to alignment, but in a makeworld macro-benchmark.  Minor changes, even in
unused functions or data, gave differences of about 2% in real time and many
more % in system time.  This only showed up on an old Turion2 (early
Athlon64) system.  I think it is due to limited cache associativity causing
many cache misses by lining up unrelated, far-apart code or data addresses
mod some power of 2.  Padding to give the same alignment as the best case
was too hard, but I eventually found a configuration accidentally giving
nearly the best case even with its alignments changed by small modifications
to the areas that I was working on.

> * ...
>>>>>>> -        do {
>>>>>>> -                th = timehands;
>>>>>>> -                gen = atomic_load_acq_int(&th->th_generation);
>>>>>>> -                *bt = th->th_bintime;
>>>>>>> -                bintime_addx(bt, th->th_scale * tc_delta(th));
>>>>>>> -                atomic_thread_fence_acq();
>>>>>>> -        } while (gen == 0 || gen != th->th_generation);
>>>>>>
>>>>>> Duplicating this loop is much better than obfuscating it using inline
>>>>>> functions.  This loop was almost duplicated (except for the delta
>>>>>> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
>>>>>> 8 fflock ones).  Now it is only duplicated 16 times.
>>>>> How did you count the 16?  I can see only 4 instances in the unpatched
>>>>> kern_tc.c, and 3 in the patched one, but it is 3 and not 1 only because
>>>>> I do not touch ffclock until the patch is finalized.  After that, it
>>>>> would be 1 instance for the kernel and 1 for userspace.
>>>>
>>>> Grep for the end condition in this loop.  There are actually 20 of these.
>>>> I'm counting the loops and not the previously-simple scaling operation in
>>>> it.  The scaling is indeed only done for 4 cases.  I prefer the 20
>>>> duplications (except I only want about 6 of the functions).  Duplication
>>>> works even better for only 4 cases.
>>> Ok, I merged these as well.  Now there are only four loops left in the
>>> kernel.  I do not think that merging them is beneficial, since they have
>>> sufficiently different bodies.
>>
>> This is exactly what I don't want.
>>>
>>> I disagree with your characterization of it as obfuscation; IMO it improves
>>> the maintainability of the code by reducing the number of places which need
>>> careful inspection of the lock-less algorithm.
>>
>> It makes the inspection and changes more difficult for each instance.
>> General functions are more difficult to work with since they need more
>> args to control them and can't be changed without affecting all callers.
>>
>> In another thread, you didn't like similar churn for removing td args.
> It is not similar.  I do valid refactoring there (in terms of that
> thread, I do not like the term refactoring).  I eliminate a dozen instances
> of a very intricate loop which implements quite a delicate lockless algorithm.
> Its trickiness can be illustrated by the fact that it is the only valid
> use of thread_fence_acq() which cannot be replaced by load_acq() (a similar
> case is present in sys/seq.h).
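
To make that ordering requirement concrete, here is a self-contained C11
model of the generation protocol (the snap_* names are invented; the kernel's
loops use atomic_load_acq_int() and atomic_thread_fence_acq(), and read the
data with plain accesses rather than relaxed atomics):

#include <stdatomic.h>
#include <stdint.h>

struct snap {
        atomic_uint      gen;   /* 0 while an update is in progress */
        _Atomic uint64_t a;     /* published data; relaxed atomics here only */
        _Atomic uint64_t b;     /* to keep the model free of formal races */
};

static void
snap_read(struct snap *s, uint64_t *a, uint64_t *b)
{
        unsigned int gen;

        do {
                /* Acquire: the data loads below stay after this load. */
                gen = atomic_load_explicit(&s->gen, memory_order_acquire);
                *a = atomic_load_explicit(&s->a, memory_order_relaxed);
                *b = atomic_load_explicit(&s->b, memory_order_relaxed);
                /*
                 * Acquire fence: the data loads above stay before the
                 * re-read of gen in the loop condition.  A plain acquire
                 * load of gen there would only order *later* accesses
                 * after itself, which is not what the re-check needs --
                 * the thread_fence_acq() vs load_acq() point above.
                 */
                atomic_thread_fence(memory_order_acquire);
        } while (gen == 0 ||
            gen != atomic_load_explicit(&s->gen, memory_order_relaxed));
}

static void
snap_write(struct snap *s, uint64_t a, uint64_t b)
{
        unsigned int gen;

        gen = atomic_load_explicit(&s->gen, memory_order_relaxed) + 1;
        if (gen == 0)
                gen = 1;
        /* Mark the data unstable; keep this store before the data stores. */
        atomic_store_explicit(&s->gen, 0, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&s->a, a, memory_order_relaxed);
        atomic_store_explicit(&s->b, b, memory_order_relaxed);
        /* Publish: the stores above become visible before the new gen. */
        atomic_store_explicit(&s->gen, gen, memory_order_release);
}

The writer has the same shape as the kernel's windup code: zero the
generation, update behind a release fence, then publish a new non-zero
generation with a release store.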

Small delicate loops are ideal for duplicating.  They are easier to
understand individually and short enough to compare without using diff to
see gratuitous and substantive differences.  Multiple instances are only
hard to write and maintain.  Since these multiple instances are already
written, they are only harder to maintain.

>> XX void
>> XX binuptime(struct bintime *bt)
>> XX {
>> XX @@ -361,7 +383,7 @@
>> XX                 th = timehands;
>> XX                 gen = atomic_load_acq_int(&th->th_generation);
>> XX                 *bt = th->th_offset;
>> XX -               bintime_addx(bt, th->th_scale * tc_delta(th));
>> XX +               bintime_adddelta(bt, th);
>> XX                 atomic_thread_fence_acq();
>> XX         } while (gen == 0 || gen != th->th_generation);
>> XX }
>>
>> This is the kind of non-churning change that I like.
> Ok.  I made all cases where timehands are read more uniform by
> moving calculations after the generation loop.  This makes the
> atomic part of the functions easier to see, and the loop body has a lower
> chance to hit a generation reset.

I think this change is slightly worse:

- it increases register pressure.  'scale' and 'delta' must be read in
  almost program order before the loop exit test.  The above order uses
  them and stores the results to memory, so more registers are free for
  the exit test.  i386 certainly runs out of registers.  IIRC, i386 now
  spills 'gen'.  It would have to spill something to load 'gen' or 'th'
  for the test.

- it enlarges the window between reading 'scale' and 'delta' and the
  caller seeing the results.  Preemption in this window gives results
  that may be far in the past.

>>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
>>> index 2656fb4d22f..7114a0e5219 100644
>>> --- a/sys/kern/kern_tc.c
>>> +++ b/sys/kern/kern_tc.c
>>> ...
>>> @@ -200,22 +201,77 @@ tc_delta(struct timehands *th)
>>>   * the comment in <sys/time.h> for a description of these 12 functions.
>>>   */
>>>
>>> -#ifdef FFCLOCK
>>> -void
>>> -fbclock_binuptime(struct bintime *bt)
>>> +static __inline void
>>> +bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
>>
>> This name is not descriptive.
>>
>>> +static __inline void
>>> +binnouptime(struct bintime *bt, u_int off)
>>
>> This name is an example of further problems with the naming scheme.
>> The bintime_ prefix used above is verbose, but it is at least a prefix
>> and is in the normal bintime_ namespace.  Here the prefix is 'bin',
>> which is neither of these.  It means bintime_ again, but this duplicates
>> 'time'.
> I agree, and I made the name getthmember for the other function, which
> clearly reflects its operation.  For this one, I ended up with bintime_off().

The 'get' name is another problem.  I would like to remove all the get*time
functions and not add new names starting with 'get'.  The library
implementation already doesn't bother optimizing the get*time functions,
but always uses the hardware timecounter.

getfoo() is a more natural name than foo_get() for the action of getting
foo, but the latter is better for consistency, especially in code that puts
the subsystem name first in nearby code.

The get*time functions would be better if they were more like time_second.
Note that time_second is racy if time_t is too large for the arch so that
accesses to it are not atomic, as happens on 32-bit arches with premature
64-bit time_t.  However, in this 32/64 case, the race is only run every 136
years, with the next event scheduled in 2038, so this race is even less
important now than other events scheduled in 2038.

Bintimes are 96 or 128 bits, so directly copying a global like time_second
for them would race every 1/2**32 second on 32-bit arches or every 1 second
on 64-bit arches.  Most of the loops on the generation count are for fixing
these races, but perhaps a simpler method would work.  On 64-bit arches with
atomic 64-bit accesses on 32-bit boundaries, the following would work (a
sketch is given after the list):

- set the lower 32 bits of the fraction to 0, or ignore them

- load the higher 32 bits of the fraction and the lower 32 bits of the
  seconds

- race once every 136 years starting in 2038 reading the higher 32 bits
  of the seconds non-atomically.

- alternatively, break instead of racing in 2038 by setting the higher 32
  bits to 0.  This is the same as using sbintimes instead of bintimes.

- drop a few more lower bits by storing a right-shifted value.  Right
  shifting by just 1 gives a race frequency of once per 272 years, with
  the next one in 2006.
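
A sketch of that scheme, assuming the low 32 bits of the seconds and the
high 32 bits of the fraction are re-packed into a single 64-bit word (they
are not adjacent in a normal struct bintime, so this is an illustration
only, with invented names):

#include <stdatomic.h>
#include <stdint.h>

/*
 * Publish a coarse bintime as one 64-bit word: low 32 bits of .sec in the
 * high half, high 32 bits of .frac in the low half.  Readers then need one
 * atomic load and no generation loop.
 */
static _Atomic uint64_t coarse_time;    /* updated on every windup */

static void
coarse_time_store(int64_t sec, uint64_t frac)
{
        uint64_t packed;

        packed = ((uint64_t)(uint32_t)sec << 32) | (frac >> 32);
        atomic_store_explicit(&coarse_time, packed, memory_order_release);
}

static void
coarse_time_load(int64_t *sec, uint64_t *frac)
{
        uint64_t packed;

        packed = atomic_load_explicit(&coarse_time, memory_order_acquire);
        *sec = (int64_t)(packed >> 32); /* high 32 bits of sec taken as 0 */
        *frac = packed << 32;           /* low 32 bits of frac dropped */
}

The precision lost is the low 32 bits of the fraction (under a nanosecond),
and the only remaining breakage is the 136-year seconds wrap described in
the list above.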

> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..8d12847f2cd 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> @@ -200,20 +202,56 @@ tc_delta(struct timehands *th)
>   * the comment in <sys/time.h> for a description of these 12 functions.
>   */
>
> -#ifdef FFCLOCK
> -void
> -fbclock_binuptime(struct bintime *bt)
> +static __inline void
> +bintime_off(struct bintime *bt, u_int off)
>  {
>          struct timehands *th;
> -        unsigned int gen;
> +        struct bintime *btp;
> +        uint64_t scale, x;
> +        u_int delta, gen, large_delta;
>
>          do {
>                  th = timehands;
>                  gen = atomic_load_acq_int(&th->th_generation);
> -                *bt = th->th_offset;
> -                bintime_addx(bt, th->th_scale * tc_delta(th));

You didn't fully obfuscate this by combining this function with
getthmember() so as to deduplicate the loop.

> +                btp = (struct bintime *)((vm_offset_t)th + off);

Ugly conversion to share code.  This is technically incorrect.  Improving
the casts gives:

        btp = (void *)(uintptr_t)((uintptr_t)(void *)th + off);

but this assumes that arithmetic on the intermediate integer does what is
expected.  uintptr_t is only guaranteed to work when the intermediate
representation held in it is not adjusted.

Fixing the API gives

        static __inline void
        bintime_off(struct bintime *btp, struct bintime *base_btp)

where base_btp is &th->th_bintime or &th->th_offset.

(th_offset and th_bintime are badly named.  th_offset is really a base time
and the offset is tc_delta().  th_bintime is also a base time.  It is the
same as th_offset with another actual offset (the difference between UTC
time and the uptime clock, i.e., the boot time) already added to it as an
optimization.)  In old versions, th_bintime didn't exist, but the related
struct members th_nanotime and th_microtime existed, since these benefit
more from not converting on every call.  My old version even documents the
struct members, while -current still has no comments.  The comments were
lost to staticization.  My version mostly adds "duh" to the banal comments
after recovering them:

XX /*
XX  * XXX rotted comment cloned from .
XX  *
XX  * th_counter is undocumented (duh).
XX  *
XX  * th_adjustment [PPM << 16] which means that the smallest unit of correction
XX  * you can apply amounts to 481.5 usec/year.
XX  *
XX  * th_scale is undocumented (duh).
XX  *
XX  * th_offset_count is the contents of the counter which corresponds to the
XX  * rest of the offset_* values.
XX  *
XX  * th_offset is undocumented (duh).
XX  *
XX  * th_microtime is undocumented (duh).
XX  *
XX  * th_nanotime is undocumented (duh).
XX  *
XX  * XXX especially massive bitrot here.  "three" is now "many"...
XX  * Each timecounter must supply an array of three timecounters.  This is needed
XX  * to guarantee atomicity in the code.  Index zero is used to transport
XX  * modifications, for instance done with sysctl, into the timecounter being
XX  * used in a safe way.  Such changes may be adopted with a delay of up to 1/HZ.
XX  * Index one and two are used alternately for the actual timekeeping.
XX  *
XX  * th_generation is undocumented (duh).
XX  *
XX  * th_next is undocumented (duh).
XX  */

> +                *bt = *btp;
> +                scale = th->th_scale;
> +                delta = tc_delta(th);
> +                large_delta = th->th_large_delta;

I had forgotten that th_scale is so volatile (it may be adjusted on every
windup).  th_large_delta is equally volatile.  So moving the calculation
outside of the loop gives even more register pressure than I noticed above.

>                  atomic_thread_fence_acq();
>          } while (gen == 0 || gen != th->th_generation);
> +
> +        if (__predict_false(delta < large_delta)) {
> +                /* Avoid overflow for scale * delta. */
> +                x = (scale >> 32) * delta;
> +                bt->sec += x >> 32;
> +                bintime_addx(bt, x << 32);
> +                bintime_addx(bt, (scale & 0xffffffff) * delta);
> +        } else {
> +                bintime_addx(bt, scale * delta);
> +        }
> +}
> +
> +static __inline void
> +getthmember(void *out, size_t out_size, u_int off)
> +{
> +        struct timehands *th;
> +        u_int gen;
> +
> +        do {
> +                th = timehands;
> +                gen = atomic_load_acq_int(&th->th_generation);
> +                memcpy(out, (char *)th + off, out_size);

This isn't so ugly or technically incorrect.  Now the object is generic, so
the reference to it should be passed as (void *objp, size_t objsize) instead
of the type-safe (struct bintime *base_btp).

> +                atomic_thread_fence_acq();
> +        } while (gen == 0 || gen != th->th_generation);
> +}

I can see a useful use of copying methods like this for sysctls.  All sysctl
accesses, except possibly for aligned register_t's, were originally racy,
but we sprinkled mutexes for large objects and reduced race windows for
smaller objects.  E.g., sysctl_handle_long() still makes a copy with no
locking, but this has no effect except on my i386-with-64-bit-longs, since
longs have the same size as ints so are as atomic as ints on 32-bit arches.
sysctl_handle_64() uses the same method.  It works to reduce the race window
on 32-bit arches.  sysctl_handle_string() makes a copy to malloc()ed
storage.  memcpy() to that risks losing the NUL terminator, and a subsequent
strlen() on the copy gives a buffer overrun if the result has no terminator.
sysctl_handle_opaque() uses a generation count method, like the one used by
timecounters before the ordering bugs were fixed, but even more primitive
and probably even more in need of ordering fixes.

It would be good to fix all sysctls using the same generation count method
as above.  A loop at the top level might work (a sketch follows).
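
A hypothetical sketch of such a generic copy (invented names, not an
existing sysctl(9) or kern_tc.c interface), in the spirit of getthmember()
above; a top-level sysctl loop could wrap each handler's copy of its
exported object in this instead of per-handler locking:

#include <sys/param.h>
#include <sys/systm.h>
#include <machine/atomic.h>

struct gen_object {
        volatile u_int  go_gen;         /* 0 while an update is in progress */
        /* exported data follows */
};

static void
gen_copy(struct gen_object *go, u_int off, void *objp, size_t objsize)
{
        u_int gen;

        do {
                gen = atomic_load_acq_int(&go->go_gen);
                memcpy(objp, (char *)go + off, objsize);
                atomic_thread_fence_acq();
        } while (gen == 0 || gen != go->go_gen);
}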

I wouldn't like a structure like the above where the top level calls
individual sysctl functions which do nothing except wrap themselves in a
generic function like the above.

The above does give this structure to clock_gettime() calls.  The top level
converts the clock id to a function, and the above makes the function
essentially convert back to another clock id (the offset of the relevant
field in timehands), especially for the get*time functions where the call
just copies the relevant field to userland.  Unfortunately, the individual
time functions are called directly in the kernel.

I prefer this to generic APIs based on ids, so that callers can use simple,
efficient APIs like nanouptime() instead of complicated inefficiencies like

        kern_clock_gettime_generic(int clock_id = CLOCK_MONOTONIC,
            int format_id = CLOCK_TYPE_TIMESPEC,
            int precision = CLOCK_PRECISION_NSEC, void *dstp = &ts);

Bruce