Date: Sat, 9 Mar 2019 18:00:14 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Bruce Evans <brde@optusnet.com.au>, Mark Millard <marklmi@yahoo.com>,
    freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
    FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale *
    tc_delta(th) overflows unsigned 64 bits sometimes [patched failed]
Message-ID: <20190309144844.K1166@besplex.bde.org>
In-Reply-To: <20190307222220.GK2492@kib.kiev.ua>
References: <20190302142521.GE68879@kib.kiev.ua>
    <20190303041441.V4781@besplex.bde.org> <20190303111931.GI68879@kib.kiev.ua>
    <20190303223100.B3572@besplex.bde.org> <20190303161635.GJ68879@kib.kiev.ua>
    <20190304043416.V5640@besplex.bde.org> <20190304114150.GM68879@kib.kiev.ua>
    <20190305031010.I4610@besplex.bde.org> <20190306172003.GD2492@kib.kiev.ua>
    <20190308001005.M2756@besplex.bde.org> <20190307222220.GK2492@kib.kiev.ua>

On Fri, 8 Mar 2019, Konstantin Belousov wrote:

> On Fri, Mar 08, 2019 at 01:31:30AM +1100, Bruce Evans wrote:
>> On Wed, 6 Mar 2019, Konstantin Belousov wrote:
>>
>>> On Tue, Mar 05, 2019 at 05:17:14AM +1100, Bruce Evans wrote:
>>>> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
>>>>
>>>>> On Mon, Mar 04, 2019 at 05:29:48AM +1100, Bruce Evans wrote:
>>>>>> On Sun, 3 Mar 2019, Konstantin Belousov wrote:
>>>>>>
>>>>>>> On Mon, Mar 04, 2019 at 12:32:12AM +1100, Bruce Evans wrote:
>>> * ...
>>>> I strongly disklike the merge.  I more strongly disclike (sic) the
>>>> more complete merge.  The central APIs have even more parameters and
>>>> reduced type safety to describe objects as (offset, size) pairs.
>* ...
>>>>>>> +#else
>>>>>>> +	/*
>>>>>>> +	 * Use bintime_helper() unconditionally, since the fast
>>>>>>> +	 * path in the above method is not so fast here, since
>>>>>>> +	 * the 64 x 32 -> 64 bit multiplication is usually not
>>>>>>> +	 * available in hardware and emulating it using 2
>>>>>>> +	 * 32 x 32 -> 64 bit multiplications uses code much
>>>>>>> +	 * like that in bintime_helper().
>>>>>>> +	 */
>>>>>>> +	bintime_helper(bt, scale, delta);
>>>>>>> +	bintime_addx(bt, (uint64_t)(uint32_t)scale * delta);
>>>>>>> +#endif
>>>>>>
>>>>>> Check that this method is really better.  Without this, the
>>>>>> complicated part is about half as large and duplicating it is
>>>>>> smaller than this version.
>>>>> Better in what sense ?  I am fine with the C code, and asm code looks
>>>>> good.
>>>>
>>>> Better in terms of actually running significantly faster.  I fear the
>>>> 32-bit method is actually slightly slower for the fast path.
>>
>> I checked that it is just worse.  Significantly slower and more
>> complicated.
>>
>> I wrote and ran a lot of timing benchmarks of various versions.  All
>> times in cycles on Haswell @4.08 GHz.  On i386 except where noted:
>> ...
>> The above tests were done with the final version.  The version which
>> tested alternatives used switch (method) and takes about 20 cycles
>> longer for the fastest version, presumably by defeating parallelism.
>> Times for various methods:
>>
>> - with clang -Os, about 54 cycles for the old method that allowed
>>   overflow, and the same for the version with the check of the overflow
>>   threshold (but with the threshold never reached), and 59 cycles for
>>   the branch-free method.  100-116 cycles with gcc-4.2.1 -Os, with the
>>   branch-free method taking 5-10 cycles longer.
>>
>> - on amd64, only a couple of cycles faster (49-50 cycles in best cases),
>>   and gcc-4.2.1 only taking a couple of cycles longer.  The branch-free
>>   method still takes about 59 cycles so it is relatively worse.
>>
>> In userland, using the syscall for clock_gettime(), the extra 5-10
>> cycles for the branch-free method is relatively insignificant.  It is
>> about 2 nanoseconds.  Other pessimizations are more significant.
>> Times for this syscall:
>> - amd64 now: 224 nsec (with gcc-4.2.1 -Os)
>> - i386 4+4 nopae: 500-580 nsec (depending on clang/gcc and -O2/-Os)
>>   even getpid(2) takes 280 nsec.  Add at least 140 more nsec for pae.
>> - i386 3+1: 224 nsec (with gcc 4.2.1 -Os)
>> - i386 FreeBSD-5 UP: 193 nsec (with gcc-3.3.3 -O).
>> - i386 4+4 nopae old library version of clock_gettime() compiled by
>>   clang: 29 nsec.
>>
>> In some tests, the version with the branch was even a cycle or two
>> faster.  In the tests, the branch was always perfectly predicted, so it
>> costs nothing except possibly by changing scheduling in an accidentally
>> good way.  The tests were too small to measure the cost of using branch
>> prediction resources.  I've never noticed a case where 1 more branch
>> causes thrashing.
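
For concreteness, here is a self-contained sketch of the overflow-avoiding
scaling being compared above (struct bintime and bintime_addx() are reduced
copies of the <sys/time.h> versions; bintime_add_scaled() is not an existing
function, it just mirrors the split used by the patch):

	#include <stdint.h>

	struct bintime {
		int64_t		sec;	/* time_t in the real struct */
		uint64_t	frac;
	};

	static void
	bintime_addx(struct bintime *bt, uint64_t x)
	{
		uint64_t u;

		u = bt->frac;
		bt->frac += x;
		if (u > bt->frac)	/* carry out of the fraction */
			bt->sec++;
	}

	/*
	 * scale is a 64-bit binary fraction of a second per counter tick,
	 * so scale * delta overflows 64 bits once delta exceeds roughly
	 * one second worth of ticks.  Splitting scale into its high and
	 * low 32-bit halves keeps both partial products within 64 bits.
	 */
	static void
	bintime_add_scaled(struct bintime *bt, uint64_t scale, uint32_t delta)
	{
		uint64_t x;

		x = (scale >> 32) * delta;	/* high half: 32 x 32 -> 64 */
		bt->sec += x >> 32;		/* whole seconds */
		bintime_addx(bt, x << 32);	/* their fractional part */
		bintime_addx(bt, (scale & 0xffffffff) * delta); /* low half */
	}
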
> About testing such tight loops.  There is a known phenomenon where Intel
> CPUs give absurd times when code in the loop has unsuitable alignment.
> The manifestation of the phenomenon is very surprising and hardly
> controllable.  It is due to the way the CPU front-end prefetches blocks
> of bytes for instruction decoding and jmp locations in the blocks.
>
> The only source explaining it is https://www.youtube.com/watch?v=IX16gcX4vDQ
> a talk by an Intel engineer.

I know a little about such tests since I have written thousands and
interpreted millions of them (mostly automatically).  There are a lot of
other side effects of caching resources that usually make more difference
than alignment.  The most mysterious one that I noticed was apparently due
to alignment, but in a makeworld macro-benchmark.  Minor changes even in
unused functions or data gave differences of about 2% in real time and
many more % in system time.  This only showed up on an old Turion2 (early
Athlon64) system.  I think it is due to limited cache associativity
causing many cache misses by lining up unrelated far-apart code or data
addresses mod some power of 2.  Padding to give the same alignment as the
best case was too hard, but I eventually found a configuration
accidentally giving nearly the best case even with its alignments changed
by small modifications to the areas that I was working on.

>* ...
>>>>>>> -	do {
>>>>>>> -		th = timehands;
>>>>>>> -		gen = atomic_load_acq_int(&th->th_generation);
>>>>>>> -		*bt = th->th_bintime;
>>>>>>> -		bintime_addx(bt, th->th_scale * tc_delta(th));
>>>>>>> -		atomic_thread_fence_acq();
>>>>>>> -	} while (gen == 0 || gen != th->th_generation);
>>>>>>
>>>>>> Duplicating this loop is much better than obfuscating it using inline
>>>>>> functions.  This loop was almost duplicated (except for the delta
>>>>>> calculation) in no less than 17 functions in kern_tc.c (9 tc ones and
>>>>>> 8 fflock ones).  Now it is only duplicated 16 times.
>>>>> How did you count the 16 ?  I can see only 4 instances in the
>>>>> unpatched kern_tc.c, and 3 in patched, but it is 3 and not 1 only
>>>>> because I do not touch ffclock until the patch is finalized.  After
>>>>> that, it would be 1 instance for kernel and 1 for userspace.
>>>>
>>>> Grep for the end condition in this loop.  There are actually 20 of
>>>> these.  I'm counting the loops and not the previously-simple scaling
>>>> operation in it.  The scaling is indeed only done for 4 cases.  I
>>>> prefer the 20 duplications (except I only want about 6 of the
>>>> functions).  Duplication works even better for only 4 cases.
>>> Ok, I merged these as well.  Now there are only four loops left in the
>>> kernel.  I do not think that merging them is beneficial, since they
>>> have sufficiently different bodies.
>>
>> This is exactly what I don't want.
>>>
>>> I disagree with your characterization of it as obfuscation, IMO it
>>> improves the maintainability of the code by reducing the number of
>>> places which need careful inspection of the lock-less algorithm.
>>
>> It makes the inspection and changes more difficult for each instance.
>> General functions are more difficult to work with since they need more
>> args to control them and can't be changed without affecting all callers.
>>
>> In another thread, you didn't like similar churn for removing td args.
> It is not similar.  I do valid refactoring there (in terms of that
> thread, I do not like the term refactoring).  I eliminate a dozen
> instances of a very intricate loop which implements a quite delicate
> lockless algorithm.  Its trickiness can be illustrated by the fact that
> it is the only valid use of thread_fence_acq() which cannot be replaced
> by load_acq() (a similar case is present in sys/seq.h).
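
For readers following along, here is a self-contained sketch of the reader
side of that generation protocol, written with C11 atomics standing in for
the kernel's atomic_load_acq_int() and atomic_thread_fence_acq() (the struct
and names are invented for the example; a writer sets gen to 0, updates the
payload, then publishes a new non-zero gen with a release store):

	#include <stdatomic.h>
	#include <string.h>

	struct shared {
		_Atomic unsigned gen;	/* 0 while an update is in progress */
		int		 payload[4];
	};

	static void
	snapshot(struct shared *sp, int out[4])
	{
		unsigned gen;

		do {
			/* Acquire: payload reads cannot move above this load. */
			gen = atomic_load_explicit(&sp->gen,
			    memory_order_acquire);
			memcpy(out, sp->payload, sizeof(sp->payload));
			/*
			 * The acquire fence keeps the payload reads above it
			 * from being reordered with the re-read of gen below.
			 * A plain load-acquire of gen here would only order
			 * *later* accesses after that load; it would not force
			 * the earlier payload reads to happen before it, which
			 * is the point about thread_fence_acq() vs. load_acq()
			 * made above.
			 */
			atomic_thread_fence(memory_order_acquire);
		} while (gen == 0 ||
		    gen != atomic_load_explicit(&sp->gen, memory_order_relaxed));
	}

In a strictly conforming C11 version the payload would also be read and
written with relaxed atomics; the kernel relies on its own atomic(9)
primitives instead.
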
Small delicate loops are ideal for duplicating.  They are easier to
understand individually and short enough to compare without using diff
to see gratuitous and substantive differences.  Multiple instances are
only hard to write and maintain.  Since these multiple instances are
already written, they are only harder to maintain.

>> XX void
>> XX binuptime(struct bintime *bt)
>> XX {
>> XX @@ -361,7 +383,7 @@
>> XX  		th = timehands;
>> XX  		gen = atomic_load_acq_int(&th->th_generation);
>> XX  		*bt = th->th_offset;
>> XX -		bintime_addx(bt, th->th_scale * tc_delta(th));
>> XX +		bintime_adddelta(bt, th);
>> XX  		atomic_thread_fence_acq();
>> XX  	} while (gen == 0 || gen != th->th_generation);
>> XX }
>>
>> This is the kind of non-churning change that I like.
> Ok.  I made all cases where timehands are read more uniform by moving
> calculations after the generation loop.  This makes the atomic part of
> the functions easier to see, and the loop body has a lower chance to hit
> a generation reset.

I think this change is slightly worse:

- it increases register pressure.  'scale' and 'delta' must be read
  before the loop exit test.  The above order uses them and stores the
  results to memory, so more registers are free for the exit test.  i386
  certainly runs out of registers.  IIRC, i386 now spills 'gen'.  It
  would have to spill something to load 'gen' or 'th' for the test.

- it enlarges the window between reading 'scale' and 'delta' and the
  caller seeing the results.  Preemption in this window gives results
  that may be far in the past.

>>> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
>>> index 2656fb4d22f..7114a0e5219 100644
>>> --- a/sys/kern/kern_tc.c
>>> +++ b/sys/kern/kern_tc.c
>>> ...
>>> @@ -200,22 +201,77 @@ tc_delta(struct timehands *th)
>>>   * the comment in <sys/time.h> for a description of these 12 functions.
>>>   */
>>>
>>> -#ifdef FFCLOCK
>>> -void
>>> -fbclock_binuptime(struct bintime *bt)
>>> +static __inline void
>>> +bintime_helper(struct bintime *bt, uint64_t scale, u_int delta)
>>
>> This name is not descriptive.
>>
>>> +static __inline void
>>> +binnouptime(struct bintime *bt, u_int off)
>>
>> This name is an example of further problems with the naming scheme.
>> The bintime_ prefix used above is verbose, but it is at least a prefix
>> and is in the normal bintime_ namespace.  Here the prefix is 'bin',
>> which is neither of these.  It means bintime_ again, but this duplicates
>> 'time'.
> I agree, and I made a name getthmember for the other function which
> clearly reflects its operation.  For this one, I ended up with
> bintime_off().

The 'get' name is another problem.  I would like to remove all the
get*time functions and not add new names starting with 'get'.  The
library implementation already doesn't bother optimizing the get*time
functions, but always uses the hardware timecounter.

getfoo() is a more natural name than foo_get() for the action of getting
foo, but the latter is better for consistency, especially in code that
puts the subsystem name first in nearby code.

The get*time functions would be better if they were more like
time_second.
Note that time_second is racy if time_t is too large for the arch, so
that accesses to it are not atomic, as happens on 32-bit arches with
premature 64-bit time_t.  However, in this 32/64 case, the race is only
run every 136 years, with the next event scheduled in 2038, so this race
is even less important now than other events scheduled in 2038.

Bintimes are 96 or 128 bits, so directly copying a global like
time_second for them would race every 1/2**32 second on 32-bit arches or
every 1 second on 64-bit arches.  Most of the loops on the generation
count are for fixing these races, but perhaps a simpler method would
work.  On 64-bit arches with atomic 64-bit accesses on 32-bit boundaries,
the following would work:
- set the lower 32 bits of the fraction to 0, or ignore them
- load the higher 32 bits of the fraction and the lower 32 bits of the
  seconds (one way to pack these is sketched after this list)
- race once every 136 years starting in 2038 reading the higher 32 bits
  of the seconds non-atomically.
- alternatively, break instead of racing in 2038 by setting the higher
  32 bits to 0.  This is the same as using sbintimes instead of bintimes.
- drop a few more lower bits by storing a right-shifted value.  Right
  shifting by just 1 gives a race frequency of once per 272 years, with
  the next one in 2006.
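
A hypothetical sketch of that packing (invented names; not existing kernel
code): the low 32 bits of the seconds and the high 32 bits of the fraction
share one 64-bit word, so a coarse bintime can be read with a single atomic
64-bit load, at a resolution of 2**-32 s and with the wrap caveats listed
above:

	#include <stdatomic.h>
	#include <stdint.h>

	/* Updated by the clock code at each windup (single writer assumed). */
	static _Atomic uint64_t coarse_time;

	static void
	coarse_store(int64_t sec, uint64_t frac)
	{
		uint64_t packed;

		/* High word: low 32 bits of sec; low word: high 32 of frac. */
		packed = ((uint64_t)(uint32_t)sec << 32) | (frac >> 32);
		atomic_store_explicit(&coarse_time, packed,
		    memory_order_release);
	}

	static void
	coarse_load(uint32_t *sec32, uint64_t *frac)
	{
		uint64_t packed;

		packed = atomic_load_explicit(&coarse_time,
		    memory_order_acquire);
		*sec32 = packed >> 32;
		*frac = (packed & 0xffffffffULL) << 32;	/* low 32 bits dropped */
	}
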
> diff --git a/sys/kern/kern_tc.c b/sys/kern/kern_tc.c
> index 2656fb4d22f..8d12847f2cd 100644
> --- a/sys/kern/kern_tc.c
> +++ b/sys/kern/kern_tc.c
> @@ -200,20 +202,56 @@ tc_delta(struct timehands *th)
>   * the comment in <sys/time.h> for a description of these 12 functions.
>   */
>
> -#ifdef FFCLOCK
> -void
> -fbclock_binuptime(struct bintime *bt)
> +static __inline void
> +bintime_off(struct bintime *bt, u_int off)
>  {
>  	struct timehands *th;
> -	unsigned int gen;
> +	struct bintime *btp;
> +	uint64_t scale, x;
> +	u_int delta, gen, large_delta;
>
>  	do {
>  		th = timehands;
>  		gen = atomic_load_acq_int(&th->th_generation);
> -		*bt = th->th_offset;
> -		bintime_addx(bt, th->th_scale * tc_delta(th));

You didn't fully obfuscate this by combining this function with
getthmember() so as to deduplicate the loop.

> +		btp = (struct bintime *)((vm_offset_t)th + off);

Ugly conversion to share code.  This is technically incorrect.  Improving
the casts gives:

	btp = (void *)(uintptr_t)((uintptr_t)(void *)th + off);

but this assumes that arithmetic on the intermediate integer does what is
expected.  uintptr_t is only guaranteed to work when the intermediate
representation held in it is not adjusted.

Fixing the API gives:

	static __inline void
	bintime_off(struct bintime *btp, struct bintime *base_btp)

where base_btp is &th->th_bintime or &th->th_offset.

(th_offset and th_bintime are badly named.  th_offset is really a base
time and the offset is tc_delta().  th_bintime is also a base time.  It
is the same as th_offset with another actual offset (the difference
between UTC and local time) already added to it as an optimization.)

In old versions, th_bintime didn't exist, but the related struct members
th_nanotime and th_microtime existed, since these benefit more from not
converting on every call.

My old version even documents the struct members, while -current still
has no comments.  The comments were lost to staticization.  My version
mostly adds "duh" to the banal comments after recovering them:

XX /*
XX  * XXX rotted comment cloned from <sys/timetc.h>.
XX  *
XX  * th_counter is undocumented (duh).
XX  *
XX  * th_adjustment [PPM << 16] which means that the smallest unit of
XX  * correction you can apply amounts to 481.5 usec/year.
XX  *
XX  * th_scale is undocumented (duh).
XX  *
XX  * th_offset_count is the contents of the counter which corresponds to
XX  * the rest of the offset_* values.
XX  *
XX  * th_offset is undocumented (duh).
XX  *
XX  * th_microtime is undocumented (duh).
XX  *
XX  * th_nanotime is undocumented (duh).
XX  *
XX  * XXX especially massive bitrot here.  "three" is now "many"...
XX  * Each timecounter must supply an array of three timecounters.  This
XX  * is needed to guarantee atomicity in the code.  Index zero is used to
XX  * transport modifications, for instance done with sysctl, into the
XX  * timecounter being used in a safe way.  Such changes may be adopted
XX  * with a delay of up to 1/HZ.  Index one and two are used alternately
XX  * for the actual timekeeping.
XX  *
XX  * th_generation is undocumented (duh).
XX  *
XX  * th_next is undocumented (duh).
XX  */

> +		*bt = *btp;
> +		scale = th->th_scale;
> +		delta = tc_delta(th);
> +		large_delta = th->th_large_delta;

I had forgotten that th_scale is so volatile (it may be adjusted on every
windup).  th_large_delta is equally volatile.  So moving the calculation
outside of the loop gives even more register pressure than I noticed
above.

>  		atomic_thread_fence_acq();
>  	} while (gen == 0 || gen != th->th_generation);
> +
> +	if (__predict_false(delta < large_delta)) {
> +		/* Avoid overflow for scale * delta. */
> +		x = (scale >> 32) * delta;
> +		bt->sec += x >> 32;
> +		bintime_addx(bt, x << 32);
> +		bintime_addx(bt, (scale & 0xffffffff) * delta);
> +	} else {
> +		bintime_addx(bt, scale * delta);
> +	}
> +}
> +
> +static __inline void
> +getthmember(void *out, size_t out_size, u_int off)
> +{
> +	struct timehands *th;
> +	u_int gen;
> +
> +	do {
> +		th = timehands;
> +		gen = atomic_load_acq_int(&th->th_generation);
> +		memcpy(out, (char *)th + off, out_size);

This isn't so ugly or technically incorrect.  Now the object is generic,
so the reference to it should be passed as (void *objp, size_t objsize)
instead of the type-safe (struct bintime *base_btp).

> +		atomic_thread_fence_acq();
> +	} while (gen == 0 || gen != th->th_generation);
> +}

I can see a useful use of copying methods like this for sysctls.  All
sysctl accesses except possibly for aligned register_t's were originally
racy, but we sprinkled mutexes for large objects and reduced race windows
for smaller objects.  E.g., sysctl_handle_long() still makes a copy with
no locking, but this has no effect except on my i386-with-64-bit-longs,
since longs have the same size as ints so are as atomic as ints on 32-bit
arches.  sysctl_handle_64() uses the same method.  It works to reduce the
race window on 32-bit arches.  sysctl_handle_string() makes a copy to
malloc()ed storage.  memcpy() to that risks losing the NUL terminator,
and a subsequent strlen() on the copy gives a buffer overrun if the result
has no terminators.  sysctl_handle_opaque() uses a generation count
method, like the one used by timecounters before the ordering bugs were
fixed, but even more primitive and probably even more in need of ordering
fixes.

It would be good to fix all sysctl using the same generation count method
as above.  A loop at the top level might work.  I wouldn't like a
structure like the above where the top level calls individual sysctl
functions which do nothing except wrap themselves in a generic function
like the above.
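
As an aside, a hypothetical sketch of what the writer side of that
generation count method could look like for a sysctl-exported 64-bit
statistic (invented names, with C11 atomics standing in for the kernel's
atomic_store_rel_int()/atomic_thread_fence_rel(); the reader would loop
like the earlier snapshot() sketch, and a single writer is assumed):

	#include <stdatomic.h>
	#include <stdint.h>

	static uint64_t		stat64;		/* may tear on 32-bit arches */
	static _Atomic unsigned	stat_gen = 2;	/* 0 means "update in progress" */

	static void
	stat64_update(uint64_t new_value)
	{
		unsigned gen;

		gen = atomic_load_explicit(&stat_gen, memory_order_relaxed) + 1;
		if (gen == 0)			/* never publish 0 */
			gen = 1;
		/* Mark the payload unstable before touching it. */
		atomic_store_explicit(&stat_gen, 0, memory_order_relaxed);
		atomic_thread_fence(memory_order_release);
		stat64 = new_value;
		/* Publish: the release store makes the payload visible first. */
		atomic_store_explicit(&stat_gen, gen, memory_order_release);
	}
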
The patched code does give this structure to clock_gettime() calls: the
top level converts the clock id to a function, and the function then
essentially converts back to another clock id (the offset of the relevant
field in timehands), especially for the get*time functions where the call
just copies the relevant field to userland.  Unfortunately, the individual
time functions are called directly in the kernel.  I prefer this to
generic APIs based on ids, so that callers can use simple efficient APIs
like nanouptime() instead of complicated inefficiencies like:

	kern_clock_gettime_generic(int clock_id = CLOCK_MONOTONIC,
	    int format_id = CLOCK_TYPE_TIMESPEC,
	    int precision = CLOCK_PRECISION_NSEC, void *dstp = &ts);

Bruce