Date: Wed, 6 Mar 2019 00:19:38 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans <brde@optusnet.com.au>
Cc: Konstantin Belousov <kostikbel@gmail.com>, Mark Millard <marklmi@yahoo.com>,
    freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
    FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: TSC "skew" (was: Re: powerpc64 head -r344018 stuck sleeping problems:
    th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed])
Message-ID: <20190305223415.U1563@besplex.bde.org>
In-Reply-To: <20190305031010.I4610@besplex.bde.org>
References: <20190301194217.GB68879@kib.kiev.ua> <20190302071425.G5025@besplex.bde.org>
    <20190302105140.GC68879@kib.kiev.ua> <20190302225513.W3408@besplex.bde.org>
    <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org>
    <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org>
    <20190303161635.GJ68879@kib.kiev.ua> <20190304043416.V5640@besplex.bde.org>
    <20190304114150.GM68879@kib.kiev.ua> <20190305031010.I4610@besplex.bde.org>
On Tue, 5 Mar 2019, Bruce Evans wrote:

> On Mon, 4 Mar 2019, Konstantin Belousov wrote:
>* [... shift for bogus TSC-low timecounter]
>> I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
>> Otherwise, I think, some multi-socket machines would start showing the
>> detectable backward-counting bintime().  At frequencies of 4GHz and
>> above (Intel has 5GHz part numbers) I do not think that the stability of
>> a 100MHz crystal and on-board traces is enough to avoid that.
>
> I think it is just a kludge that reduced the problem before it was fixed
> properly using fences.
>
> Cross-socket latency is over 100 cycles according to jhb's tscskew
> benchmark: on Haswell 4x2:
>
> CPU | TSC skew (min/avg/max/stddev)
> ----+------------------------------
>   0 |    0    0    0    0.000
>   1 |   24   49   84   14.353
>   2 |  164  243  308   47.811
>   3 |  164  238  312   47.242
>   4 |  168  242  332   49.593
>   5 |  168  243  324   48.722
>   6 |  172  242  320   52.596
>   7 |  172  240  316   53.014
>
> freefall is similar.  Latency is apparently measured relative to CPU 0.
> It is much lower to CPU 1 since that is on the same core.
>
> I played with this program a lot 3 and a half years ago, but forgot
> most of what I learned :-(.  I tried different fencing in it.  This
> seems to make little difference when the program is rerun.  With the
> default TESTS = 1024, the min skew sometimes goes negative on freefall,
> but with TESTS = 1024000 that doesn't happen.  This is the opposite
> of what I would expect.  freefall has load average about 1.

I understand this program again.  First, its name is actually tscdrift.
I tested the 2015 version, and this version is still in
/usr/src/tools/tools/tscdrift/tscdrift.c, with no changes except to the
copyright (rgrimes wouldn't like this) and to $FreeBSD$.

The program doesn't actually measure either TSC drift or TSC skew,
except indirectly.  What it actually measures is the IPC (Inter-Process-
Communication) time for synchronizing the drift and skew measurements,
except that bugs or intentional sloppiness in its synchronization also
make it give an indirect measurement of similar bugs or sloppiness in
normal use.

After changing TESTS from 1024 to 1024000, it shows large errors in the
negative direction, as expected from either large negative skew or
program bugs.  This is on freefall:

XX CPU | TSC skew (min/avg/max/stddev)
XX ----+------------------------------
XX   0 |     0    0      0    0.000
XX   1 | -6148  108  10232   46.871
XX   2 |   114  209  95676  163.359
XX   3 |    96  202  47835  101.250
XX   4 | -2223  207  34017  117.257
XX   5 | -2349  206  33837  106.259
XX   6 | -2664  213  33579   96.048
XX   7 | -2451  212  49242  126.428

The negative "skews" occur because the server and the clients (1 client
at a time) read the TSC with uncontrolled timing after the server opens
the gate for this read (gate = 2).  The IPC time is about 200 cycles to
CPUs on different cores.  So when neither thread is preempted, the TSC
on the server is about 200 cycles in advance.  Sometimes the server is
preempted, so it reads its TSC later than the client (a maximum of
about 6148 cycles later in this test).  More often the client is
preempted, since the IPC time is much larger than the time between the
server opening the gate and the server reading its TSC.

The server is also missing fencing for its TSC read, so this read may
appear to occur several cycles before opening the gate.  This gives an
error in the positive direction for the reported "skew" (the error is
actually in the positive direction for the reported IPC time).
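[Not from the mail: the following is a minimal sketch of the gate
handshake as I read it, with the fixes discussed below already folded
in.  fenced_rdtsc(), the min-tracking and the compile hint are my own
simplifications, and the pinning of the two threads to specific CPUs
(which the real tool presumably does via cpuset) is omitted, so this is
an illustration of the protocol, not tscdrift.c itself.]

/*
 * Sketch of the tscdrift handshake (a reconstruction, not the real
 * tool).  The server reads its TSC, opens the gate, and the client
 * reads its TSC as soon as it sees the gate open, so the reported
 * "skew" is really the one-way IPC time plus any true TSC skew.
 * Compile hint (untested): cc -O2 -o sketch sketch.c -lpthread
 */
#include <sys/param.h>
#include <machine/cpu.h>	/* cpu_spinwait() */
#include <machine/cpufunc.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define	barrier()	__asm __volatile("" ::: "memory")
#define	TESTS		1024000

static volatile int gate;
static volatile uint64_t thread_tsc;

static inline uint64_t
fenced_rdtsc(void)
{
	uint32_t hi, lo;

	/* The lfence keeps rdtsc from appearing to execute early. */
	__asm __volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

static void *
client(void *arg)
{
	int j;

	for (j = 0; j < TESTS; j++) {
		gate = 1;		/* signal readiness */
		while (gate == 1)	/* wait for the server to open */
			cpu_spinwait();
		barrier();
		thread_tsc = fenced_rdtsc();
		barrier();		/* store thread_tsc before gate */
		gate = 3;		/* publish the result */
		while (gate == 3)	/* wait for the ack */
			cpu_spinwait();
	}
	return (NULL);
}

int
main(void)
{
	pthread_t td;
	int64_t ipc, min_ipc = INT64_MAX;
	uint64_t tsc;
	int j;

	/* A real measurement would also pin both threads to fixed CPUs. */
	pthread_create(&td, NULL, client, NULL);
	for (j = 0; j < TESTS; j++) {
		while (gate != 1)	/* wait for the client */
			cpu_spinwait();
		barrier();
		tsc = fenced_rdtsc();	/* read *before* opening the gate */
		barrier();		/* order the read before gate = 2 */
		gate = 2;		/* open the gate */
		while (gate != 3)	/* wait for the client's TSC read */
			cpu_spinwait();
		ipc = (int64_t)(thread_tsc - tsc);
		if (ipc < min_ipc)
			min_ipc = ipc;
		gate = 4;		/* ack; the client loops around */
	}
	pthread_join(td, NULL);
	printf("min \"skew\" (mostly IPC time): %jd cycles\n",
	    (intmax_t)min_ipc);
	return (0);
}

With the gate opened only after the server's fenced read, a negative
result can come only from preemption or from true negative skew, which
is the point of the fixes described next.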
It would be useful to measure this fencing error (the server's unfenced
TSC read) by intentionally omitting fencing, but currently it is just a
small amount of noise on top of the noise from preemption.

After fixing the synchronization:

XX CPU | TSC skew (min/avg/max/stddev)
XX ----+------------------------------
XX   0 |   0    0       0    0.000
XX   1 |  33   62   49161   57.652
XX   2 | 108  169   33678   73.456
XX   3 | 108  171   43053  119.256
XX   4 | 141  169   41289  114.567
XX   5 | 141  169   40035  112.755
XX   6 | 132  186  147099  269.449
XX   7 | 153  183  431526  436.689

Synchronization apparently takes a long time, especially to other
cores.  The minimum and average now give the best-case IPC time very
accurately.  The average is 20-30 cycles smaller than before, probably
because I fixed the fencing.  The maximum and standard deviation are
garbage noise from preemption.  Preemption should be disabled somehow.

Large drifts and skews would show up as nonsense values for the minimum
IPC time.  Small drifts would soon give large skews.  To measure small
skews, change the CPU of the server to measure the minimum IPC time in
the opposite direction.

Fixes:

XX --- tscdrift.c	2015-07-10 06:22:36.505493000 +0000
XX +++ w.c	2019-03-05 11:22:22.232341000 +0000
XX @@ -32,6 +32,15 @@
XX  #include <sys/param.h>
XX  #include <sys/cpuset.h>
XX  #include <machine/atomic.h>
XX +/*
XX + * XXX: atomic.h is not used.  Instead we depend on x86 memory ordering and
XX + * do direct assignments to and comparisons of 'gate', and sometimes add
XX + * memory barriers.  The correct atomic ops would do much the same with
XX + * clearer spelling.  The 'lock' prefix is never needed and the barriers are
XX + * only to get program order so as to give acq or rel semantics for either
XX + * the loads, the stores or for buggy unfenced rdtsc's.  Fences also give
XX + * program order, so some of the explicit barriers are redundant.
XX + */
XX  #include <machine/cpu.h>
XX  #include <machine/cpufunc.h>
XX  #include <assert.h>
XX @@ -45,7 +54,7 @@
XX  
XX  #define barrier() __asm __volatile("" ::: "memory")
XX  
XX -#define TESTS 1024
XX +#define TESTS 1024000
XX  
XX  static volatile int gate;
XX  static volatile uint64_t thread_tsc;
XX @@ -74,12 +83,12 @@
XX  	gate = 1;
XX  	while (gate == 1)
XX  		cpu_spinwait();
XX -	barrier();
XX  
XX +	barrier();
XX  	__asm __volatile("lfence");
XX  	thread_tsc = rdtsc();
XX -
XX  	barrier();
XX +
XX  	gate = 3;
XX  	while (gate == 3)
XX  		cpu_spinwait();

This is the client.  The explicit barriers are confusing, and the blank
lines are in all the wrong places.  All the accesses to 'gate' need to
be in program order.  x86 memory ordering gives this automatically at
the hardware level.  'gate' being volatile gives it at the compiler
level.  Both rdtsc() and storing the result to thread_tsc need to be in
program order.  lfence() in cpufunc.h has a memory clobber which gives
the former, but we use a direct asm and need a barrier() before it to
do the same thing.  Then we need another barrier() after the assignment
to thread_tsc so that the store for this is before the store to 'gate'
(I think gate being volatile doesn't give this).  This also keeps the
rdtsc() in program order (the asm for rdtsc() doesn't have a memory
clobber; I haven't noticed care about this being taken anywhere else).

Summary: only style changes in this section.
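[Not from the mail: the XXX comment added in the first hunk says that
the correct atomic ops would do the same thing with clearer spelling.
As an illustration of that point, here is the client's critical section
rewritten with C11 <stdatomic.h> acq/rel operations; client_read() is a
hypothetical name, and this is my rewording of the ordering argument,
not code from the thread.]

#include <stdatomic.h>
#include <stdint.h>

static _Atomic int gate;
static _Atomic uint64_t thread_tsc;

static inline uint64_t
fenced_rdtsc(void)
{
	uint32_t hi, lo;

	/*
	 * lfence is still needed: acquire semantics order memory
	 * accesses, but rdtsc is not a memory access, so only the
	 * fence stops it from appearing to execute early.
	 */
	__asm __volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

static void
client_read(void)
{
	/* Acquire: the TSC read cannot move before observing gate == 2. */
	while (atomic_load_explicit(&gate, memory_order_acquire) != 2)
		;			/* spin; cpu_spinwait() omitted */

	atomic_store_explicit(&thread_tsc, fenced_rdtsc(),
	    memory_order_relaxed);
	/* Release: the thread_tsc store is visible before gate = 3 is. */
	atomic_store_explicit(&gate, 3, memory_order_release);
}

On x86 the acquire load and the release store both compile to plain
movs, which matches the comment's point that the 'lock' prefix is never
needed; only compiler-level reordering is being constrained.

The remaining hunk moves the server's opening of the gate: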
XX @@ -139,12 +148,13 @@
XX  	for (j = 0; j < TESTS; j++) {
XX  		while (gate != 1)
XX  			cpu_spinwait();
XX -		gate = 2;
XX -		barrier();

Move down opening the gate so that it is not opened until after reading
the TSC on the server.

XX  
XX +		barrier();
XX +		__asm __volatile("lfence");

Fencing is not critical here.  Using an early TSC value just gives a
larger reported IPC time.  The barrier is important for getting program
order of rdtsc().

XX  		tsc = rdtsc();
XX -
XX  		barrier();

This barrier is still associated with the TSC read, and the blank line
is moved to reflect this.  Here rdtsc() must occur in program order,
but storing to tsc can be after storing to 'gate'.  The barrier gives
ordering for the store to tsc too.

XX +
XX +		gate = 2;
XX  		while (gate != 3)
XX  			cpu_spinwait();
XX  		gate = 4;

I tried some locked atomic ops on 'gate' and mfence instead of lfence
to try to speed up the IPC.  Nothing helped.  We noticed long ago that
fence instructions tend to be even slower than locked atomic ops for
mutexes, and jhb guessed that this might be because fence instructions
don't do so much to force out stores.

Similar IPC is needed for updating timecounters.  This benchmark
indicates that after an update, the change usually won't be visible on
other CPUs for 100+ cycles.  Since updates are rare, this isn't much of
a problem.

Similar IPC is needed for comparing timecounters across CPUs.  Any
activity on different CPUs is incomparable without synchronization to
establish an ordering.  Since fences give ordering relative to memory,
and timecounters don't use anything except fences and memory order for
the generation count to establish their order, the synchronization for
comparing timecounters (or clock_gettime() at higher levels) must also
use memory order.
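[Not from the mail: for context, the generation count referred to above
is the heart of the timecounter read loop.  The following is a
simplified sketch of the pattern used by binuptime() in kern_tc.c,
assuming kernel context; the timehands structure is abbreviated and
tc_delta() is stubbed, so this is illustrative rather than verbatim.]

#include <sys/types.h>
#include <sys/time.h>		/* struct bintime, bintime_addx() */
#include <machine/atomic.h>	/* atomic_load_acq_int(), acq fence */

struct timehands {
	volatile u_int	th_generation;	/* 0 while an update is in flight */
	uint64_t	th_scale;	/* 2^64 / tc_frequency */
	struct bintime	th_offset;	/* uptime at the last update */
	/* ... timecounter and offset-count fields omitted ... */
};

static struct timehands *volatile timehands;

/* Stub for the real helper that reads the hardware counter. */
static u_int
tc_delta(struct timehands *th)
{
	return (0);	/* real code: hw count minus th's offset count */
}

static void
binuptime_sketch(struct bintime *bt)
{
	struct timehands *th;
	u_int gen;

	do {
		th = timehands;
		/* Acquire: snapshot reads cannot move before the gen read. */
		gen = atomic_load_acq_int(&th->th_generation);
		*bt = th->th_offset;
		bintime_addx(bt, th->th_scale * tc_delta(th));
		/* Acquire fence: finish the reads before re-checking gen. */
		atomic_thread_fence_acq();
	} while (gen == 0 || gen != th->th_generation);
}

The only cross-CPU ordering here is the acquire semantics around the
generation reads, which is why comparing clock_gettime() results across
CPUs inherits the memory-order IPC cost measured above.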
If the synchronization takes over 100 cycles, then smaller TSC skews
don't matter much (they never break monotonicity, and only show up as
time differences varying by 100 or so cycles depending on which CPU
measures the start and end events).  Small differences don't matter at
all.

Skews may be caused by the TSCs actually being out of sync, or by
hardware only syncing them on average (hopefully with small jitter), or
by bugs like missing fences.  Missing fences don't matter much provided
unserialized TSC reads aren't too far in the past.  E.g., if we had a
guarantee of only 10 cycles in the past for the TSC and 160 cycles for
IPCs to other CPUs, then we could omit the fences.  But IPCs to the
same core are 100 cycles faster, so the margin is too close for
omitting fences in all cases.  Similarly for imperfect hardware.
Hopefully its skew is in the +-1 cycle range, but even +-10 isn't a
problem if the IPC time is a bit larger than 10, and even +-100 if the
IPC time is a bit larger than 100.  And the problem scales nicely with
the distance of the CPUs -- when they are further apart so that
hardware synchronization of their TSCs is more difficult, the IPC time
is large too.

Hmm, that is only with physical IPCs.  Since timecounters use physical
IPCs for everything, they can't work right with virtual
synchronization.  Something like ntpd is needed to compare times across
even small local networks.  It does virtual synchronization by
compensating for delays.

Bruce