Date: Fri, 15 Mar 2019 06:37:26 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: Bruce Evans <brde@optusnet.com.au>, Mark Millard <marklmi@yahoo.com>,
    freebsd-hackers Hackers <freebsd-hackers@freebsd.org>,
    FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject: Re: TSC "skew" (was: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed])
Message-ID: <20190315034923.S7485@besplex.bde.org>
In-Reply-To: <20190313190558.GB2492@kib.kiev.ua>
References: <20190302225513.W3408@besplex.bde.org> <20190302142521.GE68879@kib.kiev.ua>
    <20190303041441.V4781@besplex.bde.org> <20190303111931.GI68879@kib.kiev.ua>
    <20190303223100.B3572@besplex.bde.org> <20190303161635.GJ68879@kib.kiev.ua>
    <20190304043416.V5640@besplex.bde.org> <20190304114150.GM68879@kib.kiev.ua>
    <20190305031010.I4610@besplex.bde.org> <20190305223415.U1563@besplex.bde.org>
    <20190313190558.GB2492@kib.kiev.ua>
On Wed, 13 Mar 2019, Konstantin Belousov wrote:

> On Wed, Mar 06, 2019 at 12:19:38AM +1100, Bruce Evans wrote:
>> [... tscdrift.c]
>> I understand this program again.  First, its name is actually tscdrift.
>> I tested the 2015 version, and this version is still in
>> /usr/src/tools/tools/tscdrift/tscdrift.c, with no changes except to
>> the copyright (rgrimes wouldn't like this) and to $FreeBSD$.
>>
>> The program doesn't actually measure either TSC drift or TSC skew,
>> except indirectly.  What it actually measures is the IPC (Inter-
>> Process-Communication) time for synchronizing the drift and skew
>> measurements, except that bugs or intentional sloppiness in its
>> synchronization also make it give an indirect measurement of similar
>> bugs or sloppiness in normal use.
>>
>> After changing TESTS from 1024 to 1024000, it shows large errors in
>> the negative direction, as expected from either large negative skew or
>> program bugs: this is on freefall:
>>
>> XX CPU | TSC skew (min/avg/max/stddev)
>> XX ----+------------------------------
>> XX   0 |     0    0      0    0.000
>> XX   1 | -6148  108  10232   46.871
>> XX   2 |   114  209  95676  163.359
>> XX   3 |    96  202  47835  101.250
>> XX   4 | -2223  207  34017  117.257
>> XX   5 | -2349  206  33837  106.259
>> XX   6 | -2664  213  33579   96.048
>> XX   7 | -2451  212  49242  126.428
> Note that freefall is single-socket.  My belief is that due to the
> construction of the RDTSC on Intels, it is impossible for the counters
> to become skewed on a single socket.  All cores are fed from the same
> input signal, and most likely even read the same uncore counter.
> The latter is less likely because RDTSC latency is quite low, but there
> might be additional hw tricks.

The large negative numbers show that even for single-socket, there are
really large errors if times are compared without cross-CPU
synchronization by the program.  Initial skews in hardware are presumably
smaller.

If the hardware skew drifts, then there is a large problem for the
software to compensate for.  I think that is unlikely to be a problem.

In a recent commit, mav@ wrote that some Skylake systems only return even
values in rdtsc(), and some seem to have a much lower resolution of 180+
(?) cycles.  180 cycles might be from the skew being that much and the
hardware refusing to return values closer than that, perhaps even on the
same CPU.  I already pointed out that discarding bits as in TSC-low
doesn't work to avoid comparing values that are too close.  Rather the
reverse.  Compensating for skews needs as much accuracy as possible,
starting with measuring them.

> On the other hand, for multi-socket machines, I do not think there is
> anything except the external clock signal which would ensure that the
> counters stay in sync.
>
> I tried to imagine whether there is any shared hardware on a
> multi-socket Intel system which would give equal latency for accesses
> from different sockets, and it seems that there is no such hardware.
> Then it is truly impossible to observe the skew.

Yes, it has relativistic problems too.  A distance of 1 foot and a speed
of 4GHz gives a skew of at least 4 cycles in "absolute" time.

> It might be possible to measure round-trip time separately, and then
> subtract it from the measured skew.

The hardware can do that too, or at least provide some support.

I think the "absolute" time must be determined by a distributed clock.
Since the system is not usually under much acceleration, the relativistic
problems are small.  The clock has a knowable constant propagation speed.
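To make the round-trip idea concrete, here is a minimal sketch of the
usual subtraction (my own code and names, not anything in the tree):
CPU A timestamps a store, CPU B timestamps the moment it observes the
store and echoes its reading back, and A timestamps the reply.  If the
two legs of the trip are roughly symmetric, B's reading should line up
with the midpoint of A's interval, so the IPC delay cancels and what is
left is mostly skew:

#include <stdatomic.h>
#include <stdint.h>

static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t lo, hi;

	/* LFENCE is the documented way to serialize rdtsc on Intel. */
	__asm __volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

static _Atomic uint64_t ping, pong;	/* 0 means "not written yet" */

/* One shot; runs pinned on CPU A.  Returns B's TSC minus A's TSC. */
static int64_t
estimate_skew(void)
{
	uint64_t t1, t2, t4;

	t1 = rdtsc_lfence();
	atomic_store_explicit(&ping, t1, memory_order_release);
	while ((t2 = atomic_load_explicit(&pong, memory_order_acquire)) == 0)
		;			/* wait for B's timestamp */
	t4 = rdtsc_lfence();
	/*
	 * t1 + (t4 - t1) / 2 is A's clock at the midpoint of the round
	 * trip.  With symmetric legs, B read its TSC at about that
	 * moment, so the IPC delay cancels and the rest is skew plus
	 * noise from preemption and from asymmetry of the two legs.
	 */
	return ((int64_t)(t2 - (t1 + (t4 - t1) / 2)));
}

/* One shot; runs pinned on CPU B. */
static void
reflect(void)
{

	while (atomic_load_explicit(&ping, memory_order_acquire) == 0)
		;			/* wait for A's store */
	atomic_store_explicit(&pong, rdtsc_lfence(), memory_order_release);
}

The propagation-delay part of this is tiny (light covers about 1 foot
per nanosecond, hence the ~4 cycles at 4GHz above); the part such an
estimate cannot remove is any asymmetry between the two legs of the
IPC path.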
>> The negative "skews" occur because the server and the clients (1 client
>> at a time) read the TSC with uncontrolled timing after the server opens
>> the gate for this read (gate = 2).  The IPC time is about 200 cycles to
>> CPUs on different cores.  So when neither thread is preempted, the TSC
>> on the server is about 200 cycles in advance.  Sometimes the server is
>> preempted, so it reads its TSC later than the client (a maximum of
>> about 6148 cycles later in this test).  More often the client is
>> preempted, since the IPC time is much larger than the time between the
>> server opening the gate and the server reading its TSC.
>>
>> The server is also missing fencing for its TSC read, so this read may
>> appear to occur several cycles before opening the gate.  This gives an
>> error in the positive direction for the reported "skew" (the error is
>> actually in the positive direction for the reported IPC time).  It
>> would be useful to measure this error by intentionally omitting
>> fencing, but currently it is just a small amount of noise on top of the
>> noise from preemption.
>>
>> After fixing the synchronization:
>>
>> XX CPU | TSC skew (min/avg/max/stddev)
>> XX ----+------------------------------
>> XX   0 |    0    0       0    0.000
>> XX   1 |   33   62   49161   57.652
>> XX   2 |  108  169   33678   73.456
>> XX   3 |  108  171   43053  119.256
>> XX   4 |  141  169   41289  114.567
>> XX   5 |  141  169   40035  112.755
>> XX   6 |  132  186  147099  269.449
>> XX   7 |  153  183  431526  436.689
>> ...
>> I tried some locked atomic ops on 'gate' and mfence instead of lfence
>> to try to speed up the IPC.  Nothing helped.  We noticed long ago that
>> fence instructions tend to be even slower than locked atomic ops for
>> mutexes, and jhb guessed that this might be because fence instructions
>> don't do so much to force out stores.
> I am not sure I follow.  MFENCE is documented by wording that implies,
> without any doubt, that store buffers are flushed before the
> instruction is retired.  It is not so obvious for SFENCE, which sounds
> like a real fence instead of a full flush, at least for normal
> write-back memory, where it is a NOP as far as the ISA is concerned.

The program uses LFENCE partly because it is the documented method of
serializing rdtsc on Intel CPUs.  It only gives serialization on 1 CPU.
The locking protocol gives serialization of memory accesses and rdtsc's
across CPUs (after I fixed it).  Only serialization of the rdtsc
instructions -- their results may still be out of order if there is skew
and the skew is larger than the IPC time.

MFENCE is the documented method for serializing rdtsc on (some?) AMD
CPUs; it is just slower on freefall's Xeon CPUs.

SFENCE has little effect on freefall's Xeon CPUs.  It apparently doesn't
serialize rdtsc and is useless for the locking protocol.

The locking protocol uses only load_acq, store_rel, fence_acq and
fence_rel.  These are disguised as simple C operations and compiler
memory barriers.  FENCE instructions apparently don't work for speeding
up the store buffers.
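To show what "disguised as simple C operations and compiler memory
barriers" can look like, here is a sketch (my own names, not the patched
tscdrift.c): because of TSO on x86, a plain volatile load already behaves
as load_acq and a plain volatile store as store_rel, so only compiler
barriers are needed, and the gate then orders the two fenced rdtsc's:

#include <stdint.h>

uint64_t	rdtsc_lfence(void);	/* lfence; rdtsc, as in the sketch above */

#define	barrier()	__asm __volatile("" ::: "memory")

static volatile int gate;		/* 0: closed, 2: opened by the server */
static volatile uint64_t server_tsc, client_tsc;

/* Server, pinned to the reference CPU. */
static void
server_once(void)
{

	server_tsc = rdtsc_lfence();
	barrier();		/* store_rel: keep the TSC read before the store */
	gate = 2;		/* a plain store is a release store on x86 */
}

/* Client, pinned to the CPU being measured. */
static void
client_once(void)
{

	while (gate != 2)	/* a plain load is an acquire load on x86 */
		;
	barrier();		/* load_acq: keep the TSC read after the load */
	client_tsc = rdtsc_lfence();
	/*
	 * client_tsc - server_tsc is the IPC time plus any real skew.
	 * It can only go negative if the client's TSC is genuinely behind.
	 */
}

With the server's read fenced (which the original program's was not),
negative results can no longer come from the server's rdtsc leaking
ahead of the gate, so the reported minimum means real skew rather than
measurement noise.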
> It is known and documented in optimization manuals that locked
> operations are much more efficient, but locked ops are only documented
> to ensure ordering, not flushing.  So SFENCE is not suitable as our
> barrier.

I tried them too.  What are they more efficient for?  Is it just that
they are local while fences are global?

> And, the second point, LFENCE there does not work as a barrier for IPC.
> It only ensures that RDTSC is not started earlier than the previous
> instructions.  No store buffer flushing is done.

Yes, I know that, and I tried to find a way to flush store buffers
faster.

Hmm, unfenced rdtsc is correct and good as an optimization in some
contexts.  This program is an example.  It doesn't matter for
monotonicity or for getting an upper bound on time differences if the
start time is in the past.

>> Similar IPC is needed for updating timecounters.  This benchmark
>> indicates that after an update, the change usually won't be visible on
>> other CPUs for 100+ cycles.  Since updates are rare, this isn't much of
>> a problem.
>>
>> Similar IPC is needed for comparing timecounters across CPUs.  Any
>> activity on different CPUs is incomparable without synchronization to
>> establish an ordering.  Since fences give ordering relative to memory,
>> and timecounters don't use anything except fences and memory order for
>> the generation count to establish their order, the synchronization for
>> comparing timecounters (or clock_gettime() at higher levels) must also
>> use memory order.
>>
>> If the synchronization takes over 100 cycles, then smaller TSC skews
>> don't matter much (they never break monotonicity, and only show up as
>> time differences varying by 100 or so cycles depending on which CPU
>> measures the start and end events).  Small differences don't matter at
>> all.  Skews may be caused
> It should be more than 100 cycles for inter-socket IPC, but then genuine
> RDTSC skew can accumulate to much more than 100, which is my worry.

If it can accumulate at all, then it will soon accumulate to a huge
value.  1 part per billion of drift is 86400*4 cycles/day at 4GHz.  Since
we haven't seen this, the hardware must be doing something right (or we
don't have the large hardware that has problems).

>> by the TSCs actually being out of sync, or hardware only syncing them
>> on average (hopefully with small jitter), or bugs like missing fences.
>> Missing fences don't matter much provided unserialized TSC reads aren't
>> too far in the past.  E.g., if we had a guarantee of only 10 cycles in
>> the past for the TSC and 160 cycles for IPCs to other CPUs, then we
>> could omit the fences.  But IPCs to the same core are 100 cycles
>> faster, so the margin is too close for omitting fences in all cases.
> I have no idea if RDTSC is allowed to execute before (in program order)
> an earlier cache miss.  If it is, then non-fenced RDTSC would easily
> give a much larger error than the IPC sync delay.

Yes, the fences are needed in general.  But can we avoid them in the
usual case where the program doesn't do any explicit IPCs?

clock_gettime() and kernel time calls must be monotonic within a single
thread.  Hopefully this is automatic when the thread stays on a single
CPU (this only requires rdtsc() to be monotonic; this is not as accurate
as possible, but the inaccuracies are no worse than the ones for delays
from cache misses).

Time-related IPCs are needed when the thread is moved to a different CPU.
Schedulers don't understand this, but I think they do enough locking and
delays to give the same effect.  They could do 1 fence instruction per
context switch to serialize TSCs more intentionally.  This is probably
faster than 1 fence instruction per timecounter call.

Applications that want to compare times across threads should do the IPCs
explicitly.  This should be in a library function, and the library
function can sprinkle fences too.
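A minimal sketch of the kind of library helper meant here (my own names
and API, nothing that exists today): thread A publishes its timestamp
with release semantics, thread B picks it up with acquire semantics
before taking its own, so the two clock_gettime() calls are ordered the
same way as the memory operations and the comparison is meaningful:

#include <stdatomic.h>
#include <time.h>

static struct timespec published_ts;
static _Atomic int published;

/* Thread A: take a timestamp and publish it. */
void
timestamp_publish(void)
{

	clock_gettime(CLOCK_MONOTONIC, &published_ts);
	atomic_store_explicit(&published, 1, memory_order_release);
}

/*
 * Thread B: returns -1 until A has published, else fills *later with a
 * timestamp that is ordered after A's, so *later >= published_ts is a
 * meaningful comparison.
 */
int
timestamp_after(struct timespec *later)
{

	if (!atomic_load_explicit(&published, memory_order_acquire))
		return (-1);
	clock_gettime(CLOCK_MONOTONIC, later);
	return (0);
}

Of course the guarantee only holds if the timecounter read inside
clock_gettime() is itself ordered relative to the surrounding memory
operations, which is what the fences discussed above (or a fence per
context switch) are for.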
Bruce