Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 13 Mar 2019 21:05:58 +0200
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Mark Millard <marklmi@yahoo.com>, freebsd-hackers Hackers <freebsd-hackers@freebsd.org>, FreeBSD PowerPC ML <freebsd-ppc@freebsd.org>
Subject:   Re: TSC "skew" (was: Re: powerpc64 head -r344018 stuck sleeping problems: th->th_scale * tc_delta(th) overflows unsigned 64 bits sometimes [patched failed])
Message-ID:  <20190313190558.GB2492@kib.kiev.ua>
In-Reply-To: <20190305223415.U1563@besplex.bde.org>
References:  <20190302225513.W3408@besplex.bde.org> <20190302142521.GE68879@kib.kiev.ua> <20190303041441.V4781@besplex.bde.org> <20190303111931.GI68879@kib.kiev.ua> <20190303223100.B3572@besplex.bde.org> <20190303161635.GJ68879@kib.kiev.ua> <20190304043416.V5640@besplex.bde.org> <20190304114150.GM68879@kib.kiev.ua> <20190305031010.I4610@besplex.bde.org> <20190305223415.U1563@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Mar 06, 2019 at 12:19:38AM +1100, Bruce Evans wrote:
> On Tue, 5 Mar 2019, Bruce Evans wrote:
> 
> > On Mon, 4 Mar 2019, Konstantin Belousov wrote:
> 
> >* [... shift for bogus TSC-low timecounter]
> >> I suspect that the shift of 1 (at least) hides cross-socket inaccuracy.
> >> Otherwise, I think, some multi-socket machines would start showing the
> >> detectable backward-counting bintime().  At the frequencies at 4GHz and
> >> above (Intel has 5Ghz part numbers) I do not think that stability of
> >> 100MHz crystall and on-board traces is enough to avoid that.
> >
> > I think it is just a kludge that reduced the problem before it was fixed
> > properly using fences.
> >
> > Cross-socket latency is over 100 cycles according to jhb's tscskew
> > benchmark: on Haswell 4x2:
> >
> > CPU | TSC skew (min/avg/max/stddev)
> > ----+------------------------------
> >  0 |     0     0     0    0.000
> >  1 |    24    49    84   14.353
> >  2 |   164   243   308   47.811
> >  3 |   164   238   312   47.242
> >  4 |   168   242   332   49.593
> >  5 |   168   243   324   48.722
> >  6 |   172   242   320   52.596
> >  7 |   172   240   316   53.014
> >
> > freefall is similar.  Latency is apparently measured relative to CPU 0.
> > It is much lower to CPU 1 since that is on the same core.
> >
> > I played with this program a lot 3 and a half years ago, but forgot
> > mist of what I learned :-(.  I tried different fencing in it.  This
> > seems to make little difference when the program is rerun.  With the
> > default TESTS = 1024, the min skew sometimes goes negative on freefall,
> > but with TESTS = 1024000 that doesn't happen.  This is the opposite
> > of what I would expect.  freefall has load average about 1.
> 
> I understand this program again.  First, its name is actually tscdrift.
> I tested the 2015 version, and this version is still in
> /usr/src/tools/tools/tscdrift/tscdrift.c, with no changes to except to
> the copyright (rgrimes wouldn't like this) and to $FreeBSD$.
> 
> The program doesn't actually measure either TSC drift or TSC skew, except
> indirectly.  What it actually measures is the IPC (Inter-Process-
> Communication) time for synchronizing the drift and skew measurments,
> except bugs or intentional sloppiness in its synchronization also make it
> give an indirect measurement of similar bugs or sloppiness in normal use.
> 
> After changing TESTS from 1024 to 1024000, it shows large errors in the
> negative direction, as expected from either large negative skew or program
> bugs: this is on freefall:
> 
> XX CPU | TSC skew (min/avg/max/stddev)
> XX ----+------------------------------
> XX   0 |     0     0     0    0.000
> XX   1 | -6148   108 10232   46.871
> XX   2 |   114   209 95676   163.359
> XX   3 |    96   202 47835   101.250
> XX   4 | -2223   207 34017   117.257
> XX   5 | -2349   206 33837   106.259
> XX   6 | -2664   213 33579   96.048
> XX   7 | -2451   212 49242   126.428
Note that freefall is single-socket.  My belief is that due to the
construction of the RDTSC on Intels, it is impossible for the counter
to become scew on single socket.  All cores are fed from the same
input signal, and most likely, even read the same uncore counter.
The later is less likely because RDTSC latency is quite low, but there
might be additional hw tricks.

On the other hand, for multi-socket machines, I do not think there is
anything except the external clock signal which would ensure that the
counters stay in sync.

I tried to imagine is there is any shared hardware on multi-socket Intel
system which would give equal latency for accesses from different sockets,
and it seems that there is no such hardware.  Then it is trully impossible
to observe the scew.

It might be possible to measure round-trip time separately, and then
subtract it from the measured scew.

> 
> The negative "skews" occur because the server and the clients (1 client at
> a time) read the TSC with uncontrolled timing after the server opens the
> gate for this read (gate = 2).  The IPC time is about 200 cycles to CPUs
> on different cores.  So when neither thread is preempted, the TSC on the
> server is about 200 cycles in advance.  Sometimes the server is preempted,
> so it reads its TSC later than the client (a maximum of about 6148 cycles
> later in this test).  More often the client is preempted, since the IPC
> time is march larger than the time between the server opening the gate and
> the server reading its TSC.
> 
> The server is also missing fencing for its TSC read, so this read may
> appear to occur several cycles before opening the gate.  This gives a
> an error in the positive direction for the reported "skew" (the error
> is actually in the positive direction for the reported IPC time).  It
> would be useful to measure this error by intentionally omitting fencing,
> but currently it is just a small amount of noise on top of the noise from
> preemption.
> 
> After fixing the syncronization:
> 
> XX CPU | TSC skew (min/avg/max/stddev)
> XX ----+------------------------------
> XX   0 |     0     0     0    0.000
> XX   1 |    33    62 49161   57.652
> XX   2 |   108   169 33678   73.456
> XX   3 |   108   171 43053   119.256
> XX   4 |   141   169 41289   114.567
> XX   5 |   141   169 40035   112.755
> XX   6 |   132   186 147099   269.449
> XX   7 |   153   183 431526   436.689
> 
> Synchronization apparenly takes a long time, especially to other cores.
> The minimum and avergae now gives the best-case IPC time very accurately.
> The average is 20-30 cycles smaller than before, probably because I
> fixed the fencing.  The maximum and standard deviation are garbage noise
> from preemption.  Preemption should be disabled somehow.
> 
> Large drifts and skews would show up as nonsense values for the minimum
> IPC time.  Small drifts would soon give large skews.  To measure small
> skews, change the CPU of the server to measure the minimum IPC time in
> the opposite direction.
> 
> Fixes:
> 
> XX --- tscdrift.c	2015-07-10 06:22:36.505493000 +0000
> XX +++ w.c	2019-03-05 11:22:22.232341000 +0000
> XX @@ -32,6 +32,15 @@
> XX  #include <sys/param.h>
> XX  #include <sys/cpuset.h>
> XX  #include <machine/atomic.h>
> XX +/*
> XX + * XXX: atomic.h is not used.  Instead we depend on x86 memory ordering and
> XX + * do direct assignments to and comparisons of 'gate', and sometimes add
> XX + * memory barriers.  The correct atomic ops would do much the same with
> XX + * clearer spelling.  The 'lock' prefix is never needed and the barriers are
> XX + * only to get program order so as to give acq or rel semantics for ether
> XX + * the loads, the stores or for buggy unfenced rdtsc's.  Fences also give
> XX + * program order, so some of the explicit barriers are redundant.
> XX + */
> XX  #include <machine/cpu.h>
> XX  #include <machine/cpufunc.h>
> XX  #include <assert.h>
> XX @@ -45,7 +54,7 @@
> XX 
> XX  #define	barrier()	__asm __volatile("" ::: "memory")
> XX 
> XX -#define	TESTS		1024
> XX +#define	TESTS		1024000
> XX 
> XX  static volatile int gate;
> XX  static volatile uint64_t thread_tsc;
> XX @@ -74,12 +83,12 @@
> XX  		gate = 1;
> XX  		while (gate == 1)
> XX  			cpu_spinwait();
> XX -		barrier();
> XX 
> XX +		barrier();
> XX  		__asm __volatile("lfence");
> XX  		thread_tsc = rdtsc();
> XX -
> XX  		barrier();
> XX +
> XX  		gate = 3;
> XX  		while (gate == 3)
> XX  			cpu_spinwait();
> 
> This is the client.  The explicit barriers are confusing, and the blank
> lines are in all the wrong places.  All the accesses to 'gate' need
> to be in program order.  x86 memory ordering gives this automatically
> at the hardware level.  'gate' being volatile gives it at the compiler
> level.  Both rdtsc() and storing the result to thread_tsc need to be
> in program order.  lfence() in cpufunc.h has a memory clobber which
> gives the former, but we use a direct asm and need a barrier() before
> it to do the same thing.  Then we need another barrier() after the
> assignment to thread_tsc so that the store for this is before the store
> to 'gate' (I think gate being volatile doesn't give this).  This also
> keeps the rdtsc() in program order (the asm for rdtsc() doesn't have
> a memory clobber.  I haven't noticed care about this being taken
> anywhere else).
> 
> Summary: only style changes in this section.
> 
> XX @@ -139,12 +148,13 @@
> XX  		for (j = 0; j < TESTS; j++) {
> XX  			while (gate != 1)
> XX  				cpu_spinwait();
> XX -			gate = 2;
> XX -			barrier();
> 
> Move down opening the gate so that it not opened until after reading the
> TSC on the server.
> 
> XX 
> XX +			barrier();
> XX +			__asm __volatile("lfence");
> 
> Fencing is not critical here.  Using an early TSC value just gives a larger
> reported IPC time.  The barrier is important for getting program order of
> rdtsc().
> 
> XX  			tsc = rdtsc();
> XX -
> XX  			barrier();
> 
> This barrier is still associated with the TSC read, and the blank like is
> moved to reflect this.  Here rdtsc() must occur in program order, but
> storing to tsc can be after storing to 'gate'.  The barrier gives ordering
> for the store to tsc too.
> 
> XX +
> XX +			gate = 2;
> XX  			while (gate != 3)
> XX  				cpu_spinwait();
> XX  			gate = 4;
> 
> I tried some locked atomic ops on 'gate') and mfence instead of lfence
> to try to speed up the IPC.  Nothing helped.  We noticed long ago that
> fence instructions tend to be even slower that locked atomic ops for
> mutexes, and jhb guessed that this might be because fence instructions
> don't do so much to force out stores.
I am not sure I follow. MFENCE is documented by wording that implies,
without any doubts, that store buffers are flushed before the
instruction is retired.  It is not so obvious for SFENCE, which
sounds like a real fence instead of full flush, at least for normal
write-back memory where it is NOP as far as ISA is considered.

It is known and documented in optimization manuals that locked
operations are much more efficient, but locked ops are only documented
to ensure ordering and not flush.  So SFENCE is not suitable as our
barrier.

And, the second point, LFENCE there does not work as barrier for IPC.
It only ensures that RDTSC is not started earlier than the previous
instructions.  No store buffer flushing is done.


> 
> Similar IPC is needed for updating timecounters.  This benchmark indicates
> that after an update, the change usually won't be visible on other CPUs
> for 100+ cycles.  Since updates are rare, this isn't much of a problem.
> 
> Similar IPC is needed for comparing timecounters across CPUs.  Any activity
> on different CPUs is incomparable without synchronization to establish an
> ordering.  Since fences give ordering relative to memory and timecounters
> don't use anything except fences and memory order for the generation count
> to establish their order, the synchronization for comparing timecounters
> (or clock_gettime() at higher levels) must also use memory order.
> 
> If the synchronization takes over 100 cycles, then smaller TSC skews don't
> matter much (they never break monotonicity, and only show up time differences
> varying by 100 or so cycles depending on which CPU measures the start and
> end events).  Small differences don't matter at all.  Skews may be caused
It should be more than 100 cycles for inter-socket IPC, but then genuine
RDTSC scew can accumulate much larger than 100, which is my worry.

> by the TSCs actually being out of sync, or hardware only syncing them on
> average (hopefully with small jitter) or bugs like missing fences.  Missing
> fences don't matter much provided unserialized TSC reads aren't too far
> in the past.  E.g., if we had a guarantee of only 10 cycles in the past for
> the TSC and 160 cycles for IPCs to other CPUs, then we could omit the fences.
> But IPCs to the same core are 100 cycles faster so the margin is too close
> for ommitting fences in all cases.
I have no idea if RDTSC is allowed to execute before (in program order)
earlier cache miss.  If it is, then non-fenced RDTSC would easily give
much larger error than IPC sync delay.

> 
> Similarly for imperfect hardware.  Hopefully its skew is in the +-1 cycle
> range, but even +-10 isn't a problem if the IPC time is a bit larger than
> 10 and even +-100 if the IPC time is a bit larger than 100.  And the problem
> scales nicely with the distance of the CPUs -- when they are further apart
> so that hardware synchronization of their TSCs is more difficult, the IPC
> time is large too.
> 
> Hmm, that is only with physical IPCs.  Since timecounters use physical
> IPCs for everything, they can't work right with virtual synchronization.
> Something like ntpd is needed to compare times across even small local
> networks.  It does virtual synchronization by compensating for delays.
> 
> Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190313190558.GB2492>