Date:      Sun, 24 Mar 2019 05:05:44 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Mark Johnston <markj@freebsd.org>, rgrimes@freebsd.org,  src-committers@freebsd.org, svn-src-all@freebsd.org,  svn-src-head@freebsd.org
Subject:   Re: svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386
Message-ID:  <20190324035947.D3250@besplex.bde.org>
In-Reply-To: <20190323013628.U2614@besplex.bde.org>
References:  <201903210252.x2L2qMSP022374@repo.freebsd.org> <201903210320.x2L3KIjc058579@gndrsh.dnsmgr.net> <20190321034917.GA8186@raichu> <20190321180738.P1024@besplex.bde.org> <20190321193237.GF8186@raichu> <20190323013628.U2614@besplex.bde.org>

On Sat, 23 Mar 2019, Bruce Evans wrote:

I now have an almost perfect skew measuring program.  It has a variance of
about 2 cycles.

Output on i4790k (only for CPU 0 vs CPU [1-7], since nothing different
happens for the general CPU i vs CPU j comparison):

XX            (min/avg/max/sd)
XX |   CPUs  |      latij     |      latji     |       skew      |   nr   |
XX |---------+----------------+----------------+-----------------+--------|
XX  0 <->  1   128/145/152/ 3   132/145/156/ 3    -12/ -0/  6/ 2      239
XX  0 <->  2   216/222/224/ 2   216/218/220/ 2     -2/  2/  4/ 1      239
XX  0 <->  3   216/222/224/ 2   216/218/220/ 2     -2/  2/  4/ 1      239
XX  0 <->  4   220/224/228/ 4   220/224/228/ 4     -4/  0/  4/ 3      239
XX  0 <->  5   224/225/228/ 2   220/223/228/ 3     -2/  1/  2/ 2      239
XX  0 <->  6   220/226/232/ 2   224/227/232/ 2     -4/ -1/  2/ 1      239
XX  0 <->  7   224/226/228/ 2   224/227/228/ 2     -2/ -1/  0/ 1      239

The main bug in the previous version was that it measured mainly the
variance of IPC latency due to cache misses.  Cache misses apparently
sometimes cost 1400 cycles on i4790k.  This was shown by a max vs min
difference of about 1400 cycles for latij and latji (the min was about
200 as above, but the max was about 1600).  In the above, the difference
between the max and the min for latij is at most 24.

The larger ranges (though no larger variances) for the 0 <-> 1 case are
probably not accidental.  This case shares a core.  Masking interrupts
should prevent HTT contention in the other cases.

This depends on disabling interrupts or the system being fairly idle.
With interrupts enabled and the system not idle, on freefall's Xeon:

XX |   CPUs  |      latij     |      latji     |       skew      |   nr   |
XX |---------+----------------+----------------+-----------------+--------|
XX  0 <->  1   422/482/514/15   423/480/502/12    -39/  1/ 20/10      239
XX  0 <->  2   679/691/1340/43   732/744/746/ 5    -33/-27/304/22      238
XX  0 <->  3   679/688/707/ 8   732/744/746/ 5    -33/-28/-12/ 6      239
XX  0 <->  4   713/737/753/ 5   726/726/726/ 0     -6/  5/ 13/ 3      239
XX  0 <->  5   726/738/964/15   726/726/726/ 0      0/  6/119/ 8      238
XX  0 <->  6   686/695/713/13   765/768/778/ 5    -45/-36/-26/ 7      239
XX  0 <->  7   686/696/713/12   765/768/779/ 5    -46/-36/-26/ 7      239

I just noticed that the minimum latencies exploded from 180-220 in previous
versions to 679-726 here.  Another run gave the old latencies:

XX |   CPUs  |      latij     |      latji     |       skew      |   nr   |
XX |---------+----------------+----------------+-----------------+--------|
XX  0 <->  1   114/133/150/ 4   120/133/147/ 2     -9/ -0/  9/ 2      239
XX  0 <->  2   216/220/222/ 3   213/216/225/ 3     -4/  2/  4/ 3      239
XX  0 <->  3   216/220/222/ 3   213/216/219/ 3     -1/  2/  4/ 3      239
XX  0 <->  4   204/224/234/12   201/220/231/13    -13/  2/ 16/11      239
XX  0 <->  5   204/224/234/12   201/220/231/13    -13/  2/ 16/11      239
XX  0 <->  6   219/232/246/13   228/229/231/ 1     -6/  1/  9/ 7      239
XX  0 <->  7   219/232/246/13   228/230/231/ 1     -6/  1/  9/ 7      239

The skews of -27 through -36 in the first run are probably real.  I think
rdtsc uses a common 100 MHz clock.  Differences of 1 cycle in that give
differences of 33 cycles for rdtsc.  Apparently the firmware doesn't
do a good job of hiding this granularity.  The error of 1 cycle at 100
MHz persists for many TSC cycles (for more like 4000 samples than the 239
used here), and across most CPU pairs in a run.

i4790K shows the same behaviour of skews of 1 cycle at 100 MHz, but less
often (once in every 10-20 runs):

XX |   CPUs  |      latij     |      latji     |       skew      |   nr   |
XX |---------+----------------+----------------+-----------------+--------|
XX  0 <->  1   116/146/156/ 5   128/145/152/ 4    -16/  1/ 12/ 3      239
XX  0 <->  2   172/182/280/ 7   256/261/596/24   -206/-39/  8/12      237
XX  0 <->  3   176/182/184/ 3   256/258/260/ 2    -42/-38/-36/ 2      239

The TSC frequency is 4.08 GHz, so an error of 1 cycle at 100 MHz is 40.8
rdtsc cycles.

XX  0 <->  4   220/223/228/ 2   220/225/232/ 4     -4/ -1/  4/ 2      239
XX  0 <->  5   220/223/228/ 3   224/225/228/ 2     -2/ -1/  2/ 1      239
XX  0 <->  6   220/226/228/ 2   224/227/228/ 2     -4/ -1/  2/ 1      239
XX  0 <->  7   220/226/232/ 2   224/227/232/ 2     -4/ -1/  2/ 1      239

Now full output to show many more skews near 41:

XX  1 <->  0   140/145/152/ 2   132/148/156/ 3     -4/ -2/  8/ 2      239
XX  1 <->  2   172/185/228/ 9   256/261/552/19   -184/-38/-16/11      239
XX  1 <->  3   180/182/184/ 2   256/258/260/ 2    -40/-38/-36/ 2      239
XX  1 <->  4   220/223/224/ 2   220/225/232/ 4     -4/ -1/  2/ 2      239
XX  1 <->  5   220/223/228/ 2   220/225/232/ 4     -4/ -1/  4/ 2      239
XX  1 <->  6   220/226/232/ 3   224/227/232/ 2     -4/ -1/  2/ 1      239
XX  1 <->  7   220/226/232/ 3   224/228/232/ 2     -6/ -1/  4/ 2      239
XX  2 <->  0   256/258/260/ 2   180/182/184/ 2     36/ 38/ 40/ 1      239
XX  2 <->  1   256/258/260/ 2   176/182/184/ 2     36/ 38/ 42/ 2      239
XX  2 <->  3   148/149/152/ 2   140/144/148/ 3      0/  3/  6/ 2      239
XX  2 <->  4   244/253/256/ 4   168/173/180/ 2     34/ 40/ 42/ 2      239
XX  2 <->  5   244/252/256/ 4   172/173/176/ 2     36/ 40/ 42/ 2      239
XX  2 <->  6   248/254/260/ 3   172/176/180/ 2     36/ 39/ 42/ 2      239
XX  2 <->  7   244/254/260/ 4   172/177/180/ 3     32/ 38/ 42/ 3      239
XX  3 <->  0   256/258/260/ 2   176/182/184/ 2     36/ 38/ 42/ 2      239
XX  3 <->  1   256/258/260/ 2   180/182/184/ 2     36/ 38/ 40/ 1      239
XX  3 <->  2   140/145/152/ 3   144/148/152/ 2     -4/ -2/  4/ 2      239
XX  3 <->  4   244/253/256/ 4   168/173/180/ 2     34/ 40/ 42/ 2      239
XX  3 <->  5   244/253/256/ 4   168/173/180/ 2     34/ 40/ 42/ 2      239
XX  3 <->  6   248/254/260/ 3   172/176/180/ 2     36/ 39/ 42/ 2      239
XX  3 <->  7   248/254/260/ 3   172/176/180/ 2     36/ 39/ 42/ 2      239
XX  4 <->  0   220/224/228/ 2   220/224/228/ 2     -2/ -0/  2/ 1      239
XX  4 <->  1   220/225/228/ 3   220/223/228/ 3     -2/  1/  4/ 2      239
XX  4 <->  2   168/173/180/ 2   244/253/256/ 4    -44/-40/-34/ 2      239
XX  4 <->  3   168/173/180/ 2   244/253/256/ 4    -44/-40/-34/ 2      239
XX  4 <->  5   128/145/152/ 3   140/145/152/ 2     -8/ -0/  6/ 2      239
XX  4 <->  6   212/218/220/ 3   212/218/220/ 3     -4/  0/  4/ 3      239
XX  4 <->  7   212/220/224/ 3   208/216/220/ 3     -4/  2/  4/ 2      239
XX  5 <->  0   220/225/228/ 3   220/223/228/ 3     -4/  1/  4/ 3      239
XX  5 <->  1   220/224/228/ 2   220/224/228/ 2     -2/ -0/  2/ 1      239
XX  5 <->  2   168/173/180/ 2   244/253/256/ 4    -44/-40/-34/ 2      239
XX  5 <->  3   168/173/180/ 2   244/253/256/ 4    -44/-40/-34/ 2      239
XX  5 <->  4   144/146/148/ 2   144/151/156/ 4     -6/ -3/  2/ 2      239
XX  5 <->  6   212/220/224/ 3   208/216/220/ 3     -4/  2/  4/ 2      239
XX  5 <->  7   212/218/220/ 3   212/218/220/ 3     -4/  0/  4/ 3      239
XX  6 <->  0   224/227/232/ 2   220/226/232/ 2     -2/  1/  6/ 2      239
XX  6 <->  1   224/227/232/ 2   220/226/232/ 2     -2/  1/  6/ 2      239
XX  6 <->  2   172/177/180/ 3   244/254/260/ 4    -42/-38/-32/ 3      239
XX  6 <->  3   172/177/180/ 3   244/254/260/ 4    -42/-38/-32/ 3      239
XX  6 <->  4   208/216/220/ 3   216/220/224/ 2     -4/ -2/  2/ 2      239
XX  6 <->  5   212/218/220/ 3   212/218/220/ 3     -4/ -0/  4/ 2      239
XX  6 <->  7   128/147/152/ 4   128/145/156/ 4    -10/  1/  6/ 3      239
XX  7 <->  0   224/227/232/ 2   220/226/228/ 2     -2/  1/  6/ 2      239
XX  7 <->  1   224/227/232/ 2   220/226/228/ 2     -2/  1/  6/ 2      239
XX  7 <->  2   172/176/180/ 2   248/254/260/ 3    -42/-39/-34/ 2      239
XX  7 <->  3   172/177/180/ 3   244/254/260/ 4    -42/-38/-32/ 3      239
XX  7 <->  4   208/216/220/ 3   216/220/224/ 2     -4/ -2/  2/ 2      239
XX  7 <->  5   212/218/220/ 3   212/218/220/ 3     -4/ -0/  4/ 2      239
XX  7 <->  6   132/145/152/ 3   116/147/152/ 4     -8/ -1/ 14/ 2      239

The error of ~41 cycles showed up in 24 out of 56 independent samples
(independent except for being sequential, but spread over at least 100k
TSC cycles).  The variance for all these ~41-cycle skews is the same 2 or
3 as for the small skews, so I think the large skews are not measurement
errors.

This output was produced by:

XX /* pthread and cpu management parts copied from tscdrift.c. */
XX 
XX /*-
XX  * Copyright (c) 2014 Advanced Computing Technologies LLC
XX  * Written by: John H. Baldwin <jhb@FreeBSD.org>
XX  * All rights reserved.
XX  *
XX  * Redistribution and use in source and binary forms, with or without
XX  * modification, are permitted provided that the following conditions
XX  * are met:
XX  * 1. Redistributions of source code must retain the above copyright
XX  *    notice, this list of conditions and the following disclaimer.
XX  * 2. Redistributions in binary form must reproduce the above copyright
XX  *    notice, this list of conditions and the following disclaimer in the
XX  *    documentation and/or other materials provided with the distribution.
XX  *
XX  * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
XX  * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
XX  * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
XX  * ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
XX  * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
XX  * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
XX  * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
XX  * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
XX  * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
XX  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
XX  * SUCH DAMAGE.
XX  */
XX 
XX #include <sys/param.h>
XX #include <sys/cpuset.h>
XX 
XX #include <machine/atomic.h>
XX #include <machine/cpufunc.h>
XX 
XX #include <assert.h>
XX #include <err.h>
XX #include <fcntl.h>
XX #include <inttypes.h>
XX #include <math.h>
XX #include <pthread.h>
XX #include <stdio.h>
XX #include <stdlib.h>
XX #include <string.h>
XX 
XX /* LOGSIZE should be small enough for everything to fit in L1. */
XX #define LOGSIZE	0x100
XX 
XX static int ncpu;
XX /*
XX  * The separate log buffers are to reduce cache contention.  This is not
XX  * so important with everything in L1.
XX  */
XX static uint64_t tsc_log_master[LOGSIZE] __aligned(PAGE_SIZE);
XX static uint64_t tsc_log_slave[LOGSIZE] __aligned(PAGE_SIZE);
XX static volatile u_int tsc_loghead;

The main fixes are here.  By keeping LOGSIZE well below 4K entries (of 8
bytes each), the working set is smaller than 32K so fits in the L1 data
cache.

Alignment to PAGE_SIZE and separate log buffers are refinements of this.

There are a few more smaller refinements and some bug fixes.

XX 
XX /* Bind the current thread to the specified CPU. */
XX static void
XX bind_cpu(int cpu)
XX {
XX 	cpuset_t set;
XX 
XX 	CPU_ZERO(&set);
XX 	CPU_SET(cpu, &set);
XX 	if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(set),
XX 	    &set) < 0)
XX 		err(1, "cpuset_setaffinity(%d)", cpu);
XX }
XX 
XX static void *
XX action(void *arg)
XX {
XX 	uint64_t *logp;
XX 	int fd, head, master;
XX 
XX 	master = (intptr_t)arg;
XX 	logp = master ? tsc_log_master : tsc_log_slave;
XX 
XX 	/* Attempt to fault in everything before disabling interrupts. */
XX 	memset(logp, 0, sizeof(tsc_log_master));
XX 
XX 	/*
XX 	 * Wait for the master to bind to its (necessarily different) CPU
XX 	 * before disabling interrupts.
XX 	 */
XX 	while (atomic_load_acq_int(&tsc_loghead) == -1)
XX 		;
XX 
XX 	/* If possible, run the main loop with interrupts disabled. */
XX 	if ((fd = open("/dev/io", O_RDONLY)) != -1)
XX 		disable_intr();
XX 
XX 	/* Attempt to cache everything again after possible preemption. */
XX 	memset(logp, 0, sizeof(tsc_log_master));

The last line is a refinement.

XX 
XX 	/*
XX 	 * (head % 2) selects the log buffer (0 = master) and (head / 2) is
XX 	 * the index in this buffer.
XX 	 */
XX 	for (head = master ? 0 : 1; head < 2 * LOGSIZE; head += 2) {
XX 		while (atomic_load_acq_int(&tsc_loghead) != head)
XX 			;
XX 		__asm __volatile("lfence");
XX 		logp[head / 2] = rdtsc();
XX 		__asm __volatile("sfence");	/* This reduces variance. */

This reduces the variance from about 4 cycles to 2.

XX 		atomic_store_rel_int(&tsc_loghead, head + 1);
XX 	}

The serialization takes longer than I like -- over 200 cycles and it
doesn't even have a single cache miss for writing to the log buffer.
The fences don't add much to the latency.

XX 
XX 	if (fd != -1) {
XX 		enable_intr();
XX 		close(fd);
XX 	}
XX 	return (NULL);
XX }
XX 
XX int
XX main(int argc, char **argv)
XX {
XX 	cpuset_t all_cpus;
XX 	pthread_t *children;
XX 	double avlatij, avlatji, avskew, sdlatij, sdlatji, sdskew;
XX 	double sumlatijsq, sumlatjisq, sumskewsq;
XX 	int *cpus;
XX 	intmax_t latij, latji, maxlatij, maxlatji, minlatij, minlatji;
XX 	intmax_t sumlatij, sumlatji;
XX 	intmax_t maxskew, minskew, skew;
XX 	int fd, error, i, j, nr, r;
XX 
XX 	/*
XX 	 * Find all the CPUs this program is eligible to run on and use
XX 	 * this as our global set.  This means you can use cpuset to
XX 	 * restrict this program to only run on a subset of CPUs.
XX 	 */
XX 	if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
XX 	    sizeof(all_cpus), &all_cpus) < 0)
XX 		err(1, "cpuset_getaffinity");
XX 	for (ncpu = 0, i = 0; i < CPU_SETSIZE; i++)
XX 		if (CPU_ISSET(i, &all_cpus))
XX 			ncpu++;
XX 	cpus = calloc(ncpu, sizeof(*cpus));
XX 	children = calloc(ncpu, sizeof(*children));
XX 	for (i = 0, j = 0; i < CPU_SETSIZE; i++)
XX 		if (CPU_ISSET(i, &all_cpus)) {
XX 			assert(j < ncpu);
XX 			cpus[j] = i;
XX 			j++;
XX 		}
XX 	if ((fd = open("/dev/io", O_RDONLY)) != -1)
XX 		close(fd);
XX 
XX 	printf(
XX "|   CPUs  |      latij     |      latji     |       skew      |   nr   |\n");
XX 	printf(
XX "|---------+----------------+----------------+-----------------+--------|\n");
XX 	for (i = 0; i < ncpu; i++) {
XX 		for (j = 0; j < ncpu; j++) {
XX 			if (j == i)
XX 				continue;
XX 			bind_cpu(cpus[j]);
XX 			atomic_store_rel_int(&tsc_loghead, -1);
XX 			error = pthread_create(&children[j], NULL, action,
XX 			    (void *)(intptr_t)0);
XX 			if (error)
XX 				errc(1, error, "pthread_create");
XX 			bind_cpu(cpus[i]);
XX 			atomic_store_rel_int(&tsc_loghead, 0);
XX 			action((void *)(intptr_t)1);
XX 			error = pthread_join(children[j], NULL);
XX 			if (error)
XX 				errc(1, error, "pthread_join");
XX 
XX 			nr = 0;
XX 			maxlatij = maxlatji = maxskew = INTMAX_MIN;
XX 			minlatij = minlatji = minskew = INTMAX_MAX;
XX 			sumlatij = sumlatji = 0;
XX 			sumlatijsq = sumlatjisq = sumskewsq = 0;
XX 			/* Skip early entries to let caches warm up. */

The r == 0 entries also have high variance, from the initial
synchronization.

XX 			for (r = 16; r < LOGSIZE - 1; r++) {
XX 				latij = tsc_log_slave[r] - tsc_log_master[r];
XX 				latji = tsc_log_master[r + 1] -
XX 				    tsc_log_slave[r];

I tried discarding samples lying on cache line boundaries.  This helps for
buffers too large to fit in L1, but doesn't give such a low variance as
with no cache misses at all.

XX 				/*
XX 				 * XXX: discard samples that appear to have
XX 				 * been preempted.
XX 				 */
XX 				if (fd == -1 && (latij > 1999 || latji > 1999))
XX 					continue;

The distribution of latencies on freefall is usually many samples at
about 200 cycles, then many below 1000 cycles for cache misses, then
nothing between 1000 and 3000 cycles, then a few for context switches
going up to not much more than 30000 cycles.  So a cutoff above 1999
usually behaves the same as one above 999, and the 3-column
formatting usually works.

XX 				if (maxlatij < latij)
XX 					maxlatij = latij;
XX 				if (minlatij > latij)
XX 					minlatij = latij;
XX 				if (maxlatji < latji)
XX 					maxlatji = latji;
XX 				if (minlatji > latji)
XX 					minlatji = latji;
XX 				sumlatij += latij;
XX 				sumlatji += latji;
XX 				sumlatijsq += latij * latij;
XX 				sumlatjisq += latji * latji;
XX 				skew = latij - latji;
XX 				if (maxskew < skew)
XX 					maxskew = skew;
XX 				if (minskew > skew)
XX 					minskew = skew;
XX 				sumskewsq += skew * skew;
XX 				nr++;
XX 			}
XX 			avlatij = (double)sumlatij / nr;
XX 			sdlatij = sqrt(sumlatijsq / nr - avlatij * avlatij);
XX 			avlatji = (double)sumlatji / nr;
XX 			sdlatji = sqrt(sumlatjisq / nr - avlatji * avlatji);
XX 			avskew = avlatij - avlatji;
XX 			sdskew = sqrt(sumskewsq / nr - avskew * avskew);
XX 			printf(
XX "%2d <-> %2d   %3jd/%3.0f/%3jd/%2.0f   %3jd/%3.0f/%3jd/%2.0f   %4jd/%3.0f/%3jd/%2.0f   %6d\n",
XX 			    cpus[i], cpus[j],
XX 			    minlatij, avlatij, maxlatij, sdlatij,
XX 			    minlatji, avlatji, maxlatji, sdlatji,
XX 			    minskew / 2, avskew / 2, maxskew / 2, sdskew / 2,
XX 			    nr);
XX 		}
XX 	}
XX }

Bruce


