Date: Sun, 24 Mar 2019 05:05:44 +1100 (EST) From: Bruce Evans <brde@optusnet.com.au> To: Bruce Evans <brde@optusnet.com.au> Cc: Mark Johnston <markj@freebsd.org>, rgrimes@freebsd.org, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386 Message-ID: <20190324035947.D3250@besplex.bde.org> In-Reply-To: <20190323013628.U2614@besplex.bde.org> References: <201903210252.x2L2qMSP022374@repo.freebsd.org> <201903210320.x2L3KIjc058579@gndrsh.dnsmgr.net> <20190321034917.GA8186@raichu> <20190321180738.P1024@besplex.bde.org> <20190321193237.GF8186@raichu> <20190323013628.U2614@besplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Sat, 23 Mar 2019, Bruce Evans wrote: I now have an almost perfect skew measuring program. It has a variance of about 2 cycles. Output on i4790k (only for CPU 0 vs CPU [1-7], since nothing different happens for the general CPU i vs CPU j comparison): XX (min/avg/max/sd) XX | CPUs | latij | latji | skew | nr | XX |---------+----------------+----------------+-----------------+--------| XX 0 <-> 1 128/145/152/ 3 132/145/156/ 3 -12/ -0/ 6/ 2 239 XX 0 <-> 2 216/222/224/ 2 216/218/220/ 2 -2/ 2/ 4/ 1 239 XX 0 <-> 3 216/222/224/ 2 216/218/220/ 2 -2/ 2/ 4/ 1 239 XX 0 <-> 4 220/224/228/ 4 220/224/228/ 4 -4/ 0/ 4/ 3 239 XX 0 <-> 5 224/225/228/ 2 220/223/228/ 3 -2/ 1/ 2/ 2 239 XX 0 <-> 6 220/226/232/ 2 224/227/232/ 2 -4/ -1/ 2/ 1 239 XX 0 <-> 7 224/226/228/ 2 224/227/228/ 2 -2/ -1/ 0/ 1 239 The main bug in the previous version was that it measured mainly the variance of IPC latency due to cache misses. Cache misses apparently sometimes cost 1400 cycles on i4790k. This was shown by a max vs min diference of about 1400 cycles for latij and latji (the min was about 200 as above, but the max was about 1600). In the above, the difference between the max and the min for latij is at most 24. The larger ranges (though no larger variances) for the 0 <-> 1 case are probably not accidental. This case shares a core. Masking interrupts should prevent HTT contention in the other cases. This depends on disabling interrupts or the system being fairly idle. With interrupts enabled and the system not idle, on freefall's Xeon: XX | CPUs | latij | latji | skew | nr | XX |---------+----------------+----------------+-----------------+--------| XX 0 <-> 1 422/482/514/15 423/480/502/12 -39/ 1/ 20/10 239 XX 0 <-> 2 679/691/1340/43 732/744/746/ 5 -33/-27/304/22 238 XX 0 <-> 3 679/688/707/ 8 732/744/746/ 5 -33/-28/-12/ 6 239 XX 0 <-> 4 713/737/753/ 5 726/726/726/ 0 -6/ 5/ 13/ 3 239 XX 0 <-> 5 726/738/964/15 726/726/726/ 0 0/ 6/119/ 8 238 XX 0 <-> 6 686/695/713/13 765/768/778/ 5 -45/-36/-26/ 7 239 XX 0 <-> 7 686/696/713/12 765/768/779/ 5 -46/-36/-26/ 7 239 I just noticed that the minimum latencies exploded from 180-220 in previous versions to 679-726 here. Another run gave the old latencies: XX | CPUs | latij | latji | skew | nr | XX |---------+----------------+----------------+-----------------+--------| XX 0 <-> 1 114/133/150/ 4 120/133/147/ 2 -9/ -0/ 9/ 2 239 XX 0 <-> 2 216/220/222/ 3 213/216/225/ 3 -4/ 2/ 4/ 3 239 XX 0 <-> 3 216/220/222/ 3 213/216/219/ 3 -1/ 2/ 4/ 3 239 XX 0 <-> 4 204/224/234/12 201/220/231/13 -13/ 2/ 16/11 239 XX 0 <-> 5 204/224/234/12 201/220/231/13 -13/ 2/ 16/11 239 XX 0 <-> 6 219/232/246/13 228/229/231/ 1 -6/ 1/ 9/ 7 239 XX 0 <-> 7 219/232/246/13 228/230/231/ 1 -6/ 1/ 9/ 7 239 The skews of -27 through -36 in the first run are probably real. I think rdtsc uses a common 100 MHz clock. Differences of 1 cycle in that give differences of 33 cycles for rdtsc. Apparently the firmware doesn't do a good job of hiding this granularity. The error of 1 cycle at 100 MHz persists for many TSC cycles (for more like 4000 samples than the 239 used here), and across most CPU pairs in a run. i4790K shows the same behaviour of skews of 1 cycle at 100 MHz, but less often (once in every 10-20 runs): XX | CPUs | latij | latji | skew | nr | XX |---------+----------------+----------------+-----------------+--------| XX 0 <-> 1 116/146/156/ 5 128/145/152/ 4 -16/ 1/ 12/ 3 239 XX 0 <-> 2 172/182/280/ 7 256/261/596/24 -206/-39/ 8/12 237 XX 0 <-> 3 176/182/184/ 3 256/258/260/ 2 -42/-38/-36/ 2 239 The TSC frequency is 4.08 GHz, so an error of 1 cycle at 100 MHz is 40.8 rdtsc cycles. XX 0 <-> 4 220/223/228/ 2 220/225/232/ 4 -4/ -1/ 4/ 2 239 XX 0 <-> 5 220/223/228/ 3 224/225/228/ 2 -2/ -1/ 2/ 1 239 XX 0 <-> 6 220/226/228/ 2 224/227/228/ 2 -4/ -1/ 2/ 1 239 XX 0 <-> 7 220/226/232/ 2 224/227/232/ 2 -4/ -1/ 2/ 1 239 Now full output to show many more skews near 41: XX 1 <-> 0 140/145/152/ 2 132/148/156/ 3 -4/ -2/ 8/ 2 239 XX 1 <-> 2 172/185/228/ 9 256/261/552/19 -184/-38/-16/11 239 XX 1 <-> 3 180/182/184/ 2 256/258/260/ 2 -40/-38/-36/ 2 239 XX 1 <-> 4 220/223/224/ 2 220/225/232/ 4 -4/ -1/ 2/ 2 239 XX 1 <-> 5 220/223/228/ 2 220/225/232/ 4 -4/ -1/ 4/ 2 239 XX 1 <-> 6 220/226/232/ 3 224/227/232/ 2 -4/ -1/ 2/ 1 239 XX 1 <-> 7 220/226/232/ 3 224/228/232/ 2 -6/ -1/ 4/ 2 239 XX 2 <-> 0 256/258/260/ 2 180/182/184/ 2 36/ 38/ 40/ 1 239 XX 2 <-> 1 256/258/260/ 2 176/182/184/ 2 36/ 38/ 42/ 2 239 XX 2 <-> 3 148/149/152/ 2 140/144/148/ 3 0/ 3/ 6/ 2 239 XX 2 <-> 4 244/253/256/ 4 168/173/180/ 2 34/ 40/ 42/ 2 239 XX 2 <-> 5 244/252/256/ 4 172/173/176/ 2 36/ 40/ 42/ 2 239 XX 2 <-> 6 248/254/260/ 3 172/176/180/ 2 36/ 39/ 42/ 2 239 XX 2 <-> 7 244/254/260/ 4 172/177/180/ 3 32/ 38/ 42/ 3 239 XX 3 <-> 0 256/258/260/ 2 176/182/184/ 2 36/ 38/ 42/ 2 239 XX 3 <-> 1 256/258/260/ 2 180/182/184/ 2 36/ 38/ 40/ 1 239 XX 3 <-> 2 140/145/152/ 3 144/148/152/ 2 -4/ -2/ 4/ 2 239 XX 3 <-> 4 244/253/256/ 4 168/173/180/ 2 34/ 40/ 42/ 2 239 XX 3 <-> 5 244/253/256/ 4 168/173/180/ 2 34/ 40/ 42/ 2 239 XX 3 <-> 6 248/254/260/ 3 172/176/180/ 2 36/ 39/ 42/ 2 239 XX 3 <-> 7 248/254/260/ 3 172/176/180/ 2 36/ 39/ 42/ 2 239 XX 4 <-> 0 220/224/228/ 2 220/224/228/ 2 -2/ -0/ 2/ 1 239 XX 4 <-> 1 220/225/228/ 3 220/223/228/ 3 -2/ 1/ 4/ 2 239 XX 4 <-> 2 168/173/180/ 2 244/253/256/ 4 -44/-40/-34/ 2 239 XX 4 <-> 3 168/173/180/ 2 244/253/256/ 4 -44/-40/-34/ 2 239 XX 4 <-> 5 128/145/152/ 3 140/145/152/ 2 -8/ -0/ 6/ 2 239 XX 4 <-> 6 212/218/220/ 3 212/218/220/ 3 -4/ 0/ 4/ 3 239 XX 4 <-> 7 212/220/224/ 3 208/216/220/ 3 -4/ 2/ 4/ 2 239 XX 5 <-> 0 220/225/228/ 3 220/223/228/ 3 -4/ 1/ 4/ 3 239 XX 5 <-> 1 220/224/228/ 2 220/224/228/ 2 -2/ -0/ 2/ 1 239 XX 5 <-> 2 168/173/180/ 2 244/253/256/ 4 -44/-40/-34/ 2 239 XX 5 <-> 3 168/173/180/ 2 244/253/256/ 4 -44/-40/-34/ 2 239 XX 5 <-> 4 144/146/148/ 2 144/151/156/ 4 -6/ -3/ 2/ 2 239 XX 5 <-> 6 212/220/224/ 3 208/216/220/ 3 -4/ 2/ 4/ 2 239 XX 5 <-> 7 212/218/220/ 3 212/218/220/ 3 -4/ 0/ 4/ 3 239 XX 6 <-> 0 224/227/232/ 2 220/226/232/ 2 -2/ 1/ 6/ 2 239 XX 6 <-> 1 224/227/232/ 2 220/226/232/ 2 -2/ 1/ 6/ 2 239 XX 6 <-> 2 172/177/180/ 3 244/254/260/ 4 -42/-38/-32/ 3 239 XX 6 <-> 3 172/177/180/ 3 244/254/260/ 4 -42/-38/-32/ 3 239 XX 6 <-> 4 208/216/220/ 3 216/220/224/ 2 -4/ -2/ 2/ 2 239 XX 6 <-> 5 212/218/220/ 3 212/218/220/ 3 -4/ -0/ 4/ 2 239 XX 6 <-> 7 128/147/152/ 4 128/145/156/ 4 -10/ 1/ 6/ 3 239 XX 7 <-> 0 224/227/232/ 2 220/226/228/ 2 -2/ 1/ 6/ 2 239 XX 7 <-> 1 224/227/232/ 2 220/226/228/ 2 -2/ 1/ 6/ 2 239 XX 7 <-> 2 172/176/180/ 2 248/254/260/ 3 -42/-39/-34/ 2 239 XX 7 <-> 3 172/177/180/ 3 244/254/260/ 4 -42/-38/-32/ 3 239 XX 7 <-> 4 208/216/220/ 3 216/220/224/ 2 -4/ -2/ 2/ 2 239 XX 7 <-> 5 212/218/220/ 3 212/218/220/ 3 -4/ -0/ 4/ 2 239 XX 7 <-> 6 132/145/152/ 3 116/147/152/ 4 -8/ -1/ 14/ 2 239 The error of ~41 cycles showed up in 24 out of 56 independent samples (independent except for being sequential, but spread over at least 100k TSC cycles. The variance for all these ~41 cycle skews is the same 2 or 3 as for the small skews, so I think the large skews are not measurement errors. This output was produced by: XX /* pthread and cpu management parts copied from tscdrift.c. */ XX XX /*- XX * Copyright (c) 2014 Advanced Computing Technologies LLC XX * Written by: John H. Baldwin <jhb@FreeBSD.org> XX * All rights reserved. XX * XX * Redistribution and use in source and binary forms, with or without XX * modification, are permitted provided that the following conditions XX * are met: XX * 1. Redistributions of source code must retain the above copyright XX * notice, this list of conditions and the following disclaimer. XX * 2. Redistributions in binary form must reproduce the above copyright XX * notice, this list of conditions and the following disclaimer in the XX * documentation and/or other materials provided with the distribution. XX * XX * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND XX * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE XX * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE XX * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE XX * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL XX * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS XX * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) XX * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT XX * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY XX * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF XX * SUCH DAMAGE. XX */ XX XX #include <sys/param.h> XX #include <sys/cpuset.h> XX XX #include <machine/atomic.h> XX #include <machine/cpufunc.h> XX XX #include <assert.h> XX #include <err.h> XX #include <fcntl.h> XX #include <inttypes.h> XX #include <math.h> XX #include <pthread.h> XX #include <stdio.h> XX #include <stdlib.h> XX #include <string.h> XX XX /* LOGSIZE should be small enough for everything to fit in L1. */ XX #define LOGSIZE 0x100 XX XX static int ncpu; XX /* XX * The sepatrate log buffers are to reduce cache contention. This is not XX * so important with everything in L1. XX */ XX static uint64_t tsc_log_master[LOGSIZE] __aligned(PAGE_SIZE); XX static uint64_t tsc_log_slave[LOGSIZE] __aligned(PAGE_SIZE); XX static volatile u_int tsc_loghead; The main fixes are here. By keeping LOGSIZE significantly less than 4K, the working set is smaller than 32K so fits in L1. Alignment to PAGE_SIZE and separate log buffers are refinements of this. There are a few more smaller refinements and some bug fixes. XX XX /* Bind the current thread to the specified CPU. */ XX static void XX bind_cpu(int cpu) XX { XX cpuset_t set; XX XX CPU_ZERO(&set); XX CPU_SET(cpu, &set); XX if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(set), XX &set) < 0) XX err(1, "cpuset_setaffinity(%d)", cpu); XX } XX XX static void * XX action(void *arg) XX { XX uint64_t *logp; XX int fd, head, master; XX XX master = (intptr_t)arg; XX logp = master ? tsc_log_master : tsc_log_slave; XX XX /* Attempt to fault in everything before disabling interrupts. */ XX memset(logp, 0, sizeof(tsc_log_master)); XX XX /* XX * Wait for the master to bind to its (necessarily different) CPU XX * before disabling interrupts. XX */ XX while (atomic_load_acq_int(&tsc_loghead) == -1) XX ; XX XX /* If possible, run the main loop with interrupts disabled. */ XX if ((fd = open("/dev/io", O_RDONLY)) != -1) XX disable_intr(); XX XX /* Attempt to cache everything again after possible preemption. */ XX memset(logp, 0, sizeof(tsc_log_master)); The last line is a refinement. XX XX /* XX * (head % 1) selects the log buffer (0 = master) and (head / 2) is XX * the index in this buffer. XX */ XX for (head = master ? 0 : 1; head < 2 * LOGSIZE; head += 2) { XX while (atomic_load_acq_int(&tsc_loghead) != head) XX ; XX __asm __volatile("lfence"); XX logp[head / 2] = rdtsc(); XX __asm __volatile("sfence"); /* This reduces variance. */ This reduces the variance from about 4 cycles to 2. XX atomic_store_rel_int(&tsc_loghead, head + 1); XX } The serialization takes longer than I like -- over 200 cycles and it doesn't even have a single cache miss for writing to the log buffer. The fences don't add much to the latency. XX XX if (fd != -1) { XX enable_intr(); XX close(fd); XX } XX return (NULL); XX } XX XX int XX main(int argc, char **argv) XX { XX cpuset_t all_cpus; XX pthread_t *children; XX double avlatij, avlatji, avskew, sdlatij, sdlatji, sdskew; XX double sumlatijsq, sumlatjisq, sumskewsq; XX int *cpus; XX intmax_t latij, latji, maxlatij, maxlatji, minlatij, minlatji; XX intmax_t sumlatij, sumlatji; XX intmax_t maxskew, minskew, skew; XX int fd, error, i, j, nr, r; XX XX /* XX * Find all the CPUs this program is eligible to run on and use XX * this as our global set. This means you can use cpuset to XX * restrict this program to only run on a subset of CPUs. XX */ XX if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1, XX sizeof(all_cpus), &all_cpus) < 0) XX err(1, "cpuset_getaffinity"); XX for (ncpu = 0, i = 0; i < CPU_SETSIZE; i++) XX if (CPU_ISSET(i, &all_cpus)) XX ncpu++; XX cpus = calloc(ncpu, sizeof(*cpus)); XX children = calloc(ncpu, sizeof(*children)); XX for (i = 0, j = 0; i < CPU_SETSIZE; i++) XX if (CPU_ISSET(i, &all_cpus)) { XX assert(j < ncpu); XX cpus[j] = i; XX j++; XX } XX if ((fd = open("/dev/io", O_RDONLY)) != -1) XX close(fd); XX XX printf( XX "| CPUs | latij | latji | skew | nr |\n"); XX printf( XX "|---------+----------------+----------------+-----------------+--------|\n"); XX for (i = 0; i < ncpu; i++) { XX for (j = 0; j < ncpu; j++) { XX if (j == i) XX continue; XX bind_cpu(cpus[j]); XX atomic_store_rel_int(&tsc_loghead, -1); XX error = pthread_create(&children[j], NULL, action, XX (void *)(intptr_t)0); XX if (error) XX errc(1, error, "pthread_create"); XX bind_cpu(cpus[i]); XX atomic_store_rel_int(&tsc_loghead, 0); XX action((void *)(intptr_t)1); XX error = pthread_join(children[j], NULL); XX if (error) XX errc(1, error, "pthread_join"); XX XX nr = 0; XX maxlatij = maxlatji = maxskew = INTMAX_MIN; XX minlatij = minlatji = minskew = INTMAX_MAX; XX sumlatij = sumlatji = 0; XX sumlatijsq = sumlatjisq = sumskewsq = 0; XX /* Skip early entries to let caches warm up. */ The r == 0 entries also have high variance for synchronization. XX for (r = 16; r < LOGSIZE - 1; r++) { XX latij = tsc_log_slave[r] - tsc_log_master[r]; XX latji = tsc_log_master[r + 1] - XX tsc_log_slave[r]; I tried discarding samples lying on cache line boundaries. This helps for buffers too large to fit in L1, but doesn't give such a low variance as with no cache misses at all. XX /* XX * XXX: discard samples that appear to have XX * been preempted. XX */ XX if (fd == -1 && (latij > 1999 || latji > 1999)) XX continue; The distribution of latencies on freefall is usually many about 200 cycles, then many below 1000 cycles for cache misses, then nothing between 1000 and 3000 cycles, then a few for context switches going up to not much more than 30000 cycles. So above 1999 usually means above 999 and the 3-column formatting usually works. XX if (maxlatij < latij) XX maxlatij = latij; XX if (minlatij > latij) XX minlatij = latij; XX if (maxlatji < latji) XX maxlatji = latji; XX if (minlatji > latji) XX minlatji = latji; XX sumlatij += latij; XX sumlatji += latji; XX sumlatijsq += latij * latij; XX sumlatjisq += latji * latji; XX skew = latij - latji; XX if (maxskew < skew) XX maxskew = skew; XX if (minskew > skew) XX minskew = skew; XX sumskewsq += skew * skew; XX nr++; XX } XX avlatij = (double)sumlatij / nr; XX sdlatij = sqrt(sumlatijsq / nr - avlatij * avlatij); XX avlatji = (double)sumlatji / nr; XX sdlatji = sqrt(sumlatjisq / nr - avlatji * avlatji); XX avskew = avlatij - avlatji; XX sdskew = sqrt(sumskewsq / nr - avskew * avskew); XX printf( XX "%2d <-> %2d %3jd/%3.0f/%3jd/%2.0f %3jd/%3.0f/%3jd/%2.0f %4jd/%3.0f/%3jd/%2.0f %6d\n", XX cpus[i], cpus[j], XX minlatij, avlatij, maxlatij, sdlatij, XX minlatji, avlatji, maxlatji, sdlatji, XX minskew / 2, avskew / 2, maxskew / 2, sdskew / 2, XX nr); XX } XX } XX } Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20190324035947.D3250>