From owner-svn-src-all@freebsd.org Fri Mar 22 15:48:24 2019 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0767E154995E; Fri, 22 Mar 2019 15:48:24 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id 20CA988716; Fri, 22 Mar 2019 15:48:22 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from [192.168.0.102] (c110-21-101-228.carlnfd1.nsw.optusnet.com.au [110.21.101.228]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id 4C1663D93F8; Sat, 23 Mar 2019 02:48:19 +1100 (AEDT) Date: Sat, 23 Mar 2019 02:48:18 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Mark Johnston cc: rgrimes@freebsd.org, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386 In-Reply-To: <20190321193237.GF8186@raichu> Message-ID: <20190323013628.U2614@besplex.bde.org> References: <201903210252.x2L2qMSP022374@repo.freebsd.org> <201903210320.x2L3KIjc058579@gndrsh.dnsmgr.net> <20190321034917.GA8186@raichu> <20190321180738.P1024@besplex.bde.org> <20190321193237.GF8186@raichu> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=UJetJGXy c=1 sm=1 tr=0 a=PalzARQSbocsUSjMRkwAPg==:117 a=PalzARQSbocsUSjMRkwAPg==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=kj9zAlcOel0A:10 a=6I5d2MoRAAAA:8 a=VmXiJ3vLAS_rSwd8VCMA:9 a=PWqCa8tJaKY79Inz:21 a=VyaIJ5dg8WsOghWd:21 a=CjuIK1q_8ugA:10 a=IjZwj45LgO3ly-622nXo:22 X-Rspamd-Queue-Id: 20CA988716 X-Spamd-Bar: ------ Authentication-Results: mx1.freebsd.org X-Spamd-Result: default: False [-6.94 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-1.000,0]; NEURAL_HAM_SHORT(-0.94)[-0.941,0]; NEURAL_HAM_LONG(-1.00)[-1.000,0]; REPLY(-4.00)[] X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 22 Mar 2019 15:48:24 -0000 On Thu, 21 Mar 2019, Mark Johnston wrote: > On Thu, Mar 21, 2019 at 10:44:49PM +1100, Bruce Evans wrote: >> On Wed, 20 Mar 2019, Mark Johnston wrote: >> >>> On Wed, Mar 20, 2019 at 08:20:18PM -0700, Rodney W. Grimes wrote: >>>>> Author: markj >>>>> Date: Thu Mar 21 02:52:22 2019 >>>>> New Revision: 345359 >>>>> URL: https://svnweb.freebsd.org/changeset/base/345359 >>>>> >>>>> Log: >>>>> Don't attempt to measure TSC skew when running as a VM guest. >>>>> >>>>> It simply doesn't work in general since VCPUs may migrate between >>>>> physical cores. The approach used to measure skew also doesn't >>>>> make much sense in a VM. >> >> I think it simply doesn't work. >> ... >> The method is bogus (buggy) out of a VM. >> >> XX static void >> XX dtrace_gethrtime_init(void *arg) >> XX { >> XX struct pcpu *pc; >> XX cpuset_t map; >> XX int i; >> XX #endif >> XX >> XX /* The current CPU is the reference one. */ >> XX sched_pin(); >> >> It is also necessary to disable interrupts in hardware. Or maybe better >> and necessary, detect when they occur and discard related samples. It >> is impossible to disable NMIs. > > Not quite, just carry out the measurement from an NMI handler. :) :-) I can break that by single-stepping it in ddb. That is similar to emulation wit random large stretching of cycles. >> I plan to try a variation of this: all CPUs spin reading the TSC and >> appending the result atomically to a log buffer. Then analyze the results. >> With no skew or interrupts, it is expected that the results are almost >> monotonic. Contention causes non-monotonicity only occasionally. The >> average non-mononicity should be 0. If there is significant skew, then >> the order for the sameples form each pair of CPUs will reverse on almost >> every second reading. The skew is the adjustment needed to stop this >> reversal. I tried this. The free-running method sort of works, but it is better to serialize everything. The result of the serialization is essentially the same as in the fixed tscdrift, though I used a log buffer index instead of a gate variable to synchronize. The method is to calculate the serialization latency in both directions, with each direction using identically code and hopefully identical CPU state after warming up caches. The skew is then the difference of the latency divided by 2. Oops, I forgot to divide by 2. This gives the following output on i4790K: XX | CPUs | latij | latji | skew | nr | XX |---------+----------------+----------------+-----------------+--------| XX 0 <-> 1 72/143/460/ 7 68/143/460/ 5 -316/ 0/416/ 9 524286 latij is latency from CPU i to CPU j. nr is the number of samples (512K rdtsc()s on each CPU in the pair, with a few discards for latencies above 999 cycles discarded. The i4790K was run with interrupts disabled, so I don't know why there were so many discards). The 4 values in each of the 3 middle columns are min/avg/max/sdev. Note: for CPUs on the same core, the latencies are about 120 cycles smaller and the averag skews are very close to 0. This and latencies never being negative show that the method basically works. XX 0 <-> 2 192/263/992/25 192/258/992/16 -724/ 5/748/29 524267 XX 0 <-> 3 192/263/992/25 188/258/992/15 -724/ 5/748/29 524266 XX 0 <-> 4 188/255/964/21 188/273/964/42 -724/-18/752/46 524266 XX 0 <-> 5 188/255/976/20 188/273/976/42 -724/-18/752/45 524274 XX 0 <-> 6 200/281/996/20 200/270/996/14 -728/ 11/752/22 524273 XX 0 <-> 7 200/281/964/20 196/270/964/13 -728/ 10/752/22 524276 There does seem to be some skew. Parts of rest of table collected manually and repeated here for comparison with the above: XX 2 <-> 0 188/258/968/16 192/263/968/25 -768/ -5/760/29 524273 XX 3 <-> 0 188/259/988/15 188/263/988/26 -768/ -5/760/29 524282 XX 4 <-> 0 188/271/996/40 188/255/996/20 -768/ 16/764/45 524275 XX 5 <-> 0 192/271/952/40 188/255/952/20 -768/ 16/768/45 524280 XX 6 <-> 0 196/268/992/16 200/280/992/23 -768/-12/768/27 524276 XX 7 <-> 0 200/268/976/15 196/280/976/22 -768/-12/768/27 524283 The signs of the average skews are reversed perfectly when the direction is reversed, and the values are reversed almost perfectly too. This wasn't so clear on freefall's Xeon with interrupts enabled. Otherwise, the behaviour on freefall is similar. XX 1 <-> 0 72/143/476/10 72/143/476/ 5 -728/ -0/752/11 524286 XX 1 <-> 2 192/263/944/25 192/258/944/15 -728/ 5/752/29 524268 XX 1 <-> 3 192/263/948/25 192/258/948/15 -728/ 5/752/29 524264 XX 1 <-> 4 188/255/996/20 188/274/996/42 -740/-18/760/45 524264 XX 1 <-> 5 188/255/996/21 192/273/996/42 -768/-18/760/45 524272 XX 1 <-> 6 196/280/996/21 196/270/996/14 -768/ 10/760/23 524270 XX 1 <-> 7 196/281/952/20 200/270/952/14 -768/ 11/760/23 524275 XX 2 <-> 0 188/258/968/16 192/263/968/25 -768/ -5/760/29 524273 XX 2 <-> 1 188/258/928/16 192/263/928/25 -768/ -5/760/29 524268 XX 2 <-> 3 72/144/504/ 9 72/144/504/ 5 -768/ 0/760/10 524286 XX 2 <-> 4 184/251/988/17 188/269/988/40 -768/-19/760/43 524275 XX 2 <-> 5 184/251/996/17 184/269/996/40 -768/-18/760/43 524271 XX 2 <-> 6 196/263/904/17 196/266/904/14 -768/ -3/760/22 524278 XX 2 <-> 7 192/264/944/17 200/266/944/14 -768/ -3/760/22 524275 XX 3 <-> 0 188/259/988/15 188/263/988/26 -768/ -5/760/29 524282 XX 3 <-> 1 192/258/996/16 192/263/996/25 -768/ -5/760/29 524270 XX 3 <-> 2 84/143/480/ 8 52/142/480/ 5 -768/ 1/760/ 9 524287 XX 3 <-> 4 184/251/960/17 188/270/960/40 -768/-19/764/43 524273 XX 3 <-> 5 184/251/980/18 188/269/980/40 -768/-19/764/43 524271 XX 3 <-> 6 192/264/940/17 196/266/940/14 -768/ -3/764/21 524276 XX 3 <-> 7 196/263/996/17 200/266/996/14 -768/ -3/764/21 524277 XX 4 <-> 0 188/271/996/40 188/255/996/20 -768/ 16/764/45 524275 XX 4 <-> 1 192/272/968/41 192/255/968/20 -768/ 17/764/45 524279 XX 4 <-> 2 188/268/996/39 188/251/996/16 -768/ 17/768/41 524276 XX 4 <-> 3 188/268/884/38 188/251/884/16 -768/ 16/768/41 524273 XX 4 <-> 5 64/144/640/ 8 52/144/640/ 5 -768/ -1/768/10 524286 XX 4 <-> 6 188/284/880/42 192/258/880/15 -768/ 26/768/44 524282 XX 4 <-> 7 188/284/980/42 184/258/980/15 -768/ 26/768/44 524275 Latencies even seem to be transitive. According to the above, CPU0 is 16-18 cycles behind CPUs 4-5 and 10-12 cycles ahead of CPUs 6-7. 26 in the above is consistent with this (CPU 4 is 25 cycles ahead of CPUs 6-7). XX 5 <-> 0 192/271/952/40 188/255/952/20 -768/ 16/768/45 524280 XX 5 <-> 1 192/271/988/40 184/255/988/20 -768/ 16/768/45 524261 XX 5 <-> 2 188/268/884/39 188/251/884/16 -768/ 17/768/41 524275 XX 5 <-> 3 188/269/972/39 188/251/972/16 -768/ 17/768/41 524273 XX 5 <-> 4 72/143/512/ 8 52/142/512/ 5 -768/ 1/768/10 524287 XX 5 <-> 6 188/282/956/41 192/258/956/15 -768/ 24/768/44 524276 XX 5 <-> 7 188/283/812/42 192/258/812/15 -768/ 25/768/44 524269 XX 6 <-> 0 196/268/992/16 200/280/992/23 -768/-12/768/27 524276 XX 6 <-> 1 200/268/984/16 200/280/984/23 -768/-12/768/27 524266 XX 6 <-> 2 196/266/864/15 196/265/864/16 -768/ 1/768/21 524273 XX 6 <-> 3 196/266/940/15 196/266/940/16 -768/ 0/768/21 524275 XX 6 <-> 4 192/259/964/16 192/287/964/44 -768/-28/768/46 524277 XX 6 <-> 5 188/259/992/16 196/287/992/44 -768/-28/768/47 524274 XX 6 <-> 7 88/143/732/ 8 52/143/732/ 6 -768/ -0/768/10 524287 XX 7 <-> 0 200/268/976/15 196/280/976/22 -768/-12/768/27 524283 XX 7 <-> 1 196/268/996/16 196/279/996/22 -768/-11/768/27 524277 XX 7 <-> 2 196/266/976/15 196/265/976/16 -768/ 1/768/21 524272 XX 7 <-> 3 196/266/988/15 196/265/988/16 -768/ 0/768/21 524266 XX 7 <-> 4 188/259/860/16 196/287/860/44 -768/-28/768/46 524269 XX 7 <-> 5 188/259/928/16 200/287/928/44 -768/-28/768/47 524283 XX 7 <-> 6 72/144/992/ 8 52/142/992/ 5 -852/ 1/768/ 9 524287 > ... > I don't think it makes much sense to measure skew on systems with an > invariant TSC. If the TSC is not invariant, measuring it once at load > time is likely not enough. Based on TSC-invariant systems I looked at, > the measured skew is small enough that the erroneous deltas probably > don't cause much harm. In VMs, the deltas are large enough that they > create noticeable problems. I agree, but dtrace always does it. On a Turion2 system with a non- invariant TSC that is unusable as a timecounter, dtrace init produces: tsc_skew[0] = 0 tsc_skew[1] = -397099 (decimal) and the above program produces garbage (NaNs for division of 0 by 0) due to all latencies being too large to use. After usingall latencies, it gives: | CPUs | latij | latji | skew | nr | |---------+----------------+----------------+-----------------+--------| 0 <-> 1 -1259586/-1259530/1262126/123 1260084/1260136/1262126/126 -2521678/-2519667/ 2/130 524287 1 <-> 0 1260089/1260144/-1257494/122 -1259603/-1259538/-1257494/126 -2521678/2519683/2521496/131 524287 The large latencies and skews don't change noticeably on repeated runs. They are about 1260k. This is very different from the 397k seen by dtrace init. The above output was produced by: XX /** pthread and cpu management parts copied from tscdrift.c. */ XX XX /*- XX * Copyright (c) 2014 Advanced Computing Technologies LLC XX * Written by: John H. Baldwin XX * All rights reserved. XX * XX * Redistribution and use in source and binary forms, with or without XX * modification, are permitted provided that the following conditions XX * are met: XX * 1. Redistributions of source code must retain the above copyright XX * notice, this list of conditions and the following disclaimer. XX * 2. Redistributions in binary form must reproduce the above copyright XX * notice, this list of conditions and the following disclaimer in the XX * documentation and/or other materials provided with the distribution. XX * XX * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND XX * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE XX * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE XX * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE XX * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL XX * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS XX * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) XX * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT XX * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY XX * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF XX * SUCH DAMAGE. XX */ XX XX #include XX #include XX XX #include XX #include XX XX #include XX #include XX #include XX #include XX #include XX #include XX #include XX #include XX #include XX XX #define LOGSIZE 0x100000 XX XX static volatile u_int nready; XX static int ncpu; XX static volatile uint64_t tsc_log[LOGSIZE]; XX static volatile uint tsc_loghead; XX XX /* Bind the current thread to the specified CPU. */ XX static void XX bind_cpu(int cpu) XX { XX cpuset_t set; XX XX CPU_ZERO(&set); XX CPU_SET(cpu, &set); XX if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1, sizeof(set), XX &set) < 0) XX err(1, "cpuset_setaffinity(%d)", cpu); XX } XX XX static void * XX action(void *arg) XX { XX int cpu, fd, head, master; XX XX cpu = (intptr_t)arg; XX master = cpu & 0x80000000; XX cpu &= 0x7fffffff; XX bind_cpu(cpu); XX XX fd = open("/dev/io", O_RDONLY); XX XX /* Attempt to fault in everything before disabling interrupts. */ XX memset((void *)tsc_log, 0, sizeof(tsc_log)); XX XX /* Start logging on all CPUs at almost the same time. */ XX atomic_add_int(&nready, 1); XX while (atomic_load_int(&nready) < 2) XX ; XX XX /* If possible, run throughout with interrupts disabled. */ XX if (fd != -1) XX disable_intr(); XX XX head = master ? 0 : 1; XX for (;;) { XX if (head >= LOGSIZE) XX break; XX while (atomic_load_acq_int(&tsc_loghead) != head) XX ; XX __asm __volatile("lfence"); XX tsc_log[head] = rdtsc(); XX atomic_store_rel_int(&tsc_loghead, head + 1); XX head += 2; XX } XX if (fd != -1) XX enable_intr(); XX return (NULL); XX } XX XX int XX main(int argc, char **argv) XX { XX cpuset_t all_cpus; XX pthread_t *children; XX double avlatij, avlatji, avskew, sdlatij, sdlatji, sdskew; XX double sumlatijsq, sumlatjisq, sumskewsq; XX int *cpus; XX intmax_t latij, latji, maxlatij, maxlatji, minlatij, minlatji; XX intmax_t sumlatij, sumlatji; XX intmax_t skew, maxskew, minskew; XX int error, i, j, nr, r; XX XX /* XX * Find all the CPUs this program is eligible to run on and use XX * this as our global set. This means you can use cpuset to XX * restrict this program to only run on a subset of CPUs. XX */ XX if (cpuset_getaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1, XX sizeof(all_cpus), &all_cpus) < 0) XX err(1, "cpuset_getaffinity"); XX for (ncpu = 0, i = 0; i < CPU_SETSIZE; i++) XX if (CPU_ISSET(i, &all_cpus)) XX ncpu++; XX cpus = calloc(ncpu, sizeof(*cpus)); XX children = calloc(ncpu, sizeof(*children)); XX for (i = 0, j = 0; i < CPU_SETSIZE; i++) XX if (CPU_ISSET(i, &all_cpus)) { XX assert(j < ncpu); XX cpus[j] = i; XX j++; XX } XX XX printf( XX "| CPUs | latij | latji | skew | nr |\n"); XX printf( XX "|---------+----------------+----------------+-----------------+--------|\n"); XX for (i = 0; i < ncpu; i++) { XX for (j = 0; j < ncpu; j++) { XX if (j == i) XX continue; XX atomic_store_rel_int(&nready, 0); XX atomic_store_rel_int(&tsc_loghead, 0); XX error = pthread_create(&children[j], NULL, action, XX (void *)(intptr_t)cpus[j]); XX if (error) XX errc(1, error, "pthread_create"); XX action((void *)(intptr_t)(0x80000000 | cpus[i])); XX error = pthread_join(children[j], NULL); XX if (error) XX errc(1, error, "pthread_join"); XX XX nr = 0; XX maxlatij = maxlatji = INTMAX_MIN; XX minlatij = minlatji = INTMAX_MAX; XX sumlatij = sumlatji = 0; XX sumlatijsq = sumlatjisq = sumskewsq = 0; XX for (r = 0; r < LOGSIZE - 2; r += 2) { XX latij = tsc_log[r + 1] - tsc_log[r]; XX latji = tsc_log[r + 2] - tsc_log[r + 1]; XX #if 1 /* XXX discard samples which appear to have been preempted */ XX if (latij > 999 || latji > 999) XX continue; XX #endif XX if (maxlatij < latij) XX maxlatij = latij; XX if (minlatij > latij) XX minlatij = latij; XX if (maxlatji < latji) XX maxlatji = latji; XX if (minlatji > latji) XX minlatji = latji; XX sumlatij += latij; XX sumlatji += latji; XX sumlatijsq += latij * latij; XX sumlatjisq += latji * latji; XX skew = latij - latji; XX if (maxskew < skew) XX maxskew = skew; XX if (minskew > skew) XX minskew = skew; XX sumskewsq += skew * skew; XX nr++; XX } XX avlatij = (double)sumlatij / nr; XX sdlatij = sqrt(sumlatijsq / nr - avlatij * avlatij); XX avlatji = (double)sumlatji / nr; XX sdlatji = sqrt(sumlatjisq / nr - avlatji * avlatji); XX avskew = avlatij - avlatji; XX sdskew = sqrt(sumskewsq / nr - avskew * avskew); XX printf( XX "%2d <-> %2d %3jd/%3.0f/%3jd/%2.0f %3jd/%3.0f/%3jd/%2.0f %4jd/%3.0f/%3jd/%2.0f %6d\n", XX cpus[i], cpus[j], XX minlatij, avlatij, maxlatji, sdlatij, XX minlatji, avlatji, maxlatji, sdlatji, XX minskew, avskew, maxskew, sdskew, nr); XX } XX } XX } Unfortunately, the measured skews change a lot. Some runs gave CPUs 2-3 at skew average distance -105 from CPU 0, but CPUs 4-7 at distance +-0, And that is with 512K samples. I expected averaging over a few hundred samples would be enough. Bruce