Date: Thu, 21 Mar 2019 22:44:49 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Mark Johnston
cc: rgrimes@freebsd.org, src-committers@freebsd.org, svn-src-all@freebsd.org,
    svn-src-head@freebsd.org
Subject: Re: svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386
In-Reply-To: <20190321034917.GA8186@raichu>
Message-ID: <20190321180738.P1024@besplex.bde.org>
References: <201903210252.x2L2qMSP022374@repo.freebsd.org>
 <201903210320.x2L3KIjc058579@gndrsh.dnsmgr.net> <20190321034917.GA8186@raichu>

On Wed, 20 Mar 2019, Mark Johnston wrote:

> On Wed, Mar 20, 2019 at 08:20:18PM -0700, Rodney W. Grimes wrote:
>>> Author: markj
>>> Date: Thu Mar 21 02:52:22 2019
>>> New Revision: 345359
>>> URL: https://svnweb.freebsd.org/changeset/base/345359
>>>
>>> Log:
>>>   Don't attempt to measure TSC skew when running as a VM guest.
>>>
>>>   It simply doesn't work in general since VCPUs may migrate between
>>>   physical cores.  The approach used to measure skew also doesn't
>>>   make much sense in a VM.

I think it simply doesn't work.

jhb's src/tools/tools/tscdrift/tscdrift.c also simply doesn't work for
measuring drift.  It mostly measures IPC latency and jitter from
preemption in attempts to measure skew.  It might be possible to
determine the current skew using a statistical analysis of the latency
and a few more measurements (measure the latency between each CPU in a
pair instead of only from CPU0 to other CPUs), but it would be better
to do more direct measurements.

>> "May" is the important aspect here, and to my knowledge both
>> bhyve and VMware provide a way to pin CPUs, so there should be
>> a way for us to turn this back on if it is desired.  Sticking
>> it under the big knob vm_guest is probably not the most flexible
>> solution.
>>
>> Could we please have some way to force this back on?
>
> Even with pinning the skew computation is bogus in a VM.  On an idle
> host running bhyve with pinned vCPUs it gives offsets that are several
> orders of magnitude larger than on the host.  I would prefer to see a
> specific reason for wanting the old behaviour before adding a knob.

The method is bogus (buggy) even outside of a VM.

XX static void
XX dtrace_gethrtime_init(void *arg)
XX {
XX 	struct pcpu *pc;
XX 	cpuset_t map;
XX 	int i;
XX #endif
XX
XX 	/* The current CPU is the reference one. */
XX 	sched_pin();

It is also necessary to disable interrupts in hardware.  Or maybe better
and necessary, detect when they occur and discard related samples.  It is
impossible to disable NMIs.  Then there are SMIs.  Not to mention
virtualization.  At best, it gives delays a bit like for SMIs, but longer.
Since this is init code, perhaps interrupts are already masked in hardware,
but to detect drift in the skews, it is necessary to run a routine like
this quite often.

XX 	tsc_skew[curcpu] = 0;
XX 	CPU_FOREACH(i) {
XX 		if (i == curcpu)
XX 			continue;
XX
XX 		pc = pcpu_find(i);
XX 		CPU_SETOF(PCPU_GET(cpuid), &map);
XX 		CPU_SET(pc->pc_cpuid, &map);
XX
XX 		smp_rendezvous_cpus(map, NULL,
XX 		    dtrace_gethrtime_init_cpu,
XX 		    smp_no_rendezvous_barrier, (void *)(uintptr_t) i);

Some synchronization is necessary, but this is very heavyweight.  Even the
lightweight synchronization in tscdrift.c has an average latency of about
150-200 cycles on freefall's Xeon, after fixing bugs in it:

CPU | TSC skew (min/avg/max/stddev)
----+------------------------------
  0 |     0     0      0    0.000
  1 |    36    65  47913  101.227	# lower latency on same core
  2 |   144   180  52977  139.862
  3 |   147   180  34077  113.543
  4 |   135   181  49335  228.104
  5 |   123   175  33300   91.703
  6 |   165   188 129825  166.999
  7 |   165   189 248010  278.321

tscdrift measures from CPU0 to CPU_n for each n.  After fixing it to read
the TSC on CPU0 before doing an IPC to allow reading the TSC on CPU_n,
this measures mainly the IPC latency.  Actually reading the TSC takes
about 24 cycles.  The skew appears to be 0 for all CPUs.  Anyway, it is
so much smaller than the IPC latency that it can be ignored, provided it
doesn't drift to become non-small, and applications don't try to be too
smart about correcting for the latency.

The unfixed tscdrift allows the other CPU to read the TSC before reading
it itself.  This is more like the above.  It still ends up measuring
mainly IPC latency and other bugs in the program, since the read of the
TSC on the same CPU occurs relatively instantly without the other bugs,
and out of order (thus possibly earlier than instantly) with the other
bugs.  The IPC delay is about the same as above, but it takes more
complicated statistics to prove this.

XX
XX 		tsc_skew[i] = tgt_cpu_tsc - hst_cpu_tsc;

This has even more complicated ordering.  The IPC delay depends on the
internals of smp_rendezvous_cpus().  It should work better to schedule
comparisons of 2 CPUs using a third CPU (this is possible except with
precisely 2 CPUs).  2 CPUs sit waiting for permission to read their TSC.
Another CPU sets a variable to grant permission.

I plan to try a variation of this: all CPUs spin reading the TSC and
appending the result atomically to a log buffer.  Then analyze the
results.  With no skew or interrupts, it is expected that the results
are almost monotonic.  Contention causes non-monotonicity only
occasionally.  The average non-monotonicity should be 0.  If there is
significant skew, then the order of the samples from each pair of CPUs
will reverse on almost every second reading.  The skew is the adjustment
needed to stop this reversal.
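A minimal userland sketch of that log-buffer experiment (hypothetical, not
an existing tool; the names NSAMPLES, struct rec and spinner() are made up,
and it assumes FreeBSD's pthread_setaffinity_np() and the compiler's
x86intrin.h builtins):

	#include <sys/param.h>
	#include <sys/cpuset.h>

	#include <pthread.h>
	#include <pthread_np.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <x86intrin.h>

	#define	NSAMPLES	(1 << 20)
	#define	NCPUS		2

	struct rec {
		int		cpu;
		uint64_t	tsc;
	};

	static struct rec rec_log[NSAMPLES];
	static atomic_uint rec_idx;

	static void *
	spinner(void *arg)
	{
		cpuset_t set;
		int cpu = (int)(intptr_t)arg;

		/* Pin so that every TSC read comes from a known CPU. */
		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		if (pthread_setaffinity_np(pthread_self(), sizeof(set),
		    &set) != 0)
			return (NULL);

		for (;;) {
			_mm_lfence();	/* keep rdtsc from running early */
			uint64_t tsc = __rdtsc();
			/* Append atomically by reserving the next slot. */
			unsigned i = atomic_fetch_add(&rec_idx, 1);
			if (i >= NSAMPLES)
				break;
			rec_log[i].cpu = cpu;
			rec_log[i].tsc = tsc;
		}
		return (NULL);
	}

	int
	main(void)
	{
		pthread_t td[NCPUS];
		uint64_t reversals = 0;

		for (int c = 0; c < NCPUS; c++)
			pthread_create(&td[c], NULL, spinner,
			    (void *)(intptr_t)c);
		for (int c = 0; c < NCPUS; c++)
			pthread_join(td[c], NULL);

		/*
		 * With zero skew the log should be almost monotonic;
		 * large skew should reverse the order of samples from
		 * different CPUs on almost every second reading.
		 */
		for (unsigned i = 1; i < NSAMPLES; i++)
			if (rec_log[i].cpu != rec_log[i - 1].cpu &&
			    rec_log[i].tsc < rec_log[i - 1].tsc)
				reversals++;
		printf("cross-CPU reversals: %ju\n", (uintmax_t)reversals);
		return (0);
	}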
Logs of the TSC for a single CPU show that the granularity is about 18
on freefall's Xeon.  I don't know yet if this is more closely related to
the clock rate or the rdtsc latency.  The TSC frequency is 3.3 GHz.  I
recently learned that a common implementation uses a shared clock with a
frequency of only 100 MHz.  With no interpolation of this, the granularity
is 33 cycles.  rdtsc itself takes about 18 cycles, so not much is gained
from any interpolation, but some seems to be done.  On a Haswell i7-4790K
at 4 GHz, IIRC rdtsc seemed to take 24 cycles.  I haven't done the test
to see if non-multiples of 24 are returned.  A 100 MHz clock would have a
granularity of 40 cycles, so there is apparently some interpolation.  A
commit log message by mav@ says that some Skylakes never return odd values
for rdtsc, and some appear to have a granularity of about 186 (?) cycles.
freefall's Xeon does return odd values for rdtsc.

XX 	}
XX 	sched_unpin();
XX }
XX ...
XX /*
XX  * DTrace needs a high resolution time function which can
XX  * be called from a probe context and guaranteed not to have
XX  * instrumented with probes itself.
XX  *
XX  * Returns nanoseconds since boot.
XX  */

dtrace does too much hand-rolling here, but it has to do something to
handle the 2 problems mentioned in the comment.  Other subsystems handle
this worse (*).  But it shouldn't try harder than the TSC timecounter to
handle skew.  The TSC timecounter ignores the problem of skew, but does
other things more carefully.

XX uint64_t
XX dtrace_gethrtime(void)
XX {
XX 	uint64_t tsc;
XX 	uint32_t lo, hi;
XX 	register_t rflags;
XX
XX 	/*
XX 	 * We split TSC value into lower and higher 32-bit halves and separately
XX 	 * scale them with nsec_scale, then we scale them down by 2^28
XX 	 * (see nsec_scale calculations) taking into account 32-bit shift of
XX 	 * the higher half and finally add.
XX 	 */
XX 	rflags = intr_disable();
XX 	tsc = rdtsc() - tsc_skew[curcpu];
XX 	intr_restore(rflags);

This is missing the fencing needed to serialize rdtsc().  Unfixed
tscdrift is missing fencing in one place.  Not serializing rdtsc() gives
errors of up to about 80 cycles in variations of tscdrift.  These errors
are larger in the place where the fencing is _not_ missing, due to the
details of the synchronization: while spinning waiting for the "gate" to
open, without fencing rdtsc() is executed out of order up to about 80
cycles early.  But the "master" usually doesn't spin for so long, so its
unfenced rdtsc (in the unfixed version) is usually not the full 80 cycles
early.  The "slave" usually spins for more than 80 cycles, but its rdtsc
is fenced so it is never early.  This gives a bias towards increasing the
measured difference slave_tsc - master_tsc.  The average difference is
over 200 instead of about 180.  (A sketch of where a fence could go in
the quoted code is below, after the rest of the function.)

Disabling interrupts in the above seems to be to prevent curcpu changing.
The comment says nothing about this, or about anything in the block of
code that it is attached to.

XX
XX 	lo = tsc;
XX 	hi = tsc >> 32;
XX 	return (((lo * nsec_scale) >> SCALE_SHIFT) +
XX 	    ((hi * nsec_scale) << (32 - SCALE_SHIFT)));

The comment applies to this block of code, from which it is separated.

XX }

I think subtracting the skew in the above is worse than useless.  It
works OK only when the skew is 0.
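As mentioned above, a sketch of where the missing fence might go in the
quoted dtrace_gethrtime() (an illustration only, not what the tree does;
lfence serializes rdtsc on Intel CPUs and mfence is needed on AMD, as
discussed further down):

	rflags = intr_disable();
	/*
	 * Prevent rdtsc() from being executed early, out of order with
	 * preceding loads (lfence on Intel; AMD needs mfence).
	 */
	__asm __volatile("lfence" ::: "memory");
	tsc = rdtsc() - tsc_skew[curcpu];
	intr_restore(rflags);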
Suppose that the actual skew is 0.  It must be 0, else more important
things like the TSC timecounter are broken.  It is surprising that the
TSC timecounter works so well.  Since it does seem to work, the skew must
be 0 initially and remain 0 on most systems, except probably for jitter
in the 1/(100 MHz) range from the common clock and from distributing this
clock over large distances.  But by attempting to calculate the skew, we
risk calculating only IPC delays.  The rendezvous function wasn't
designed for getting the precise timing needed to minimize these delays,
or even to know what they are.

I looked at it a bit.  Indeed, it has a master-slave relationship that
cannot work as needed:

XX 	/* Pass rendezvous parameters via global variables. */
XX 	smp_rv_ncpus = ncpus;
XX 	smp_rv_setup_func = setup_func;
XX 	smp_rv_action_func = action_func;
XX 	smp_rv_teardown_func = teardown_func;
XX 	smp_rv_func_arg = arg;
XX 	smp_rv_waiters[1] = 0;
XX 	smp_rv_waiters[2] = 0;
XX 	smp_rv_waiters[3] = 0;
XX 	atomic_store_rel_int(&smp_rv_waiters[0], 0);

The slaves are going to have a lot of latency from just parsing this.

XX
XX 	/*
XX 	 * Signal other processors, which will enter the IPI with
XX 	 * interrupts off.
XX 	 */
XX 	curcpumap = CPU_ISSET(curcpu, &map);
XX 	CPU_CLR(curcpu, &map);
XX 	ipi_selected(map, IPI_RENDEZVOUS);

IPIs give large, variable latency.

XX 	/* Check if the current CPU is in the map */
XX 	if (curcpumap != 0)
XX 		smp_rendezvous_action();

We just sent IPIs to the other CPUs.  They may have executed the action
function already, but more likely they have barely seen the IPI yet (it
is not even an NMI).  Here we execute the action function for curcpu.
It just reads the TSC for the current CPU.  At least the execution is
complicated enough that the rdtsc in the action function is hard to
execute far out of order, so it doesn't really need fencing.

XX [...]

No problem waiting for the other CPUs.  The problem is getting them to
call the action function with tiny, or at least known, delays.

I think this ends up calculating IPC delays much like tscdrift, but
larger due to the very heavyweight IPI method (errors in microseconds
instead of only about 180/3.3 ~= 55 nanoseconds?).  Other CPUs on the
same core are closer, so the errors from the IPC delays for them are
smaller, but probably the IPI delays dominate the errors.

So attempting to calculate skews seems to turn skews of 0 into large
errors:
- no error for the reference CPU
- an error of approx. -(IPI_delay + memory_IPC_delay + {1 or 2} *
  rdtsc_latency + rdtsc_nonserialization_errors), where memory_IPC_delay
  is much larger for CPUs not on the same core as the "master".

Such enormous errors might not be too bad.  Times read on different CPUs
are just incomparable without establishing an order based on something
like reference events involving a locked variable (see the sketch below).
IPCs for establishing an order take about as long as the memory_IPC_delay
that causes the IPCs to miscalculate the skews.
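A hedged sketch of such a reference event (hypothetical code, not an
existing tool; ref_tsc and observer() are made-up names, and CPU pinning
as in the earlier sketch is omitted for brevity): one thread publishes
its TSC through an atomic variable, the other waits until it observes
the store and then reads its own TSC, so its reading is causally after
the publisher's.  The difference it prints is mostly the IPC latency,
which is exactly what tscdrift ends up measuring.

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <x86intrin.h>

	static _Atomic uint64_t ref_tsc;	/* 0 == not yet published */

	static void *
	observer(void *arg)
	{
		uint64_t seen, mine;

		(void)arg;
		/* Spin until the reference event is visible. */
		while ((seen = atomic_load_explicit(&ref_tsc,
		    memory_order_acquire)) == 0)
			;
		_mm_lfence();		/* don't read the TSC early */
		mine = __rdtsc();
		printf("observer - publisher = %jd cycles\n",
		    (intmax_t)(mine - seen));
		return (NULL);
	}

	int
	main(void)
	{
		pthread_t td;

		pthread_create(&td, NULL, observer, NULL);
		_mm_lfence();
		atomic_store_explicit(&ref_tsc, __rdtsc(),
		    memory_order_release);
		pthread_join(td, NULL);
		return (0);
	}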
(*) Other subsystems with worse timestamps:
- KTR uses KTR_TIME, which defaults to get_cyclecount().  get_cyclecount()
  is by definition unusable for timestamps.  It was intended to work like
  an x86 TSC, back when the x86 TSC actually was a count of CPU cycles,
  with possibly-variable frequency and the hardware not trying very hard
  to make it comparable across CPUs.  Since then the TSC has been
  decoupled from CPU cycles, so it actually counts time cycles like its
  name indicates.  get_cyclecount() is still documented as counting CPU
  cycles and not guaranteeing anything except per-CPU monotonicity.  On
  x86, get_cyclecount() usually reduces to rdtsc() (with no fencing, but
  a layer or 2 of function calls).  But it can reduce to a timecounter,
  so it is unusable in KTR for another reason -- KTR needs to be able to
  timestamp low-level things like timecounters.  Since get_cyclecount()
  can be anything and might have a variable frequency, its current
  average frequency is unknown.  KTR provides no conversion functions to
  help the user interpret KTR logs with timestamps.
- kern_tslog.c:tslog() uses get_cyclecount() without even a macro for
  replacing it by a usable counter, and spells its variable 'tsc'.
- get_cyclecount() sometimes uses a timecounter.
- get_cyclecount() is also misused in:
  - dev/ocs_fc/ocs_utils.c (like kern_tslog.c)
  - dev/hwpmc/hwpmc_mod.c (fallback from rdtsc*() for non-x86; also uses
    rdtscp() if available, and is missing fencing if only rdtsc() is
    available)
  - dev/de/if_devar.h (it wants rdtsc(), but that is even less portable.
    nanouptime() is adequate for NIC drivers)
  - dev/random/* (this was the reason for the get_cyclecount() mistake.
    Randomness wants the opposite of a monotonic counter, but there is
    some noise in the low bits of x86 TSCs)
  - kern/subr_bus.c (like dev/random, but more justified)
  - netinet/sctp* (like dev/de)
- DELAY() is used at low levels where it is assumed to be safe.  But
  sometimes it uses a timecounter.
- cpu_ticks() is sometimes used instead of misusing get_cyclecount().
  This is better.  On x86, cpu_ticks() is the same as rdtsc() if the CPU
  supports rdtsc(), down to not fencing it, but with only amd64 inlining
  get_cyclecount() == rdtsc() and neither x86 arch inlining cpu_ticks()
  as rdtsc().  cpu_ticks() has frequency support.  The main problems with
  it are that the ticks can have a variable rate, no subsystem (including
  the cpu_ticks() subsystem) "integrates" this rate to get consistent
  times in seconds, and cpu_ticks() may also be a timecounter, so it may
  be unsafe to use in low-level contexts.  The following subsystems use
  cpu_ticks() instead of a timecounter or a hack for timestamps:
  - contrib/ena-com
  - dev/acpi/acpi_cpu.c.  An important use.
  - i386/include/cpu.h.  This implements get_cyclecount() as cpu_ticks(),
    since this is more portable than using rdtsc().
  - schedulers use cpu_ticks() since timecounters were too slow when they
    were the i8254.  Now even an HPET is still too slow, and only newer
    x86 can depend on having timecounters that are not too slow.
    Schedulers still use stathz ticks at a low frequency for actual
    scheduling since all hardware timers were too slow and many still are.

Other complications for reading TSCs:
- rdtscp is not very useful since it is only slightly better than
  lfence; rdtsc on Intel CPUs and mfence; rdtsc on AMD CPUs.  In theory,
  it could be significantly better since it doesn't use global fences.
  It also reads a CPU id atomically with reading the TSC.  This could be
  used for avoiding the interrupt masking in the above skew adjustment
  (see the sketch after the example below).
- rdtscp and fences before rdtsc only give "release semantics".  Another
  fence may be needed after rdtsc* to give "acquire" semantics -- to
  prevent moving stores before rdtsc*.  I think x86 strong memory
  ordering works just as well here.  E.g.:

	__asm __volatile("lfence");
	volatile uint64_t tsc = rdtsc();
#if 0
	/*
	 * Not needed since tsc is volatile to force the memory order of
	 * storing to tsc before storing to gate.  Of course, rdtsc() must
	 * be ordered before storing its result, so this prevents moving
	 * the store to gate before rdtsc().
	 */
	__asm __volatile("sfence");
#endif
	atomic_store_int(&gate, OPEN);
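The rdtscp variant mentioned above might look like this in the quoted
dtrace_gethrtime() (a sketch only; it assumes the kernel programs the
TSC_AUX MSR with the CPU id, which is not verified here):

	uint32_t lo, hi, aux;

	/* Read the TSC and CPU id atomically; no need to mask interrupts. */
	__asm __volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	tsc = ((uint64_t)hi << 32 | lo) - tsc_skew[aux];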
For fixing simplistic timing code of the form:

	start = rdtsc();
	do_work();
	finish = rdtsc();
	diff = finish - start;

old references say to use cpuid() to serialize.  cpuid() is slow.
cpuid() before the first rdtsc() doesn't mess up the result for a simple
test like this, but for general use for timestamps it is just slower, so
we should use a fence.  However, for timestamps it often doesn't matter
if the start timestamp is early, so we might be able to avoid the fence
too.  Then we need serialization something like the above to prevent the
second rdtsc() being moved into do_work().  A complete old method using
cpuid() is:

	cpuid();
	start = rdtsc();
	sfence();		/* maybe the wrong fence type */
	do_work();
	finish = rdtscp();
	sfence();		/* maybe the wrong fence type */
	cpuid();		/* just to be safe */
	diff = finish - start;

and I think this can be improved to:

	lfence();		/* mfence() on AMD */
	start = rdtsc();	/* volatile start for ordering */
	do_work();
	lfence();		/* perhaps more portable than rdtscp() */
	finish = rdtscp();	/* volatile finish for ordering */
	diff = finish - start;

and this should also make the acquire and release ordering more explicit
(a self-contained version using compiler builtins is sketched below).  It
assumes no TSC skew.  However, if we only want an upper bound for the
time, then we can optimize this by omitting some fences.  Omitting the
first fence only gives an earlier start and thus a larger diff.  The
first volatile, or a fence after the first rdtsc(), seems to be necessary
to prevent the first rdtsc() being moved after loads and stores in
do_work().  After do_work(), the reverse applies -- omitting the second
volatile only gives a later finish, while the second fence is needed to
prevent the second rdtsc() being moved too early.

For free-running timestamps, ordering like the above is neither necessary
nor sufficient.  The simplistic timing code is now:

	clock_gettime(CLOCK_MONOTONIC, &start);
	do_work();
	clock_gettime(CLOCK_MONOTONIC, &finish);
	diff = finish - start;		/* actually timespecsub() */

and this needs to work without the user writing any ordering directives.
"To work" doesn't seem to mean much in POSIX.  Despite clock_gettime()
being invented for the POSIX.1b realtime extensions, and these not
becoming very common before 2001, the specifiers apparently didn't dream
of relativistic causality problems for times.  I think all that is
required is for CLOCK_MONOTONIC to give monotonic times.  A useful
implementation should also guarantee program order for code like the
above despite it not having any explicit ordering.  This is not easy when
the CPUs running the code are separated by many light-seconds and the
clock can measure light-seconds.

The above is in a single thread.  Suppose this thread runs on only 1 CPU.
Then skews don't matter, and even fences aren't needed to get
monotonicity.  Sprinkling fences (both load and store) for each rdtsc()
also gives ordering for do_work().

Suppose the critical rdtsc()s in this thread run on different CPUs.  Then
skews matter.  Uncorrected or garbage skews larger than the do_work()
plus context switch time give negative or more than doubled differences.
Since this problem is rarely noticed, the skews must usually be not that
large.  I think the garbage skews in dtrace are smaller than normal
context switch time, since IPIs are faster than switches (except for VM
cases).  Errors from not using fences are probably small compared with
context switch times, thus never cause negative differences.
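As referenced above, a self-contained sketch of the improved sequence
using compiler builtins (an illustration with a made-up do_work(); the
lfence/rdtscp choices follow the reasoning above, and _mm_mfence() would
be wanted on AMD):

	#include <stdint.h>
	#include <stdio.h>
	#include <x86intrin.h>

	/* Placeholder workload; any code to be timed goes here. */
	static void
	do_work(void)
	{
		volatile uint64_t sink = 0;

		for (int i = 0; i < 1000000; i++)
			sink += i;
	}

	int
	main(void)
	{
		uint64_t start, finish;
		unsigned int aux;

		_mm_lfence();		/* _mm_mfence() on AMD */
		start = __rdtsc();
		do_work();
		_mm_lfence();		/* keep this read out of do_work() */
		finish = __rdtscp(&aux);
		printf("%ju cycles\n", (uintmax_t)(finish - start));
		return (0);
	}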
It is necessary to have fences on both sides of rdtsc() to ensure
ordering across CPUs.  Otherwise, in theory the rdtsc() for 'start' might
be so late and the rdtsc() for 'finish' might be so early that
finish < start.  For 3 timestamps instead of 2, the one in the middle
must have fences on both sides; thus for a general API, all cases must
have fences on both sides.  In practice, context switches won't be done
without a lot of stores which will order everything without explicit
fences.  Similarly for comparing times across threads, except it is skews
larger than the IPC time that cause negative differences.

Bruce