From owner-svn-src-head@freebsd.org Thu Mar 21 19:32:41 2019 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5D48D154C898; Thu, 21 Mar 2019 19:32:41 +0000 (UTC) (envelope-from markjdb@gmail.com) Received: from mail-it1-x143.google.com (mail-it1-x143.google.com [IPv6:2607:f8b0:4864:20::143]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id D5329733AC; Thu, 21 Mar 2019 19:32:40 +0000 (UTC) (envelope-from markjdb@gmail.com) Received: by mail-it1-x143.google.com with SMTP id w18so50896itj.4; Thu, 21 Mar 2019 12:32:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=sender:date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to:user-agent; bh=+/Jn17IkrjHrdzQA6BDiIsJxVjqIL7jQCGVltHf1gCA=; b=r2GbK6fX952KL2uVlGF6lTHUGVPVApTYlGDheJaqVMyXIUsKBCBqHGxXxTNeph/lBa JP2bXlXk5T4kizLOUxocUs8WvLh6NBLQKg4lrAf0W6CVkpWHAQqIKP3H793O9nfF/kCM p6sAyHnbJbZKMl7QOykE7EVfDH/wM+7OcT35yEqlPWIFTUnM6Ok3vUVvNCaXetCWZy5y BiJIxrs3m2ukQvY7Cbhcu2EQvVCCrk+9O6Z+sSNmy3kiNCfwt/YLjcC6iTeR8MBrCsn+ TRt3p7h/3FnJYpuId8E0KkqQA1rrcJM6diCJc1316QiSy8KDW8fZJ8kS60ZFHuUkaiRJ YPcg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:sender:date:from:to:cc:subject:message-id :references:mime-version:content-disposition:in-reply-to:user-agent; bh=+/Jn17IkrjHrdzQA6BDiIsJxVjqIL7jQCGVltHf1gCA=; b=hrX7EeE6vclFg73Nllf7oL00mw9fqaxBzHZnec0FOrk1z6xpG00NZmPIuhE3wEPc6S yh4pYrnWkf+IUIi84JFeoscMkGUOU3I73gJ8CT8TZKXhzVBca7lPcRs1xcojQo466Npc PqkoIburhmHzuuM0hPsdYqf8DtDfuLVIR2gUfEY2DnH5rsV1KStdqe2+rUc8ibl3df+P yjoHYUa5GlBqyGBOKc5mwzoTUp9TOi6i/JSfbDfocp3b06UnFWyKgoSPsbKadI49/fJP i3XbQJT73oU8+xGpYaWG9rfWPFutX5+iqWV+7GAyp6SCfF7Lka8P57EAd7Av6d0B3InE NPhQ== X-Gm-Message-State: APjAAAWxNI+VyqJAcOuLDH+ap5o6Jwj8JcfPO/I1EmMZOqm4rmAk23JL GI26BCNELbeinLBMDb3BH9Pya/ZePJA= X-Google-Smtp-Source: APXvYqxrzl0um4jjZmAJJKiZK5LWUkl+9JktuIRBw4UG5B+XpSEXdO83qi4hBYDXmS9yOODdmWBJnw== X-Received: by 2002:a02:3809:: with SMTP id b9mr2246942jaa.138.1553196760147; Thu, 21 Mar 2019 12:32:40 -0700 (PDT) Received: from raichu (toroon0560w-lp140-01-69-159-36-102.dsl.bell.ca. [69.159.36.102]) by smtp.gmail.com with ESMTPSA id n124sm252215itg.2.2019.03.21.12.32.38 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Thu, 21 Mar 2019 12:32:39 -0700 (PDT) Sender: Mark Johnston Date: Thu, 21 Mar 2019 15:32:37 -0400 From: Mark Johnston To: Bruce Evans Cc: rgrimes@freebsd.org, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r345359 - in head/sys/cddl/dev/dtrace: amd64 i386 Message-ID: <20190321193237.GF8186@raichu> References: <201903210252.x2L2qMSP022374@repo.freebsd.org> <201903210320.x2L3KIjc058579@gndrsh.dnsmgr.net> <20190321034917.GA8186@raichu> <20190321180738.P1024@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190321180738.P1024@besplex.bde.org> User-Agent: Mutt/1.11.3 (2019-02-01) X-Rspamd-Queue-Id: D5329733AC X-Spamd-Bar: ------ Authentication-Results: mx1.freebsd.org X-Spamd-Result: default: False [-6.98 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-1.000,0]; NEURAL_HAM_SHORT(-0.99)[-0.985,0]; REPLY(-4.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000,0] X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 21 Mar 2019 19:32:41 -0000 On Thu, Mar 21, 2019 at 10:44:49PM +1100, Bruce Evans wrote: > On Wed, 20 Mar 2019, Mark Johnston wrote: > > > On Wed, Mar 20, 2019 at 08:20:18PM -0700, Rodney W. Grimes wrote: > >>> Author: markj > >>> Date: Thu Mar 21 02:52:22 2019 > >>> New Revision: 345359 > >>> URL: https://svnweb.freebsd.org/changeset/base/345359 > >>> > >>> Log: > >>> Don't attempt to measure TSC skew when running as a VM guest. > >>> > >>> It simply doesn't work in general since VCPUs may migrate between > >>> physical cores. The approach used to measure skew also doesn't > >>> make much sense in a VM. > > I think it simply doesn't work. > > jhb's src/tools/tools/tscdrift/tscdrift.c also simply doesn't work for > measuring drift. It mostly measures IPC latency and jitter from > preemption in attempts to measure skew. It might be possible to > determine the current skew using a statistical analysis of the latency > and a few more measurements (measure latency between each CPU in a > pair instead of only from CPU0 to other CPUs), but it would be better > to do more direct measurements. > > >> "May" is the important aspect here, and to my knowledge both > >> bhyve and Vmware provide a way to pin cpus there should be > >> a way for us to turn this back on if it is desired. Sticking > >> it under the big knob vm_guest is probably not the most flexiable > >> solution. > >> > >> Could we please have someway to force this back on? > > > > Even with pinning the skew computation is bogus in a VM. On an idle > > host running bhyve with pinned vCPUs it gives offsets that are several > > orders of magnitude larger than on the host. I would prefer to see a > > specific reason for wanting the old behaviour before adding a knob. > > The method is bogus (buggy) out of a VM. > > XX static void > XX dtrace_gethrtime_init(void *arg) > XX { > XX struct pcpu *pc; > XX cpuset_t map; > XX int i; > XX #endif > XX > XX /* The current CPU is the reference one. */ > XX sched_pin(); > > It is also necessary to disable interrupts in hardware. Or maybe better > and necessary, detect when they occur and discard related samples. It > is impossible to disable NMIs. Not quite, just carry out the measurement from an NMI handler. :) > Then there are SMIs. Not to mention > virtualization. At best, it gives delays a bit like for SMIs, but longer. > > Since this is init code, perhaps interrupts are already masked in hardware, > but to detect drift in the skews, it is necessary to run a routine like this > quite often. It is worse: on amd64 and i386 dtrace is never compiled into the kernel (as you discovered), so the measurement is done whenever the KLDs happen to be loaded. > XX tsc_skew[curcpu] = 0; > XX CPU_FOREACH(i) { > XX if (i == curcpu) > XX continue; > XX > XX pc = pcpu_find(i); > XX CPU_SETOF(PCPU_GET(cpuid), &map); > XX CPU_SET(pc->pc_cpuid, &map); > XX > XX smp_rendezvous_cpus(map, NULL, > XX dtrace_gethrtime_init_cpu, > XX smp_no_rendezvous_barrier, (void *)(uintptr_t) i); > > Some synchronization is necessary, but this is very heavyweight. Even > the lightweight synchronization in tscdrift.c has an average latency of > about 150-200 cycles on freefall's Xeon: afer fixing bugs in it: > > CPU | TSC skew (min/avg/max/stddev) > ----+------------------------------ > 0 | 0 0 0 0.000 > 1 | 36 65 47913 101.227 # lower latency on same core > 2 | 144 180 52977 139.862 > 3 | 147 180 34077 113.543 > 4 | 135 181 49335 228.104 > 5 | 123 175 33300 91.703 > 6 | 165 188 129825 166.999 > 7 | 165 189 248010 278.321 > > tscdrift measures from CPU0 to CPU_n for each n. After fixing it to > read the TSC on CPU0 before doing an IPC to allow reading the TSC on > CPU_n, this measures mainly the IPC latency. Actually reading the TSC > takes about 24 cycles. The skew appears to be 0 for all CPUs. Anyway, > it is so much smaller than the IPC latency that it can be ignored, provided > it doesn't drift to become non-small, and applications don't try to be > too smart about correcting for the latency. > > The unfixed tscdrift allows the other CPU to read the TSC before reading > it itself. This is more like the above. It still ends up measuring mainly > IPC latency and other bugs in the program, since the read of the TSC on > the same CPU occurs relatively instantly without the other bugs, and out > of order (this possibly earlier than instantly) with the other bugs. The > IPC delay is about the same as above, but it takes more complicated > statistics to prove this. > > XX > XX tsc_skew[i] = tgt_cpu_tsc - hst_cpu_tsc; > > This has even more complicated ordering. The IPC delay depends on the > internals of smp_rendezvous_cpus(). > > It should work better to schedule comparisons of 2 CPUs using a third CPU > (this is possible except with precisely 2 CPUs). 2 CPUs sit waiting for > permission to read their TSC. Another CPU sets a variable to grant > permission. > > I plan to try a variation of this: all CPUs spin reading the TSC and > appending the result atomically to a log buffer. Then analyze the results. > With no skew or interrupts, it is expected that the results are almost > monotonic. Contention causes non-monotonicity only occasionally. The > average non-mononicity should be 0. If there is significant skew, then > the order for the sameples form each pair of CPUs will reverse on almost > every second reading. The skew is the adjustment needed to stop this > reversal. > > Logs of the TSC for a single CPU show that the granularity is about 18 > on freefall's Xeon. I don't know yet if this is more closely related to > the clock rate or the rdtsc latency. The TSC frequency is 3.3GHz. I > recently learned that a common implementation uses a shared clock of > frequency only 100 MHz. With no interpolation of this, the granularity > is 33 cycles. rdtsc itself takes about 18 cycles, so not much is gained > from any interpolation, but some seems to be done. > > On Haswell i4790K at 4 GHz, IIRC rdtsc seemed to take 24 cycles. I haven't > done the test to see if non-multiples of 24 are returned. A 100 MHz clock > would have a granularity of 40 cycles, so there is apparently some > interpolation. > > A commit log message by mav@ says that some Skylakes never return odd values > for rdtsc, and some appear to have a granularity of about 186 (?) cycles. > freefall's Xeon does return odd values for rdtsc. > > XX } > XX sched_unpin(); > XX } > XX ... > XX /* > XX * DTrace needs a high resolution time function which can > XX * be called from a probe context and guaranteed not to have > XX * instrumented with probes itself. > XX * > XX * Returns nanoseconds since boot. > XX */ > > dtrace does too much hand-rolling here, but it has to do something > to handle the 2 problems mentioned in the comment. Other subsystems > handle this worse (*). > > But it shouldn't try harder than the TSC timecounters to handle skew. > The TSC timecounter ignores the problem of skew, but does other things > more carefully. It comes from Solaris/illumos, which uses the following technique to measure skew during boot: AP i: while (!flag) tsc[i] = rdtsc(); BSP: flag = 1; mfence(); tsc = rdtsc(); skew[i] = tsc - tsc[i]; if (skew[i] < 0) skew[i] = -skew[i]; There are some heuristics that attempt to correct for the latency added by the mfence(). I don't think it makes much sense to measure skew on systems with an invariant TSC. If the TSC is not invariant, measuring it once at load time is likely not enough. Based on TSC-invariant systems I looked at, the measured skew is small enough that the erroneous deltas probably don't cause much harm. In VMs, the deltas are large enough that they create noticeable problems.