From owner-svn-src-all@freebsd.org Wed Mar 23 21:22:03 2016 Return-Path: Delivered-To: svn-src-all@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 652F8ADB6A0; Wed, 23 Mar 2016 21:22:03 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from bigwig.baldwin.cx (bigwig.baldwin.cx [IPv6:2001:470:1f11:75::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 27F7012DA; Wed, 23 Mar 2016 21:22:03 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from ralph.baldwin.cx (c-73-231-226-104.hsd1.ca.comcast.net [73.231.226.104]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 35368B93A; Wed, 23 Mar 2016 17:22:02 -0400 (EDT) From: John Baldwin To: Konstantin Belousov Cc: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org, "'rstone@freebsd.org'" Subject: Re: svn commit: r297039 - head/sys/x86/x86 Date: Wed, 23 Mar 2016 14:21:59 -0700 Message-ID: <2922763.uITxoCVqGR@ralph.baldwin.cx> User-Agent: KMail/4.14.3 (FreeBSD/10.2-STABLE; KDE/4.14.3; amd64; ; ) In-Reply-To: <20160323075842.GX1741@kib.kiev.ua> References: <201603181948.u2IJmndg063765@repo.freebsd.org> <1866602.Bp7VFd5f42@ralph.baldwin.cx> <20160323075842.GX1741@kib.kiev.ua> MIME-Version: 1.0 Content-Transfer-Encoding: 7Bit Content-Type: text/plain; charset="us-ascii" X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7 (bigwig.baldwin.cx); Wed, 23 Mar 2016 17:22:02 -0400 (EDT) X-BeenThere: svn-src-all@freebsd.org X-Mailman-Version: 2.1.21 Precedence: list List-Id: "SVN commit messages for the entire src tree \(except for " user" and " projects" \)" List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 23 Mar 2016 21:22:03 -0000 On Wednesday, March 23, 2016 09:58:42 AM Konstantin Belousov wrote: > On Mon, Mar 21, 2016 at 11:12:57AM -0700, John Baldwin wrote: > > On Saturday, March 19, 2016 05:22:16 AM Konstantin Belousov wrote: > > > On Fri, Mar 18, 2016 at 07:48:49PM +0000, John Baldwin wrote: > > > > > > > > - for (x = 0; x < delay; x += 5) { > > > > + for (x = 0; x < delay; x++) { > > > > if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) == > > > > APIC_DELSTAT_IDLE) > > > > return (1); > > > > - DELAY(5); > > > > + DELAY(1); > > > > } > > > > return (0); > > > > } > > > > > > Ideally we would structure the loop differently. I think it is more > > > efficient WRT latency to only block execution by ia32_pause(), and > > > compare the the getbinuptime() results to calculate spent time, on each > > > loop step. > > > > Yes. I've thought about using the TSC directly to do that, but folks are > > worried about the TSC being unstable due to vcpus in a VM migrating > > across physical CPUs. DELAY() does seem to DTRT in that case assuming the > > hypervisor doesn't advertise an invariant TSC via cpuid. We'd have to > > essentially duplicate DELAY() (really delay_tc()) inline. > > If TSC has the behaviour you described, i.e. suddenly jumping random > steps on single CPU, from the point of view of kernel, then the system > is seriosly misbehaving. The timekeeping stuff would be badly broken > regardless of the ipi_wait(). I do not see why should we worry about > that in ipi_wait(). > > I proposed slightly different thing, i.e. using timekeep code to indirect > to TSC if configured so. Below is the proof of concept patch, use of > nanouptime() may be too naive, and binuptime() would cause tiny bit less > overhead, but I do not want to think about arithmetic. As you noted, the issue is if a timecounter needs locks (e.g. i8254) though outside of that I think the patch is great. :-/ Of course, if the TSC isn't advertised as invariant, DELAY() is talking to the timecounter directly as well. However, I think we probably can use the TSC. The only specific note I got from Ryan (cc'd) was about the TSC being unstable as a timecounter under KVM. That doesn't mean that the TSC is non-mononotic on a single vCPU. In fact, thinking about this more I have a different theory to explain how the TSC can be out of whack on different vCPUs even if the hardware TSC is in sync in the physical CPUs underneath. One of the things present in the VCMS on Intel CPUs using VT-x is a TSC adjustment. The hypervisor can alter this TSC adjustment during a VM-exit to alter the offset between the TSC in the guest and the "real" TSC value in the physical CPU itself. One way a hypervisor might use this is to try to "pause" the TSC during a VM-exit by taking TSC timestamps at the start and end of a VM-exit and adding that delta to the TSC offset just before each VM-entry. However, if you have two vCPUs, one of which is running in the guest and one of which is handling a VM-exit in the hypervisor, the TSC on the first vCPU will run while the effective TSC of the second vCPU is paused. When the second vCPU resumes after a VM-entry, it's TSC will now "unpause", but it will lag the first vCPU by however long it took to handle its VM-exit. It wouldn't surprise me if KVM was doing this. bhyve does not do this to my knowledge (so the TSC is actually still usable as a timecounter under bhyve for some value of "usable"). However, even with this TSC pausing/unpausing, the TSC would still increase monotonically on a single vCPU. For the purposes of DELAY() (and other spin loops on a pinned thread such as in lapic_ipi_wait()), that is all you need. > diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c > index 7e5087b..fdc12a4 100644 > --- a/sys/x86/x86/local_apic.c > +++ b/sys/x86/x86/local_apic.c > @@ -1621,31 +1621,37 @@ SYSINIT(apic_setup_io, SI_SUB_INTR, SI_ORDER_THIRD, apic_setup_io, NULL); > * private to the MD code. The public interface for the rest of the > * kernel is defined in mp_machdep.c. > */ > + > +/* > + * Wait delay microseconds for IPI to be sent. If delay is -1, we > + * wait forever. > + */ > static int > native_lapic_ipi_wait(int delay) > { > - int x; > + struct timespec end, x; > > /* LAPIC_ICR.APIC_DELSTAT_MASK is undefined in x2APIC mode */ > if (x2apic_mode) > return (1); > > - /* > - * Wait delay microseconds for IPI to be sent. If delay is > - * -1, we wait forever. > - */ > - if (delay == -1) { > - while ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) != > - APIC_DELSTAT_IDLE) > - ia32_pause(); > - return (1); > + if (delay != -1) { > + KASSERT(delay > 0, ("wrong delay %d", delay)); > + x.tv_sec = delay / 1000000; > + x.tv_nsec = (delay % 1000000) * 1000; > + nanouptime(&end); > + timespecadd(&end, &x); > } > - > - for (x = 0; x < delay; x++) { > + for (;;) { > if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) == > APIC_DELSTAT_IDLE) > return (1); > - DELAY(1); > + ia32_pause(); > + if (delay != -1) { > + nanouptime(&x); > + if (timespeccmp(&x, &end, >)) > + break; > + } > } > return (0); > } -- John Baldwin