Date:      Wed, 23 Mar 2016 14:21:59 -0700
From:      John Baldwin <jhb@freebsd.org>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org, "'rstone@freebsd.org'" <rstone@freebsd.org>
Subject:   Re: svn commit: r297039 - head/sys/x86/x86
Message-ID:  <2922763.uITxoCVqGR@ralph.baldwin.cx>
In-Reply-To: <20160323075842.GX1741@kib.kiev.ua>
References:  <201603181948.u2IJmndg063765@repo.freebsd.org> <1866602.Bp7VFd5f42@ralph.baldwin.cx> <20160323075842.GX1741@kib.kiev.ua>

On Wednesday, March 23, 2016 09:58:42 AM Konstantin Belousov wrote:
> On Mon, Mar 21, 2016 at 11:12:57AM -0700, John Baldwin wrote:
> > On Saturday, March 19, 2016 05:22:16 AM Konstantin Belousov wrote:
> > > On Fri, Mar 18, 2016 at 07:48:49PM +0000, John Baldwin wrote:
> > > >  
> > > > -	for (x = 0; x < delay; x += 5) {
> > > > +	for (x = 0; x < delay; x++) {
> > > >  		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
> > > >  		    APIC_DELSTAT_IDLE)
> > > >  			return (1);
> > > > -		DELAY(5);
> > > > +		DELAY(1);
> > > >  	}
> > > >  	return (0);
> > > >  }
> > > 
> > > Ideally we would structure the loop differently. I think it is more
> > > efficient WRT latency to only block execution by ia32_pause(), and
> > > compare the getbinuptime() results to calculate the time spent, on each
> > > loop step.
> > 
> > Yes.  I've thought about using the TSC directly to do that, but folks are
> > worried about the TSC being unstable due to vcpus in a VM migrating
> > across physical CPUs.  DELAY() does seem to DTRT in that case assuming the
> > hypervisor doesn't advertise an invariant TSC via cpuid.  We'd have to
> > essentially duplicate DELAY() (really delay_tc()) inline.
> 
> If TSC has the behaviour you described, i.e. suddenly jumping random
> steps on single CPU, from the point of view of kernel, then the system
> is seriously misbehaving.  The timekeeping stuff would be badly broken
> regardless of the ipi_wait().  I do not see why we should worry about
> that in ipi_wait().
> 
> I proposed a slightly different thing, i.e. using the timekeeping code to
> indirect to the TSC if it is configured so.  Below is a proof-of-concept patch;
> the use of nanouptime() may be too naive, and binuptime() would cause a tiny
> bit less overhead, but I do not want to think about the arithmetic.

As you noted, the issue is when a timecounter needs locks (e.g. i8254), though
outside of that I think the patch is great. :-/  Of course, if the TSC
isn't advertised as invariant, DELAY() is talking to the timecounter
directly as well.
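
For reference, the non-invariant-TSC path of DELAY() is essentially a deadline
spin on the current timecounter, something like this (a rough sketch from
memory, not the literal delay_tc() code; 'usec' stands in for the requested
delay):

	/* Rough sketch of a timecounter deadline spin, not the real delay_tc(). */
	struct timecounter *tc = timecounter;
	uint32_t mask = tc->tc_counter_mask;
	uint64_t ticks = 0, target = (uint64_t)usec * tc->tc_frequency / 1000000;
	uint32_t last = tc->tc_get_timecount(tc) & mask;

	while (ticks < target) {
		uint32_t now = tc->tc_get_timecount(tc) & mask;

		ticks += (now - last) & mask;	/* handles counter wraparound */
		last = now;
		ia32_pause();
	}

So with a non-invariant TSC we already hammer the timecounter hardware from
DELAY() today.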

However, I think we probably can use the TSC.  The only specific note I got
from Ryan (cc'd) was about the TSC being unstable as a timecounter under KVM.
That doesn't mean that the TSC is non-monotonic on a single vCPU.  In fact,
thinking about this more I have a different theory to explain how the TSC
can be out of whack on different vCPUs even if the hardware TSC is in sync
in the physical CPUs underneath.

One of the things present in the VMCS on Intel CPUs using VT-x is a TSC
offset.  The hypervisor can adjust this offset during a VM-exit to change the
difference between the TSC seen by the guest and the "real" TSC value in the
physical CPU itself.  One way a hypervisor might use this is to try to
"pause" the TSC during a VM-exit by taking TSC timestamps at the start and
end of a VM-exit and adding that delta to the TSC offset just before each
VM-entry.  However, if you have two vCPUs, one of which is running in the
guest and one of which is handling a VM-exit in the hypervisor, the TSC on
the first vCPU will run while the effective TSC of the second vCPU is paused.
When the second vCPU resumes after a VM-entry, its TSC will now "unpause",
but it will lag the first vCPU by however long it took to handle its VM-exit.
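
To make that concrete, the bookkeeping I have in mind would look something
like this per vCPU (purely illustrative pseudo-hypervisor code; the field
names are made up):

	/* On VM-exit: remember when this vCPU left guest mode. */
	vcpu->tsc_at_exit = rdtsc();

	/* ... handle the exit; other vCPUs keep running in the guest ... */

	/* Just before VM-entry: hide the time spent handling the exit. */
	vcpu->tsc_offset -= rdtsc() - vcpu->tsc_at_exit;
	/*
	 * Since guest_tsc = host_tsc + tsc_offset, this vCPU's TSC did not
	 * advance across the exit, so it now lags any vCPU that stayed in
	 * the guest by the length of the exit.
	 */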

It wouldn't surprise me if KVM was doing this.  bhyve does not do this to my
knowledge (so the TSC is actually still usable as a timecounter under bhyve
for some value of "usable").  However, even with this TSC pausing/unpausing,
the TSC would still increase monotonically on a single vCPU.  For the purposes
of DELAY() (and other spin loops on a pinned thread such as in
lapic_ipi_wait()), that is all you need.
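
Concretely, for the bounded wait I'm imagining something along these lines
(untested sketch; it assumes tsc_freq has already been calibrated and that
the spinning thread stays pinned to one CPU):

	/* Untested sketch: bound the wait with a TSC deadline instead of DELAY(). */
	uint64_t end = 0;

	if (delay != -1) {
		KASSERT(delay > 0, ("wrong delay %d", delay));
		end = rdtsc() + (uint64_t)delay * tsc_freq / 1000000;
	}
	for (;;) {
		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
		    APIC_DELSTAT_IDLE)
			return (1);
		ia32_pause();
		if (delay != -1 && rdtsc() > end)
			return (0);
	}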

> diff --git a/sys/x86/x86/local_apic.c b/sys/x86/x86/local_apic.c
> index 7e5087b..fdc12a4 100644
> --- a/sys/x86/x86/local_apic.c
> +++ b/sys/x86/x86/local_apic.c
> @@ -1621,31 +1621,37 @@ SYSINIT(apic_setup_io, SI_SUB_INTR, SI_ORDER_THIRD, apic_setup_io, NULL);
>   * private to the MD code.  The public interface for the rest of the
>   * kernel is defined in mp_machdep.c.
>   */
> +
> +/*
> + * Wait delay microseconds for IPI to be sent.  If delay is -1, we
> + * wait forever.
> + */
>  static int
>  native_lapic_ipi_wait(int delay)
>  {
> -	int x;
> +	struct timespec end, x;
>  
>  	/* LAPIC_ICR.APIC_DELSTAT_MASK is undefined in x2APIC mode */
>  	if (x2apic_mode)
>  		return (1);
>  
> -	/*
> -	 * Wait delay microseconds for IPI to be sent.  If delay is
> -	 * -1, we wait forever.
> -	 */
> -	if (delay == -1) {
> -		while ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) !=
> -		    APIC_DELSTAT_IDLE)
> -			ia32_pause();
> -		return (1);
> +	if (delay != -1) {
> +		KASSERT(delay > 0, ("wrong delay %d", delay));
> +		x.tv_sec = delay / 1000000;
> +		x.tv_nsec = (delay % 1000000) * 1000;
> +		nanouptime(&end);
> +		timespecadd(&end, &x);
>  	}
> -
> -	for (x = 0; x < delay; x++) {
> +	for (;;) {
>  		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
>  		    APIC_DELSTAT_IDLE)
>  			return (1);
> -		DELAY(1);
> +		ia32_pause();
> +		if (delay != -1) {
> +			nanouptime(&x);
> +			if (timespeccmp(&x, &end, >))
> +				break;
> +		}
>  	}
>  	return (0);
>  }


-- 
John Baldwin


