Date:      Sat, 26 Mar 2016 03:17:43 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        John Baldwin <jhb@freebsd.org>, src-committers@freebsd.org,  svn-src-all@freebsd.org, svn-src-head@freebsd.org,  "'rstone@freebsd.org'" <rstone@freebsd.org>
Subject:   Re: svn commit: r297039 - head/sys/x86/x86
Message-ID:  <20160326021219.X911@besplex.bde.org>
In-Reply-To: <20160325084902.GH1741@kib.kiev.ua>
References:  <201603181948.u2IJmndg063765@repo.freebsd.org> <1866602.Bp7VFd5f42@ralph.baldwin.cx> <20160323075842.GX1741@kib.kiev.ua> <2922763.uITxoCVqGR@ralph.baldwin.cx> <20160324090917.GC1741@kib.kiev.ua> <20160325010649.H898@besplex.bde.org> <20160324162447.GD1741@kib.kiev.ua> <20160325060901.N2059@besplex.bde.org> <20160325084902.GH1741@kib.kiev.ua>

On Fri, 25 Mar 2016, Konstantin Belousov wrote:

> On Fri, Mar 25, 2016 at 07:13:54AM +1100, Bruce Evans wrote:
>> On Thu, 24 Mar 2016, Konstantin Belousov wrote:
> [Skipped lock adaptive spinning text for now].
>>
>>>> My systems allow speed variations of about 4000:800 = 5:1 for one CPU and
>>>> about 50:1 for different CPUs.  So the old method gave a variation of up
>>>> to 50:1.  This can be reduced to only 5:1 using the boot-time calibration.
>>> What do you mean by 'for different CPUs' ?  I understand that modern ESS
>>> can give us CPU frequency between 800-4200MHz, which is what you mean
>>> by 'for one CPU'.  We definitely do not care if 5usec timeout becomes
>>> 25usecs, since we practically never time-out there at all.
>>
>> Yes, I actually get 4400:800 on i4790K.
>>
>> The ratio is even larger than that with a hard-coded limit because old
>> CPUs are much slower than i4790K.  I sometimes run a 367 MHz (P2 class)
>> CPU.  It is several times slower than a new CPU at the same clock
>> frequency, and any throttling would make it even slower.
>>
>> 50 times slower means that a reasonable emergency timeout of 60 seconds
>> becomes 3000 seconds.  Local users would get tired of waiting and reset,
>> and remote users might have to wait.
> But you do not downclock a machine booted at the 4.0Ghz datasheet clock,
> down to 367Mhz. For 400Mhz P2 machine, LAPIC would be calibrated at that
> 400Mhz rate.

I was considering what happens with hard-coded (uncalibrated) timeout.

>> There is another thread about early DELAY() using the i8254 not working
>> to calibrate the TSC.  That might be just because DELAY() is interrupted.
>> DELAY() never bothered to disable interrupts.  Its early use for calibrating
>> the TSC depends on interrupts mostly not happening then.  (My version is
>> a bit more careful, but it still doesn't disable interrupts.  It
>> establishes error bounds provided interrupts are shorter than the i8254
>> wrap period.)  If the i8254 is virtual, then even disabling interrupts
>> on the target wouldn't help, since the disabling would only be virtual.
>
> Yes, the DELAY() calibration is something I wanted to ask about.
> Could you, please, take a look at
> https://reviews.freebsd.org/D5738
> there is a code which would benefit from better (re-)calibration.

I found that hard to read (using an old version of w3m, the UI is
horrible and the comments don't have enough context; then the old
version of w3m can't display files or diffs).

I use the following TSC calibration code in some kernels:

X static void
X xdel(int t0c, uint64_t *initial_tsc, uint64_t *final_tsc, int *delta_t0c)
X {
X 	int high, low, n, next, prev;
X 
X 	outb(TIMER_MODE, TIMER_SEL0 | TIMER_LATCH);
X 	*initial_tsc = rdtsc();
X 	low = inb(TIMER_CNTR0);
X 	high = inb(TIMER_CNTR0);
X 	prev = (high << 8) | low;
X 	for (n = 0; n < t0c; ) {
X 		outb(TIMER_MODE, TIMER_SEL0 | TIMER_LATCH);
X 		*final_tsc = rdtsc();
X 		low = inb(TIMER_CNTR0);
X 		high = inb(TIMER_CNTR0);
X 		next = (high << 8) | low;
X 		if (next <= prev)
X 			n += prev - next;
X 		else
X 			n += (timer0_max_count + prev) - next;
X 		prev = next;
X 	}
X 	*delta_t0c = n;
X }
X 
X static uint64_t
X tsc_calibrate(void)
X {
X 	uint64_t tsc_freq;
X 	uint64_t tscval[2];
X 	int xdelval;
X 
X 	xdel(100, &tscval[0], &tscval[1], &xdelval);
X 	xdel(timer_freq, &tscval[0], &tscval[1], &xdelval);
X 	tsc_freq = (tscval[1] - tscval[0]) * timer_freq / xdelval;
X 	if (1 || bootverbose)
X 		printf("TSC clock: %ju Hz\n", (uintmax_t)tsc_freq);
X 	if (1 || bootverbose)
X 		printf("raw: %ju %ju %d\n", (uintmax_t)tscval[0],
X 		    (uintmax_t)tscval[1], xdelval);
X 	return (tsc_freq);
X }

This uses the i8254.  xdel() is a specialized version of i8254 DELAY()
with getit() inline.  It returns the initial and final TSC values and
the number of i8254 counts that actually elapsed.  It doesn't handle
interrupts or any other source of large clock jitter.  What it measures
more precisely is the measurement overhead.  This is normally 2-5 usec.
Over a measurement interval of 1 second (timer_freq counts at about
1 MHz), a 5 usec error is about 5 ppm.  Compensating for this reduces
the error to below 1 ppm if there are no interrupts.

tsc_calibrate() calls xdel() twice to determine the measurement overhead.
It should be called one more time to warm up the cache.
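
The arithmetic behind the 5 ppm figure, and the reason for dividing by
the counts that actually elapsed, can be sketched as follows.  This is
only an illustration: overhead_ppm() and freq_from_ticks() are
hypothetical names that appear in no kernel source, and the sketch
assumes a fixed overhead and no interrupts.

```c
#include <stdint.h>

/*
 * Hypothetical helper: relative error, in parts per million, caused by
 * a fixed overhead when timing an interval (both in microseconds).
 */
static double
overhead_ppm(double overhead_usec, double interval_usec)
{
	return (overhead_usec / interval_usec * 1e6);
}

/*
 * Hypothetical helper: target clock frequency from the number of target
 * cycles seen while a reference clock running at ref_hz advanced by
 * ref_ticks.  Passing the ticks that really elapsed, not the ticks that
 * were requested, keeps loop overshoot out of the result.
 */
static uint64_t
freq_from_ticks(uint64_t cycles, uint64_t ref_hz, uint64_t ref_ticks)
{
	return (cycles * ref_hz / ref_ticks);
}
```

A 5 usec overhead on a 1 second interval gives overhead_ppm(5.0, 1e6)
of about 5.  freq_from_ticks() has the same shape as the tsc_freq
expression in tsc_calibrate() above.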

In other kernels, I use the following version using DELAY(), which is
good enough if DELAY() works and is not delayed by interrupts:

X diff -c2 ./x86/x86/tsc.c~ ./x86/x86/tsc.c
X *** ./x86/x86/tsc.c~	Sun Feb 14 21:56:28 2016
X --- ./x86/x86/tsc.c	Sun Feb 14 22:01:46 2016
X ***************
X *** 240,244 ****
X   {
X   	u_int regs[4];
X ! 	uint64_t tsc1, tsc2;
X 
X   	if (cpu_high >= 6) {
X --- 240,244 ----
X   {
X   	u_int regs[4];
X ! 	uint64_t tsc1, tsc2, tsc3;
X 
X   	if (cpu_high >= 6) {
X ***************
X *** 306,313 ****
X   	if (bootverbose)
X   	        printf("Calibrating TSC clock ... ");
X   	tsc1 = rdtsc();
X ! 	DELAY(1000000);
X   	tsc2 = rdtsc();
X ! 	tsc_freq = tsc2 - tsc1;
X   	if (bootverbose)
X   		printf("TSC clock: %ju Hz\n", (intmax_t)tsc_freq);
X --- 306,316 ----
X   	if (bootverbose)
X   	        printf("Calibrating TSC clock ... ");
X + 	DELAY(1000);
X   	tsc1 = rdtsc();
X ! 	DELAY(1000);
X   	tsc2 = rdtsc();
X ! 	DELAY(1000000);
X ! 	tsc3 = rdtsc();
X ! 	tsc_freq = tsc3 - tsc2 - (tsc2 - tsc1);
X   	if (bootverbose)
X   		printf("TSC clock: %ju Hz\n", (intmax_t)tsc_freq);
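
The idea of differencing a short and a long interval so that the fixed
DELAY() overhead cancels can be sketched in isolation.  The helper name
tsc_freq_from_samples() is hypothetical.  The sketch also divides by
the difference of the nominal delays; that scaling is an assumption of
the sketch and is not something the diff above does.

```c
#include <stdint.h>

/*
 * Hypothetical helper: frequency in Hz from three TSC samples taken
 * around a short delay (short_usec) and then a long delay (long_usec).
 * Any fixed per-call overhead is present in both intervals, so it
 * cancels when the short interval is subtracted from the long one.
 */
static uint64_t
tsc_freq_from_samples(uint64_t tsc1, uint64_t tsc2, uint64_t tsc3,
    uint64_t short_usec, uint64_t long_usec)
{
	uint64_t long_cycles, short_cycles;

	short_cycles = tsc2 - tsc1;
	long_cycles = tsc3 - tsc2;
	/* Assumption of this sketch: scale by the nominal difference. */
	return ((long_cycles - short_cycles) * 1000000 /
	    (long_usec - short_usec));
}
```

With a 3 GHz clock and 5 usec of overhead per delay, samples of 0,
3015000 and 3003030000 around delays of 1000 and 1000000 usec give
exactly 3000000000 Hz.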

See also kern_tc.c:cpu_tick_calibrate().  This is quite accurate after
fixing its bugs.  It gets accuracy by timing over 16 seconds instead of
1 and by using a timecounter which is assumed to be accurate.

See also tsccalib/tsccalib.c in my home directory on freefall.  This
is a refined version of the above.  It uses the time returned by
clock_gettime() as a reference.  It compensates for interrupts and
runs for long enough to get the specified accuracy.  If you only want
a low accuracy like 1 ppm, this takes 1.8 msec on freefall (this depends
a lot on the speed of clock_gettime(2) -- 1.8 msec is with the fast
TSC timecounter in libc).

The worst case for all of these methods is if the i8254 is the only
timer.  Then tsccalib takes much longer to get an accurate calibration
because the error reading the timer is about its access time which is
very large for the i8254.  The i8254 otherwise works perfectly for
calibration provided its wrapping is always detected.
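
The wrap detection this depends on is the branch inside the xdel() loop
above.  As a standalone sketch, with the hypothetical name
i8254_elapsed() and the reload value passed in instead of read from
timer0_max_count:

```c
/*
 * Hypothetical helper: counts elapsed between two reads of a
 * down-counter that reloads to max_count after reaching zero.
 * This is only right if the counter wrapped at most once between the
 * two reads, so the reads must be closer together than the wrap period.
 */
static int
i8254_elapsed(int prev, int next, int max_count)
{
	if (next <= prev)
		return (prev - next);		/* no wrap seen */
	return (max_count + prev - next);	/* wrapped once */
}
```

A second wrap between reads is invisible to this test, which is why a
long interrupt can silently lose a whole wrap period.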

> Below is the patch to implement calibration of the ipi_wait() busy loop.
> On my sandybridge 3.4Ghz, I get the message
> LAPIC: ipi_wait() us multiplier 37 (r 128652089678 tsc 3392383992)

This seems OK, but it might belong closer to DELAY().

> ...
> +	counter = lapic_ipi_wait_mult * delay;
> +	for (i = 0; i < counter; i++) {
> 		if ((lapic_read_icr_lo() & APIC_DELSTAT_MASK) ==
> 		    APIC_DELSTAT_IDLE)
> 			return (1);
> -		DELAY(1);
> +		ia32_pause();
> 	}

This part is basically DELAY() implemented as a simple loop with a
callback in the loop.  I don't like callbacks and prefer direct code
like the above.  The direct code is basically DELAY() implemented as
a simple loop and cloned to add a simple check on every iteration.
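
A userland sketch of that shape, with the check written directly in the
loop body and no callback, might look like this.  The name
spin_wait_flag() and the flag argument are hypothetical stand-ins for
the ICR status read; loops_per_usec plays the role of the calibrated
multiplier.

```c
#include <stdint.h>

/*
 * Hypothetical helper: poll *flag for at most loops_per_usec * usec
 * iterations.  Returns 1 as soon as the flag is seen set and 0 if the
 * iteration budget runs out first.
 */
static int
spin_wait_flag(const volatile int *flag, uint64_t loops_per_usec,
    uint64_t usec)
{
	uint64_t i, limit;

	limit = loops_per_usec * usec;
	for (i = 0; i < limit; i++) {
		if (*flag != 0)
			return (1);
	}
	return (0);
}
```

The timeout is only as accurate as the multiplier, which is the trade
being discussed here.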

If an error factor of 10 or so is acceptable, then the simple loop
is good enough for DELAY() too.  Or DELAY() can do:

 	while (n > 1000)
 		recalibrate_every_millisecond_while_reducing_n();
 	/*
 	 * We can't reasonably get better accuracy than a factor of 1000
 	 * for short delays, so don't try hard.  A single register read
 	 * can take over 100 cycles waiting for DMA and buses, and we
 	 * don't want to disable interrupts so the general case can
 	 * reasonably be delayed by several milliseconds for interrupt
 	 * handling.  We hope that the worst case in normal operation is
 	 * 1 quantum, giving an error factor of 100000 for DELAY(1).
 	 */
 	simple_loop();

Bruce


