Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 24 Apr 2009 22:20:33 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Juergen Lock <nox@jelal.kn-bremen.de>
Cc:        kalinoj1@iem.pw.edu.pl, freebsd-emulation@FreeBSD.org
Subject:   Re: Recent qemu and timers issue
Message-ID:  <20090424201623.N887@besplex.bde.org>
In-Reply-To: <20090423214701.GA83621@triton.kn-bremen.de>
References:  <200904032223.n33MNTiq019599@triton.kn-bremen.de> <200904072137.n37LbbdC071227@triton.kn-bremen.de> <20090423214701.GA83621@triton.kn-bremen.de>

next in thread | previous in thread | raw e-mail | index | archive | help
On Thu, 23 Apr 2009, Juergen Lock wrote:

> On Tue, Apr 07, 2009 at 11:37:37PM +0200, Juergen Lock wrote:
>> In article <200904062254.37824.kalinoj1@iem.pw.edu.pl> you write:
>>> Dnia sobota 04 kwietnia 2009 o 00:23:29 Juergen Lock napisa=C5=82(a):
>>>> In article <c948bb4de85d1b2a340ac63a7c46f6d9@iem.pw.edu.pl> you write:
>>> ...
>> OK I did some tests in the meantime on my newest box that has 4 cpus and
>> is running 7-stable: (host)
>>>>
>>>>> My logs and console are full of messages like this:
>>>>> calcru: runtime went backwards from 11248014678717988 usec to 119242 usec
>>>>> for pid 60 (ps)
>>>>>
>>>>> calcru: runtime went backwards from 11248014678921829 usec to 2516605 us=
>>> ec
>>>>> for pid 60 (ps)
>>>>> calcru: runtime went backwards from 7014844 usec to 6270851 usec for pid=
>>> 0
>> ...
>>>>> I tried to use all possible timers using sysctl, where I have:
>>>>> TSC(800) HPET(900) ACPI-safe(850) i8254(0) dummy(-1000000)
>>>>> None of these helped.

None of these are normally used for calculating runtimes.  Normally
on i386, the TSC is used.  The only way to configure this is to edit
the source code.  Try removing the calls to set_cputicker() in the MD
code.  Then the MI and timecounter-based cputicker tc_cpu_ticks() will
be used.  A better implementation would use a user-selectable
timecounter-based cputicker in all cases, but usually not the system
timecounter since that is likely to be very slow so as to be more
accurate.

>>> At this moment it's very hard for me to judge if it's qemu or kernel issue..
>>> Any thoughts?
>>
>>  Yeah, could be one or the other.
>
> I just noticed running any FreeBSD guest with bootverbose (boot -v,
> or menu item 5 in the default beastie menu) gives me lots of these:
> 	t_delta 16.0109662fcd2712a0 too long
> 	t_delta 16.014a1099ee815ca0 too long
> 	t_delta 16.0108d9c2ecd404c0 too long
> 	t_delta 16.0108d032cb334b60 too long
> 	t_delta 15.f58306afb53419a0 too short
> 	t_delta 16.010a762538a38680 too long
> etc.  My guess is this could be related...

As you probably know, virtualizing timers is hard to do correctly (see
vmware docs), so the problem is most probably partly in qemu.  The
cputicker code is much more fragile and/or broken than the timecounter
code, so imperfect virtualization is likely to cause more problems
with it.  The above t_delta printfs are a sign of problems.

It is not surprising that the recalibration code (cpu_tick_calibrate()
(sic)) has timing problems with virtual timers.  When it detects a
problem, it prints the above; then it continues using the old (possibly
wrong) frequency.  When it doesn't detect a problem, and the recalibrated
frequency would be larger, then it continues using the new (possibly
wrong) frequency.

When it doesn't detect a problem, and the recalibrated frequency would
be smaller, then it is just broken -- it continues using the old
(certainly wrong) frequency.  When miscalibrations are normal as appears
to be the case with qemu, this bug probably makes their effects less
obvious:
     It makes the miscalibrated frequencies monotonically increasing
     so they will soon converge to a limit and the frequency used will
     then be invariant until the next full calibration.  When the limit
     is infinity, as is caused messing up the timers completely (e.g.,
     by sitting at a debugger prompt for more than a couple of seconds)
     then other effects are fairly visible -- all user and sys times
     are broken (nearly 0). When the limit is only a few percent larger
     than the average, the user and sys times will be only a few percent
     lower, provided the difference between the limit and the actual
     (transient) frequency is not so large that calcru() sees problems
     and tries to fix them.  Fixing there is fundamentally impossible
     but in some cases the damage is small enough to hide.

In all cases, continuing with a wrong frequency will cause problems with
the runtime calculations.  Continuing with a right (but changed) frequency
is little better.  calcru() is fundamentally broken if the frequency
actually changes, since it applies the current frequency to ticks
accumulated at old frequencies.

The following patch fixes a couple of the problems in recalibration:

% diff -c2 ./kern/kern_tc.c~ ./kern/kern_tc.c
% *** ./kern/kern_tc.c~	Thu Mar 20 01:05:27 2008
% --- ./kern/kern_tc.c	Thu Mar 20 01:05:29 2008
% ***************
% *** 884,888 ****
%   		return;
% 
% ! 	getbinuptime(&t_this);
%   	c_this = cpu_ticks();
%   	if (t_last.sec != 0) {
% --- 884,888 ----
%   		return;
% 
% ! 	binuptime(&t_this);
%   	c_this = cpu_ticks();
%   	if (t_last.sec != 0) {

The existence of getbinuptime() is a bug, and its use here is especially
wrong.  We should use the current time, and binuptime() gives this as
well as possible.  The caller has just set the time cached for
getbinutime() to the current time, but this is not quite the same.
There is no locking here, except the implicit locking given by being
in a fast interrupt handler, so the time between the caching and the
use of the value is unbounded.  We may be interrupted even if we are
in a fast interrupt handler, depending on the implementation of fast
interrupt handlers.  Emulation is quite likely to result in fast
interrupt handlers being interrupted -- they are probably as "fast"
(really low-latency) as possible on the target, but they may be
interrupted on the host and then long delayed on the target in some
cases (hopefully not often).  So using binuptime() is only slightly
better here -- a delay may occur after the timecounter is read, no
matter where it is read.  Something like the following is needed:

%%%
 	binuptime(&start_bt);
 	start_tsc = rdtsc();
 	do {
 		binuptime(&sample_bt);
 		tsc = rdtsc();
 		binuptime(&bt);
 		bintime_sub(&bt, &sample_bt);
 		if (bintimecmp(&bt, &best_bt, <)) {
 			best_bt = bt;
 			best_sample_bt = sample_bt;
 			best_tsc = tsc;
 		}
 		bt = sample_bt;
 		bintime_sub(&bt, &start_bt);
 	} while (++nsamples < min_nsamples ||
 	    bintimecmp(&best_bt, min_error_btp, >));
%%%

Reads of the timer being calibrated (TSC here) are sandwiched between
reads of the reference timer and ones that can't be measured to within
a few parts per billion are discarded.

I never saw any effects from the above change on non-virtual systems.
Also, the sanity checks (which print the t_delta message) never fired
for the extremely messed up timers given by running ddb.  When running
ddb, normal timing stops working after a few seoncs or millseconds since
timecounters require interrupts to work.  However, the cputicker keeps
working precisely if it is the TSC.  So the cputicker : delta-binuptime
ratio soon becomes effectively infinite.  The sanity checking code seems
to always fail to detect this problem.

% ***************
% *** 922,931 ****
%   			c_delta <<= 20;
%   			c_delta /= divi;
% ! 			if (c_delta  > cpu_tick_frequency) {
% ! 				if (0 && bootverbose)
% ! 					printf("cpu_tick increased to %ju Hz\n",
% ! 					    c_delta);
% ! 				cpu_tick_frequency = c_delta;
% ! 			}
%   		}
%   	}
% --- 922,930 ----
%   			c_delta <<= 20;
%   			c_delta /= divi;
% ! 			if ((0 && bootverbose))
% ! 				printf(
% ! 			    "cpu_tick_frequency changed from %ju to %ju Hz\n",
% ! 				    cpu_tick_frequency, c_delta);
% ! 			cpu_tick_frequency = c_delta;
%   		}
%   	}

This fixes the bug which makes cpu_tick_frequency non-decreasing.

Another thing you can try here is to edit the source code to change
the set_cputicker() calls to say that the frequency is not variable.
I used this temporarily to work around the non-decreasing calibration.
This should be the default for emulators for most cputickers -- emulators
should emulate a constant frequency and not emulate the complexities
for pwoer saving.  This should probably be the default when the cputicker
is the i386 TSC and the TSC is P-state invariant.  Currently, P-state
invariance only causes turning off of cpufreq's adjustment of the TSC
as a timecounter.

Bruce



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20090424201623.N887>