From owner-freebsd-arch@FreeBSD.ORG  Sun Jun  3 10:49:46 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id 4E753106566B;
	Sun,  3 Jun 2012 10:49:46 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail01.syd.optusnet.com.au (mail01.syd.optusnet.com.au
	[211.29.132.182])
	by mx1.freebsd.org (Postfix) with ESMTP id 295408FC1E;
	Sun,  3 Jun 2012 10:49:43 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail01.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q53AnRFV000363
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Sun, 3 Jun 2012 20:49:29 +1000
Date: Sun, 3 Jun 2012 20:49:27 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Konstantin Belousov <kostikbel@gmail.com>
In-Reply-To: <20120603051904.GG2358@deviant.kiev.zoral.com.ua>
Message-ID: <20120603184315.T856@besplex.bde.org>
References: <CACfq090r1tWhuDkxdSZ24fwafbVKU0yduu1yV2+oYo+wwT4ipA@mail.gmail.com>
	<20120601193522.GA2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndC71=3Jo+BxQi==gCoLipBxj8X8XMBydjvrcKeGw+WOnA@mail.gmail.com>
	<20120602164847.GB2358@deviant.kiev.zoral.com.ua>
	<CAJ-FndAXFwuEspq+QeF0Hv1dr8JjREP=c=g3-abP=eoZ-D4hEg@mail.gmail.com>
	<CAJ-FndCpztSWyJo2hRVs5qu+vQOj9E1mPBhfVOxM_OC2eNac6A@mail.gmail.com>
	<20120602171632.GC2358@deviant.kiev.zoral.com.ua>
	<20120603063330.H3418@besplex.bde.org>
	<20120603051904.GG2358@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Gianni <gianni@FreeBSD.org>, Alan Cox <alc@rice.edu>,
	Alexander Kabaev <kan@FreeBSD.org>, Attilio Rao <attilio@FreeBSD.org>,
	Konstantin Belousov <kib@FreeBSD.org>, freebsd-arch@FreeBSD.org
Subject: Re: Fwd: [RFC] Kernel shared variables

On Sun, 3 Jun 2012, Konstantin Belousov wrote:

> On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote:
>> On Sat, 2 Jun 2012, Konstantin Belousov wrote:
>>> ...
>>> In fact, I think that if the whole goal is only fast clocks, then we
>>> do not need any additional system mechanisms, since we can easily export
>>> coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
>>> which is ugly but bearable.
>>
>> How do you get the timehands offsets?  These only need to be updated
>> every second or so, or when used, but how can the application know
>> when they need to be updated if this is not done automatically in the
>> kernel by writing to a shared page?  I can only think of the
>> application arranging an alarm signal every second or so and updating
>> then.  No good for libraries.
> What is timehands offsets ? Do you mean things like leap seconds ?

Yes.  binuptime() is:

% void
% binuptime(struct bintime *bt)
% {
% 	struct timehands *th;
% 	u_int gen;
% 
% 	do {
% 		th = timehands;
% 		gen = th->th_generation;
% 		*bt = th->th_offset;
% 		bintime_addx(bt, th->th_scale * tc_delta(th));
% 	} while (gen == 0 || gen != th->th_generation);
% }

Without the kernel providing th->th_offset, you have to do lots of ntp
handling for yourself (compatibly with the kernel) just to get an
accuracy of 1 second.  Leap seconds don't affect CLOCK_MONOTONIC, but
they do affect CLOCK_REALTIME which is the clock id used by
gettimeofday().  For the former, you only have to advance the offset
for yourself occasionally (compatibly with the kernel) and manage
(compatibly with the kernel, especially in the long term) ntp slewing
and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.

> This is indeed problematic for auxv. For auxv it could be solved by
> providing offset for next recheck using syscalls, and making libc code to
> respect this offset. But, I do think that vdso in shared page
> is the right solution, not auxv.

timehands in a shared page is close to working.  th_generation protects
things in the same way as in the kernel, modulo assumptions that writes
are ordered.
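
To make the write-ordering assumption concrete, here is a minimal userland
sketch of a generation-protected slot using C11 atomics.  The struct layout
and the names shared_th, shared_th_write() and shared_th_read() are
hypothetical; this is not the kernel's struct timehands and not a proposed
ABI.  The explicit fences stand in for the ordering that the kernel code
currently just assumes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exported slot; NOT the layout of the kernel's timehands. */
struct shared_th {
	_Atomic uint32_t gen;		/* 0 means "update in progress" */
	_Atomic uint64_t off_sec;	/* offset, whole seconds */
	_Atomic uint64_t off_frac;	/* offset, 64-bit binary fraction */
};

/* Writer side (the kernel, in the real design): invalidate, update, publish. */
static void
shared_th_write(struct shared_th *th, uint64_t sec, uint64_t frac)
{
	uint32_t g;

	assert(th != NULL);
	g = atomic_load_explicit(&th->gen, memory_order_relaxed);
	atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
	/* Keep the data stores from moving above the invalidation. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&th->off_sec, sec, memory_order_relaxed);
	atomic_store_explicit(&th->off_frac, frac, memory_order_relaxed);
	if (++g == 0)			/* never publish generation 0 */
		g = 1;
	atomic_store_explicit(&th->gen, g, memory_order_release);
}

/* Reader side: retry until the generation is nonzero and unchanged. */
static void
shared_th_read(struct shared_th *th, uint64_t *sec, uint64_t *frac)
{
	uint32_t g0, g1;

	assert(th != NULL);
	do {
		g0 = atomic_load_explicit(&th->gen, memory_order_acquire);
		*sec = atomic_load_explicit(&th->off_sec, memory_order_relaxed);
		*frac = atomic_load_explicit(&th->off_frac, memory_order_relaxed);
		/* Keep the data loads from moving below the recheck. */
		atomic_thread_fence(memory_order_acquire);
		g1 = atomic_load_explicit(&th->gen, memory_order_relaxed);
	} while (g0 == 0 || g0 != g1);
}
```

The reader loop has the same shape as the do/while in binuptime() above.
A reader that runs before the first shared_th_write() spins, since the
generation is still 0.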

>> rdtsc is also very unportable, even on CPUs that have it.  But all other
>> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
>> and as accurate as it is now.
> !rdtsc hardware is probably cannot be used at all due to need to provide
> usermode access to device registers. The mere presence of rdtsc does not
> means that usermode indeed can use it, it should be decided by kernel
> based on the current in-kernel time source. If rdtsc is not usable, the
> corresponding data should not be exported, or implementation should go
> directly into syscall or whatever.

But then applications would:
- use gettimeofday() more than they should ("it works on Linux"), even
   more than now, since then "it works on FreeBSD-x86" too
- just be slow when gettimeofday() is slow
- kludge around gettimeofday() being slow like they do now
- kludge around gettimeofday() being slow not like they do now (use more
   complications to probe it being slow).

I found some RedHat documentation for gettimeofday() in VDSO.  It seems
to leave it to the sysadmin to "tune" gettimeofday() using a boot
parameter to configure gettimeofday() being accurate/slow, less-accurate/
less-slow, or inaccurate/fast.  A per-process parameter would be more
correct and harder to use (add mounds of autoconfig and runtime code
in every program[mer] that cares to detect and use it).

> In fact, I would be very grateful if an expert in time-keeping provided
> concise description of the algorithm for translating rdtsc output into
> struct timeval, also enumerating required parameters.

See above.  You just scale

     tc_delta(th) == (uint32_t)(rdtsc() - rdtsc_offset) when th is for TSC,

using a carefully managed fixed point scale factor.  The delta is
reduced to 32 bits so that the scaling can be efficient.  The result
is a bintime fraction which is added to a bintime offset.  Both offsets
are even more carefully managed, and everything is protected by
th_generation, and for optimality there are multiple timehands so that
th_generation very rarely changes underneath you.  The resulting bintime
is then converted to a timeval or timespec as required.  This gives
uptimes.  Another offset is added for real times.  Times in seconds are
handled more directly; it is assumed that time_t is atomic so that
th_generation is not needed for protecting them.
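
A rough userland sketch of that arithmetic follows.  struct bt stands in
for struct bintime, and the names scale_for(), bt_addx(),
uptime_from_counter() and bt_to_timeval() are made up for illustration.
The scale is taken as 2^64 / frequency with no ntp adjustment, which is
a simplifying assumption, and the generation check is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Stand-in for struct bintime: seconds plus a 64-bit binary fraction. */
struct bt {
	int64_t sec;
	uint64_t frac;
};

/* Fixed-point scale: about 2^64 / freq, computed without overflowing. */
static uint64_t
scale_for(uint64_t freq_hz)
{
	assert(freq_hz != 0);
	return ((((uint64_t)1 << 63) / freq_hz) * 2);
}

/* Add a fraction to a bt, carrying into the seconds on wraparound. */
static void
bt_addx(struct bt *b, uint64_t x)
{
	uint64_t old = b->frac;

	b->frac += x;
	if (old > b->frac)
		b->sec++;
}

/*
 * Uptime = offset + scale * delta, with delta reduced to 32 bits.
 * "now" would come from rdtsc() when the timecounter is the TSC.
 */
static struct bt
uptime_from_counter(struct bt offset, uint64_t scale, uint64_t now,
    uint64_t counter_offset)
{
	uint32_t delta = (uint32_t)(now - counter_offset);

	bt_addx(&offset, scale * delta);
	return (offset);
}

/* Convert to a timeval using the top 32 bits of the fraction. */
static struct timeval
bt_to_timeval(struct bt b)
{
	struct timeval tv;

	tv.tv_sec = b.sec;
	tv.tv_usec = ((uint64_t)1000000 * (uint32_t)(b.frac >> 32)) >> 32;
	return (tv);
}
```

For example, with a hypothetical 1 GHz counter, an offset of 10 seconds
and a delta of 500000000 ticks, the result is 10 seconds plus roughly
half a second, rounded down slightly by the fixed-point truncation.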

The TSC frequency is limited to about 4 GHz, so the above tc_delta()
works for about 4 seconds after rdtsc_offset is updated.  But the
bintime fraction only works for 1 second.  If either of these wraps,
then the result is still later than the update time; however, it
may be earlier than a previous result.  So the update must occur at
least once per second for the TSC.  Otherwise, negative time
differences occur (the final result is in advance of th_offset since
the bintime fraction is >= 0, but will be before a previous final
result if the bintime fraction wraps).  Negative time differences are
worse than lost "ticks" that cause all results to be in the past.
The updates are broken by at least stopping in ddb and perhaps by
suspend/resume.  The correct fix is probably to update (or just zap)
the timecounter as the first step of resuming from ddb or sleep (this
must be done before any other timecounter call).  Note that times
going backwards cannot be detected in binuptime(), etc., since to detect
it you would have to write the previous time, but that would require
pessimal locking that is intentionally left out.
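
The bintime fraction wrap can be shown in isolation.  This is only a
sketch: frac_after() and goes_backwards() are hypothetical helpers, the
scale is again assumed to be plain 2^64 / frequency, and the 1 GHz
counter in the example is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fraction added to the offset for a given delta.  The 64-bit product
 * wraps once delta exceeds about one second's worth of ticks.
 */
static uint64_t
frac_after(uint64_t freq_hz, uint32_t delta)
{
	uint64_t scale;

	assert(freq_hz != 0);
	scale = (((uint64_t)1 << 63) / freq_hz) * 2;
	return (scale * delta);		/* arithmetic is modulo 2^64 */
}

/*
 * Nonzero if a read taken later_delta ticks after the last update
 * yields an earlier time than a read taken earlier_delta ticks after it.
 */
static int
goes_backwards(uint64_t freq_hz, uint32_t earlier_delta,
    uint32_t later_delta)
{
	return (frac_after(freq_hz, later_delta) <
	    frac_after(freq_hz, earlier_delta));
}
```

With a 1 GHz counter and no update for 1.5 seconds, a read at
delta = 1500000000 produces a smaller fraction than an earlier read at
delta = 900000000, so the second time is before the first even though
both are still at or after th_offset.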

Timecounter internals like th_offsets are currently private in kern_tc.c.
I don't like exposing them for this, or cloning them for FFCLOCK.

Bruce