Date: Tue, 5 Mar 2013 21:43:30 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: Davide Italiano, arch@FreeBSD.org
Subject: Re: tickless design guidelines
In-Reply-To: <5135B934.5060202@FreeBSD.org>
Message-ID: <20130305204500.T1059@besplex.bde.org>
References: <20130305080134.GC13187@onelab2.iet.unipi.it> <5135AFAD.70408@FreeBSD.org>
 <20130305090735.GB18221@onelab2.iet.unipi.it> <5135B934.5060202@FreeBSD.org>

On Tue, 5 Mar 2013, Alexander Motin wrote:

> On 05.03.2013 11:07, Luigi Rizzo wrote:
>> Also i wonder if it may make sense to add a feature so that whenever
>> we get an interrupt and a fast and suitable timecounter is available,
>> some system-wide bintime is updated.

bintime is intentionally updated as little as possible, since updating it
is an expensive operation that would need even more expensive locking if
it were done often.

Note that get*time() is fundamentally broken, since the timestamps that it
returns are incoherent with timestamps returned by non-get*time().

I used to sync get*time() with non-get*time() on every call to the latter.
E.g., on every call to bintime(), time_second, timehands->th_microtime and
timehands->th_nanotime are updated with the time just read by binuptime().
getbintime(), etc., use the possibly-updated values.  This makes
getbintime(), etc., more accurate if calls to them are mixed with calls to
bintime(), but this is a side effect.  The sync is done for correctness.
It ensures that the time is monotonic when read using any mixture of
get*time() and non-get*time() APIs.  Without it, file timestamps written
using the value of time_second or getnanotime() are noticeably broken when
compared with the real time read using gettimeofday().  But this is slow,
and phk didn't like it.  I only did it for the !SMP cases and dropped it
after a few years.  Your suggestion is similar.
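
The incoherence can even be seen from userland, since the get*time()
family is exported through the *_FAST clock ids.  The following test
program is only an illustration (it is not part of the old patch), and it
assumes CLOCK_MONOTONIC_FAST is served from the timehands like
getnanouptime() while CLOCK_MONOTONIC does a full timecounter read like
nanouptime():

#include <stdio.h>
#include <time.h>

int
main(void)
{
        struct timespec precise, fast;
        int i;

        for (i = 0; i < 1000000; i++) {
                clock_gettime(CLOCK_MONOTONIC, &precise);
                /* A "fast" timestamp taken later can still be earlier. */
                clock_gettime(CLOCK_MONOTONIC_FAST, &fast);
                if (fast.tv_sec < precise.tv_sec ||
                    (fast.tv_sec == precise.tv_sec &&
                    fast.tv_nsec < precise.tv_nsec)) {
                        printf("fast clock is behind the precise clock\n");
                        return (0);
                }
        }
        printf("no incoherence observed in this run\n");
        return (0);
}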
Updating the system-wide bintime requires knowing the current bintime, and
nothing much more or less than a very recent read of a hardware
timecounter gives that.  The difficulty is atomically updating the whole
timecounter state with values derived from that.

>> So getbinuptime() could just use this counter, becoming extremely
>> cheap (perhaps it is already, i am not sure) and in the long term,
>> as CPUs with fixed frequency TSC become ubiquitous,
>> we would get higher resolution as the interrupt load increases.

getbinuptime() is already extremely cheap, but imprecise and incoherent.
If you assume a fixed-frequency TSC, then you can just always use
bintime().  Unfortunately, the same CPUs that have a fixed-frequency TSC
also have slow-to-read TSCs.  The TSC read time increased from ~9 cycles
on AthlonXP and Athlon64 to 42 on Phenom, and on Core2 it is even slower.
9 cycles were almost in the noise, so getbintime() was only slightly
faster than bintime() on Athlon64.

> Each time timer interrupt fires, or CPU timers "switched" between idle
> and active modes present time is fetched. The only question is how to
> store it somewhere safely. It was difficult for 96/128-bit struct
> bintime. It is easier for 64-bit sbintime_t, but still not all archs are
> able to do it atomically. The mechanism used for getbinuptime() now may
> not scale well to the very high update rates.

getbinuptime() with high update rates would reduce to a bad version of
binuptime().  Actually, getbinuptime() would probably be unchanged.  The
updates would be in binuptime().  You could update only every 10 or 100
usec to keep the update rate down, but then you still don't get coherency.
Actually^2, timekeeping is unimportant for timeouts.  Thus you could use
sloppy non-atomic updates for everything and barely notice losing the
races for this.

To get semi-coherent and more accurate (when the races aren't lost) times
in sbinuptime(), just record a delta-time on every hardware timecounter
call.  Arrange to call hardware timecounters fairly often, and add the
delta time in sbinuptime().  Better not touch getbinuptime(), etc.  Let
them remain mostly race-free and incoherent.  (A sketch of this delta idea
follows the old patch below.)

Here is my old syncing code:

@ diff -c2 src/sys/kern/kern_tc.c~ src/sys/kern/kern_tc.c
@ *** src/sys/kern/kern_tc.c~	Fri Mar 5 17:03:07 2004
@ --- src/sys/kern/kern_tc.c	Fri Mar 5 17:03:08 2004
@ ***************
@ *** 172,175 ****
@ --- 224,254 ----
@   }
@
@ + #define SYNC_TIME
@ +
@ + #if defined(SYNC_TIME) && !defined(SMP)
@ + static void
@ + sync_time(long sec, long nsec, long usec)

nanotime() and microtime() call here with the time that they just read.
I don't bother syncing full bintimes, since this was mainly intended to
fix file timestamps.

@ + {
@ + 	register_t saveintr;
@ +
@ + 	saveintr = intr_disable();

Locking for !SMP only.

@ + 	if (time_second < sec)
@ + 		time_second = sec;
@ + 	if (timehands->th_microtime.tv_sec < sec ||
@ + 	    (timehands->th_microtime.tv_sec == sec &&
@ + 	    timehands->th_microtime.tv_usec < usec)) {
@ + 		timehands->th_microtime.tv_sec = sec;
@ + 		timehands->th_microtime.tv_usec = usec;
@ + 	}
@ + 	if (timehands->th_nanotime.tv_sec < sec ||
@ + 	    (timehands->th_nanotime.tv_sec == sec &&
@ + 	    timehands->th_nanotime.tv_nsec < nsec)) {
@ + 		timehands->th_nanotime.tv_sec = sec;
@ + 		timehands->th_nanotime.tv_nsec = nsec;
@ + 	}

The secondary values are now coherent for use in getnanotime(), etc.
sbintime() support could set the final value here instead of a delta.
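
As a rough sketch of that delta idea (hypothetical code meant to live in
kern_tc.c, where struct timehands is visible; note_tc_read(), sbt_extra
and sbinuptime_sloppy() are made-up names, and it assumes less than a
second passes between timehands windups so the delta fits in 32 bits):

/*
 * Remember how far the hardware timecounter had advanced past the
 * current timehands generation the last time anyone read it.  The
 * value is the high 32 bits of a 64-bit binary fraction of a second,
 * i.e., the fractional part of an sbintime_t.
 */
static volatile uint32_t sbt_extra;

/* Call wherever the hardware timecounter has just been read. */
static void
note_tc_read(struct timehands *th, u_int count)
{
        uint64_t frac;

        frac = th->th_scale *
            ((count - th->th_offset_count) & th->th_counter->tc_counter_mask);
        /* A plain 32-bit store; a lost race only loses a little accuracy. */
        sbt_extra = (uint32_t)(frac >> 32);
}

/* Semi-coherent, race-tolerant reader. */
static sbintime_t
sbinuptime_sloppy(void)
{

        return (bttosbt(timehands->th_offset) + (sbintime_t)sbt_extra);
}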
The reason to use a delta is that it fits in 32 bits, so it is easy to
access atomically and the write doesn't need much locking.  Oops, the
intr_disable() locking in sync_time() is inadequate even for !SMP -- the
updates may be non-monotonic since the lock is not around the whole
timecounter-read...update sequence.

@ + 	intr_restore(saveintr);
@ + }
@ + #endif /* SYNC_TIME && !SMP */
@ +
@   void
@   bintime(struct bintime *bt)
@ ***************
@ *** 189,192 ****
@ --- 268,274 ----
@   	bintime(&bt);
@   	bintime2timespec(&bt, tsp);
@ + #if defined(SYNC_TIME) && !defined(SMP)
@ + 	sync_time(tsp->tv_sec, tsp->tv_nsec, tsp->tv_nsec / 1000);
@ + #endif
@   }
@

Any call to bintime() syncs time_second and the timespec and timeval in
the timehands, but not the full bintime.  So, for example, getnanotime()
benefits from bintime() being called, but getbintime() remains incoherent
with bintime().

Bruce