From owner-cvs-all@FreeBSD.ORG  Mon Nov 28 18:51:59 2005
Return-Path: <owner-cvs-all@FreeBSD.ORG>
X-Original-To: cvs-all@FreeBSD.org
Delivered-To: cvs-all@FreeBSD.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 411C216A422;
	Mon, 28 Nov 2005 18:51:59 +0000 (GMT)
	(envelope-from phk@critter.freebsd.dk)
Received: from phk.freebsd.dk (phk.freebsd.dk [130.225.244.222])
	by mx1.FreeBSD.org (Postfix) with ESMTP id C7F5943D68;
	Mon, 28 Nov 2005 18:51:51 +0000 (GMT)
	(envelope-from phk@critter.freebsd.dk)
Received: from critter.freebsd.dk (unknown [192.168.48.2])
	by phk.freebsd.dk (Postfix) with ESMTP id 92EA7BC66;
	Mon, 28 Nov 2005 18:51:48 +0000 (UTC)
To: Robert Watson <rwatson@FreeBSD.org>
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
In-Reply-To: Your message of "Sun, 27 Nov 2005 01:03:59 GMT."
	<20051127005622.H81764@fledge.watson.org> 
Date: Mon, 28 Nov 2005 19:51:48 +0100
Message-ID: <5744.1133203908@critter.freebsd.dk>
Sender: phk@critter.freebsd.dk
Cc: cvs-src@FreeBSD.org, src-committers@FreeBSD.org, cvs-all@FreeBSD.org
Subject: Re: cvs commit: src/sys/sys time.h src/sys/kern kern_time.c 
X-BeenThere: cvs-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: CVS commit messages for the entire tree <cvs-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-all>,
	<mailto:cvs-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/cvs-all>
List-Post: <mailto:cvs-all@freebsd.org>
List-Help: <mailto:cvs-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/cvs-all>,
	<mailto:cvs-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 28 Nov 2005 18:51:59 -0000


This is a joint reply to all that has piled up in my mail-box on
this topic while I was being Robert Watson at EuroBSDcon2005[1].

First Bruce:

>(1) tc_windup() has no explicit locking, so it can run concurrently
>     on any number of CPUs, with N-1 of the CPUs calling it via
>     clock_settime() and 1 calling it via hardclock (this one may also
>     be the same as one already in it).  I doubt that the generation
>     stuff is enough to prevent problems here, especially with bug (2).

It is not as severe as you try to make it sound, but a mutex would
probably be in order at some point.

>(2) The generation count stuff depends on writes as seen by other CPUs
>     being ordered.  This is not true on many current CPUs.  E.g., on
>     amd64, writes are ordered as seen by readers on the current CPU,
>     but they may be held in a buffer and I think the buffer can be
>     written in any order to main memory.  I think this only gives a
>     tiny race window.  There is a mutex lock in all (?) execution paths
>     soon after tc_windup() returns, and this serves to synchronize writes.

Yes, a write barrier has been on my to-do list for some time here.
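
For readers who have not seen the pattern, here is a minimal sketch of a
generation-count protected snapshot with the barriers made explicit.  It
is an illustration only, not the kernel's timecounter code: the struct
and function names are hypothetical, it uses C11 atomics rather than
kernel primitives, and it assumes a single writer at a time (which is
what a mutex around the update would provide).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical snapshot for illustration; not the kernel's timehands. */
struct snap {
	atomic_uint gen;		/* 0 means "update in progress" */
	_Atomic uint64_t sec;
	_Atomic uint64_t frac;
};

static void
snap_init(struct snap *s)
{
	atomic_init(&s->sec, 0);
	atomic_init(&s->frac, 0);
	atomic_init(&s->gen, 1);
}

/* Single writer assumed (for example, serialized by a mutex). */
static void
snap_write(struct snap *s, uint64_t sec, uint64_t frac)
{
	unsigned next = atomic_load_explicit(&s->gen, memory_order_relaxed) + 1;

	if (next == 0)		/* skip the reserved value on wrap-around */
		next = 1;
	atomic_store_explicit(&s->gen, 0, memory_order_relaxed);
	/* Barrier: order the gen = 0 store before the data stores. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&s->sec, sec, memory_order_relaxed);
	atomic_store_explicit(&s->frac, frac, memory_order_relaxed);
	/* Release store: order the data stores before the new generation. */
	atomic_store_explicit(&s->gen, next, memory_order_release);
}

/* Lock-less reader: retry if a writer was active during the copy. */
static void
snap_read(struct snap *s, uint64_t *sec, uint64_t *frac)
{
	unsigned g;

	do {
		g = atomic_load_explicit(&s->gen, memory_order_acquire);
		*sec = atomic_load_explicit(&s->sec, memory_order_relaxed);
		*frac = atomic_load_explicit(&s->frac, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
	} while (g == 0 ||
	    g != atomic_load_explicit(&s->gen, memory_order_relaxed));
}
```

Without the two ordering points in snap_write(), a reader on another CPU
could observe the new generation before the new data, which is the race
window described in (2).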

Your observations about how out of whac^H^H^H^Hstep things are these
days are seconded, apart from the bit about deliberately making it worse.

The fact that your own fix cost 8% in performance very much supports
my opinion that any attempt to speed it up by adding complexity is
doomed from the start.

Then Robert (on programs with event engines):

Yes, event engines have an issue here and yes a fast 1/HZ clock
would be nice, but if we also move in the direction of a precise
timeout using HPET-like hardware for deadline interrupting, then
1/HZ will probably be lowered significantly and it will almost
certainly no longer be the number we are looking for.  That is why
I clipped the get*time() family to aim for "up to 1 ms" precision.

> BTW, simple loopback network testing seems to dramatically confirm that 
> the impact of time measurement and context switching is quite significant. 

This is why I decided a long time ago to implement timestamps in a
way that would not require or trigger context switches.  Getting
timestamps is a lock-less process, provided you have non-neandertal
hardware (i.e. almost anything but the i8254 timecounter).

With respect to the timekeeping inherent in the context-switch, I think
we have a consensus on redefining CPU seconds in times(2) to something
sensible when faced with variable CPU clock rate, and that should
hopefully lower the cost of context switches.  I hope to spew out
a proof of concept patch this week.

Then Bruce on event engines:

>I can see a use for making a timestamp after select() returns, not for
>timeout purposes since the timeout should normally be for emergencies and
>it's relative so it doesn't need the current time, but just to record when
>things happen.

This is unfortunately too simplistic a view of event engines.  If timeouts
were uniformly long, we could ignore the runtime of the program's event
handlers, but this is not the case in practice.

I've looked a lot at this in the ISC eventlib (bind8), but there is
no way to save one timestamp per iteration without getting creeping
imprecision in the timer-controlled events.
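
To make the effect concrete, here is a toy model in C.  It is not
eventlib code; the names and the simulated microsecond clock are
assumptions made purely for illustration.  It compares a relative
timeout armed from the timestamp cached when select() returned with
one armed from a fresh timestamp after the handlers have run.

```c
#include <stdint.h>

/* Hypothetical model: microseconds on a simulated clock. */
struct loop {
	int64_t now;		/* true current time */
	int64_t cached;		/* timestamp taken when select() returned */
};

/* Take the single per-iteration timestamp. */
static void
loop_wakeup(struct loop *l)
{
	l->cached = l->now;
}

/* Running a handler advances real time but not the cached timestamp. */
static void
loop_run_handler(struct loop *l, int64_t cost_us)
{
	l->now += cost_us;
}

static int64_t
deadline_cached(const struct loop *l, int64_t timeout_us)
{
	return (l->cached + timeout_us);
}

static int64_t
deadline_fresh(const struct loop *l, int64_t timeout_us)
{
	return (l->now + timeout_us);
}

/* How early a timer armed after cost_us of handler work would fire. */
static int64_t
early_by(int64_t cost_us, int64_t timeout_us)
{
	struct loop l = { 0, 0 };

	loop_wakeup(&l);
	loop_run_handler(&l, cost_us);
	return (deadline_fresh(&l, timeout_us) -
	    deadline_cached(&l, timeout_us));
}
```

In this model early_by(3000, 5000) is 3000: after 3 ms of handler work,
a 5 ms timeout armed from the cached timestamp fires 3 ms early, which
is 60% of the interval.  The same 3 ms against a 60 s timeout would be
lost in the noise, which is why uniformly long timeouts would let us
ignore it.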

>The environment variable (or a sysctl/sysconf variable like vfs.timestamp_
>precision but per-process or per-user) is probably needed, since you don't
>want to teach all applications about unportable CLOCK_*.

This was my first suggestion as well.  I will however defer to
anybody who is going to actually fix the ports.


Poul-Henning


[1] Yes, great conference, you missed out.  We beat OpenBSD approx
2:1 on the beer drinking contest and it seems the only reason DF
didn't have an empty glass was a couple of "non-judgemental"
participants who did one for all of the five projects :-)

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.