From owner-freebsd-stable@FreeBSD.ORG  Mon Sep 21 19:12:14 2009
Date: Mon, 21 Sep 2009 11:59:59 -0700 (PDT)
From: Matthew Dillon <dillon@apollo.backplane.com>
Message-Id: <200909211859.n8LIxxZv028784@apollo.backplane.com>
To: stable@freebsd.org, Peter Wemm <peter@wemm.org>
References: <20090906155154.GA8283@onelab2.iet.unipi.it>
	<e7db6d980909061736p4affc054k3fa5070214adc2f8@mail.gmail.com>
	<20090907072159.GA18906@onelab2.iet.unipi.it>
	<6F002A04-5CF9-466F-AEFB-6B983C0E1980@mac.com>
Cc: 
Subject: Re: incorrect usleep/select delays with HZ > 2500

    What we wound up doing was splitting tvtohz() into two functions.

    tvtohz_high(tv)

	The returned value meets or exceeds the requested time.  The minimum
	return value is 1, but only for {0,0}; otherwise the minimum is 2.

    tvtohz_low(tv)

	The returned value might be shorter than the requested time, and 0 can
	be returned.

    Most kernel functions use the tvtohz_high() function.  Only a few
    use tvtohz_low().

    I have not found any 'good' solution to the problem.  For example,
    average-up errors can mount up when using the results to control a
    callout timer, resulting in much longer delays than originally intended,
    and similarly same-tick interrupts (e.g. a value of 1) can create
    much shorter delays than expected.  Sometimes one cares more about
    the average interval being correct; other times the delay must not
    be allowed to be too short.  You lose no matter what you choose.
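
    As a rough illustration of how the two choices drift apart (this is
    a userland sketch, not the kernel code): the helper names below are
    hypothetical, hz = 1000 is an assumed value, and each re-armed delay
    is assumed to last the full converted tick count, which is the worst
    case for the round-up variant.

```c
#include <assert.h>

/* Hypothetical helpers, not the kernel functions.  The "high" variant
 * rounds up and adds one guard tick; the "low" variant truncates. */
static long us_to_ticks_high(long us, long tick_us)
{
    assert(us >= 0 && tick_us > 0);
    return (us + (tick_us - 1)) / tick_us + 1;
}

static long us_to_ticks_low(long us, long tick_us)
{
    assert(us >= 0 && tick_us > 0);
    return us / tick_us;
}

/* Total microseconds consumed when the same delay is re-armed `rounds`
 * times and every round lasts exactly the converted number of ticks
 * (an assumption; a real callout fires somewhere inside that window). */
static long total_delay_us(long (*conv)(long, long),
                           long us, long tick_us, int rounds)
{
    long total = 0;

    for (int i = 0; i < rounds; i++)
        total += conv(us, tick_us) * tick_us;
    return total;
}
```

    With a 1000uS tick, re-arming a 1500uS delay 100 times asks for
    150000uS in total; under these assumptions the round-up variant can
    take up to 300000uS and the truncating variant 100000uS.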

	http://fxr.watson.org/fxr/source/kern/kern_clock.c?v=DFBSD

    If you look at tvtohz_high() you will note that the minimum value
    of 1 is only returned if the passed tv is essentially {0,0}, i.e. 0uS.
    A request of 1uS yields 2 ticks: ((us + (tick - 1)) / tick) + 1.  The
    'tick' global here is the number of uS per tick (not to be confused
    with 'ticks').
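
    The arithmetic above can be tried out in userland with a simplified
    sketch.  The sketch_* names and the fixed TICK_US value (hz = 1000)
    are assumptions for illustration; the real functions in
    kern_clock.c also have to deal with overflow for large tv_sec
    values, which this version ignores.

```c
#include <assert.h>
#include <sys/time.h>   /* struct timeval */

/* Assumed for illustration: hz = 1000, so one tick is 1000 uS. */
enum { TICK_US = 1000 };

/* Flatten a timeval to microseconds.  No overflow handling: only
 * suitable for the small intervals used in this example. */
static long tv_to_us(const struct timeval *tv)
{
    assert(tv->tv_sec >= 0 && tv->tv_usec >= 0 && tv->tv_usec < 1000000);
    return (long)tv->tv_sec * 1000000L + (long)tv->tv_usec;
}

/* Round up to a whole tick, then add one more tick so the delay can
 * never end before the requested time has passed. */
static long sketch_tvtohz_high(const struct timeval *tv)
{
    return (tv_to_us(tv) + (TICK_US - 1)) / TICK_US + 1;
}

/* Truncate: the result may undershoot the request and may be 0. */
static long sketch_tvtohz_low(const struct timeval *tv)
{
    return tv_to_us(tv) / TICK_US;
}
```

    With these definitions {0,0} gives 1 from the high variant, while
    {0,1} already gives 2 from the high variant and 0 from the low one.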

    Because of all of that I decided to split the function to make the
    requirements more apparent.

    --

    The nanosleep() work is a different issue... that's for userland calls
    (primarily the libc usleep() function).  We found that some linux
    programs assumed that nanosleep() was far more fine-grained than (hz)
    and, anyway, the calls are named 'nanosleep' and 'usleep', which
    kind of implies a fine-grained sleep, so we turned it into one when
    small time intervals were being requested.

	http://fxr.watson.org/fxr/source/kern/kern_time.c?v=DFBSD

    The way I figure it, if a userland program wants to make system calls
    with fine-grained sleeps that are too small, it's really no different
    from treating that program as being cpu-bound anyway, so why not try to
    accommodate it?
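
    A userland analogy of that idea, for illustration only: the
    fine_sleep_ns() helper and its tick_ns threshold are hypothetical,
    and busy-waiting on CLOCK_MONOTONIC is an assumed stand-in, not a
    description of what the kernel code in kern_time.c does.  Requests
    shorter than one tick are handled by burning cpu, which is the
    "treat it as cpu-bound" trade-off in miniature.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static int64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Hypothetical helper: sleep for at least `ns` nanoseconds.  Requests
 * of one tick or more go through the normal (coarse) nanosleep() path;
 * anything shorter is satisfied by polling the clock. */
static void fine_sleep_ns(int64_t ns, int64_t tick_ns)
{
    assert(ns >= 0 && tick_ns > 0);
    int64_t deadline = now_ns() + ns;

    if (ns >= tick_ns) {
        struct timespec req = {
            .tv_sec = (time_t)(ns / 1000000000LL),
            .tv_nsec = (long)(ns % 1000000000LL),
        };
        nanosleep(&req, NULL);  /* may return early on a signal */
    }

    /* Poll until the deadline; this also covers an early return above. */
    while (now_ns() < deadline)
        ;
}
```

    For example, fine_sleep_ns(200000, 1000000) waits about 200uS by
    polling, where a purely tick-based sleep at hz = 1000 could not
    resolve an interval that short.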

    --

    The 8254 issue is more one of a lack of interest in fixing it.
    Basically using the 8254 as a measure of realtime when the reload
    value is set too small (i.e. high hz) will always lead to serious
    timing problems.  The reason there is such a lack of interest
    in fixing it is that most machines have other timers available
    (lapic, acpi, hpet, tsc, etc).  A secondary issue might be tying
    real-time functions to 'ticks', which could still be driven by the
    8254 interrupt.... those have to be divorced from ticks.  I'm not
    sure if FreeBSD has any of those left (does date still skip quickly if
    hz is set ultra-high?  Even when other timers are available?).

    I will note that tying real-time functions to the hz-based tick
    function (which is also the 8254-driven problem when other timers
    are not available) leads to serious problems, particularly with ntpd,
    even if you only lose track of the full cycle of the timer
    occasionally.
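
    Why a lost cycle is invisible can be shown with a simplified model
    of a reloading down-counter.  This is a sketch, not driver code:
    the function names are hypothetical, the counter is modelled as
    running through 0..reload-1, and the reload value of 1193 is an
    assumption that roughly corresponds to hz = 1000 on the 1193182 Hz
    8254 input clock.

```c
#include <assert.h>
#include <stdint.h>

/* Counts elapsed between two reads of a down-counter that wraps back
 * to reload-1 after reaching 0.  The answer is only right if fewer
 * than `reload` counts passed between the two reads. */
static uint32_t counts_between(uint32_t prev, uint32_t now, uint32_t reload)
{
    assert(reload > 0 && prev < reload && now < reload);
    return (prev >= now) ? prev - now : prev + reload - now;
}

/* Model of the hardware: the value the counter shows after `elapsed`
 * true counts, starting from `start`.  Whole cycles leave no trace. */
static uint32_t counter_after(uint32_t start, uint64_t elapsed,
                              uint32_t reload)
{
    assert(reload > 0 && start < reload);
    uint64_t partial = elapsed % reload;

    return (uint32_t)((start + reload - partial) % reload);
}
```

    If 1243 counts really pass between two reads with reload = 1193,
    the counter looks exactly as it would after 50 counts, so one whole
    cycle of time silently disappears.  The smaller the reload value
    (the higher hz), the shorter the stall needed for that to happen.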

    However, neither do you want to 'skip' the ticks value to catch up
    to a lost interrupt.  That will mess up tsleep() and other hz-based
    timeouts that assume that values of '2' will not time out
    instantly.
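
    A toy model of that failure, with hypothetical names (this is not
    the tsleep() or callout implementation): a timeout is stored as an
    absolute ticks deadline, and the caller relies on a delay of 2
    surviving at least one full tick.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: arm a timeout `delay_ticks` from now and return
 * the absolute deadline in ticks. */
static unsigned arm_timeout(unsigned now_ticks, int delay_ticks)
{
    assert(delay_ticks >= 0);
    return now_ticks + (unsigned)delay_ticks;
}

/* True once the ticks counter has reached or passed the deadline.  The
 * subtraction is done unsigned and then reinterpreted as signed so the
 * comparison keeps working when the counter wraps. */
static bool timeout_expired(unsigned now_ticks, unsigned deadline)
{
    return (int)(now_ticks - deadline) >= 0;
}
```

    Armed at ticks = 1000 with a delay of 2, the deadline is 1002.  A
    normal interrupt moves ticks to 1001 and the sleeper keeps waiting,
    but if a single interrupt "catches up" by bumping ticks straight to
    1002, the timeout fires after what may be almost no real time.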

    So actual realtime operations really do have to be completely divorced
    from the hz-based ticks counter, and the ticks counter must only be
    used for looser timing needs such as protocol timeouts and sleeps.

						-Matt