From: Matthew Dillon <dillon@apollo.backplane.com>
To: John Baldwin <jhb@freebsd.org>
Cc: Robert Watson, freebsd-arch@freebsd.org
Date: Wed, 29 Nov 2006 14:43:48 -0800 (PST)
Message-Id: <200611292243.kATMhmaY048753@apollo.backplane.com>
References: <10814.1164829546@critter.freebsd.dk> <200611291650.51782.jhb@freebsd.org>
Subject: Re: a proposed callout API

:Hmm, I guess that depends on what you consider tick_t to be.  I was thinking
:of it as an abstract type for a deadline, and that absolute and relative are
:sort of like subclasses of that.  Doing it that way allows you to defer on
:absolute times rather than requiring whole new APIs.  I assume you mean
:passing a different flag to msleep(), cv_*(), sema_wait(), lockmgr(), etc.
:that all take 'int timo'?
:If you allow it to be encoded into the tick_t,
:then adding support for it just requires a new function to generate a tick_t
:object and the consuming code has to learn how to handle it, but all the
:in-between stuff doesn't care and doesn't have to know.
:...
:John Baldwin

    We have something called sysclock_t that is very similar.  These are
    the issues I encountered while implementing it:

    * I would recommend that the low-level interfaces always operate on a
      deadline, because it simplifies the code enormously and makes it
      easy to detect negative-time events (which have to be turned into
      NOPs, minimally-timed events, or looped events instead of sleeps).

    * If you intend to support negative numbers to mean relative times,
      the translation to a deadline should be done at the highest level.
      Lower levels should ONLY operate on deadlines.  For example, if you
      change msleep() over to use your tick_t, then allowing a negative
      number to indicate a relative time should be handled by msleep() and
      NOT by APIs at a deeper level than msleep().

      This is particularly important, I think, because you do not want to
      get into the business of having to check what kind of timeout you
      are dealing with in every single API layer as you push down into the
      system timer, and you don't want to get into the business of having
      to re-read the current timestamp from hardware over and over again
      (see more about this later).

    * The maximum amount of time that can be represented by tick_t is an
      issue.  I decided that rather than try to represent any amount of
      time, I would stick to a 10-second limitation for our sysclock_t,
      and only those APIs that required fine-grained timing (the SYSTIMER
      API) would ever actually *USE* sysclock_t.  This created a very
      explicit, well-defined, and extremely visible 'wall' between
      fine-grained timing APIs and coarse-grained timing APIs.  I believe
      it made the related source code far more readable as well.
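    The "translate at the highest level" point above can be sketched
    roughly as follows.  This is an illustrative fragment, not actual
    FreeBSD or DragonFly code; the names (tick_t, timeout_to_deadline)
    and the "negative means relative" convention are assumptions taken
    from the discussion:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t tick_t;         /* assumed 32-bit fine-grained time */

/*
 * 'now' is the current absolute time, read once by the top-level API
 * (e.g. a hypothetical msleep() wrapper).  A negative 'timo' means
 * "relative"; it is converted to an absolute deadline here and nowhere
 * else, so every layer below only ever sees deadlines.
 */
static tick_t
timeout_to_deadline(tick_t now, tick_t timo)
{
    if (timo < 0)
        return (now + (-timo));     /* relative -> absolute deadline */
    return (timo);                  /* already an absolute deadline */
}
```

    A top-level sleep API would call this once and pass only the
    resulting deadline downward; no lower layer ever needs to re-check
    which kind of timeout it was handed.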
      This also allowed me to use a 32-bit integer to represent the
      fine-grained time, which saved a great deal of memory for data
      storage all over the system.

    * I wanted an absolute fine-grained timestamp that was visible to
      upper layers but could be represented in 32 bits.  The fact that the
      API was explicitly defined to cover only 10 seconds worth of
      (absolute or relative) time made manipulation of the timestamp
      extremely well defined.  I also required that rollover work as
      expected (that you could subtract a rolled-over absolute timestamp
      from a prior timestamp and still get the correct delta time, within
      the 10-second limitation).  Having the APIs explicitly assume a 2's
      complement rollover made all the coding easy.  This worked extremely
      well in all respects.

    * The frequency didn't matter, as long as the requirements of the
      timestamp were met (i.e. 10 seconds within its data type).  This
      made manipulation of the timestamp very easy.  In particular,
      certain common frequencies (like 1 Hz) could simply be cached in
      globals.

    * All 'active' storage of the timestamp was encapsulated and handled
      by the API (in our case, one-shot and periodic SYSTIMERs).  This
      theoretically allowed timebase changes, but the more I consider the
      problem the more I believe that a *BETTER* way to handle timebase
      changes is to simply use a frequency fixed at boot time and
      translate at the hardware interface.  It turns out that if you use a
      single abstracted timebase type with a boot-time-fixed frequency in
      your APIs, the only place you actually have to translate it to the
      hardware timer is when you are actually reading or writing the
      hardware timer (more on this in the next point).

      In particular, take this example again with msleep(.... ticks).  In
      any system there will be many threads sleeping with a timeout, so
      the timer queue would look something like this (abstracted, this is
      not a data structure):

      [NEXT TIMEOUT] -> [LATER TIMEOUT] -> [LATER TIMEOUT] -> ...
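    The 2's-complement rollover arithmetic mentioned above can be
    sketched as follows (illustrative only, not the actual sysclock_t
    implementation): deltas are computed in unsigned arithmetic and
    reinterpreted as signed, so they stay correct across 2^32 wrap as
    long as the real difference fits inside the 10-second window.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t tick_t;

/*
 * Signed delta between two absolute timestamps.  The unsigned
 * subtraction wraps modulo 2^32, so the result is correct even when
 * 'later' has rolled over past 'earlier'.
 */
static tick_t
tick_delta(tick_t later, tick_t earlier)
{
    return ((tick_t)((uint32_t)later - (uint32_t)earlier));
}

/* A deadline has passed when the signed delta is non-negative. */
static int
tick_expired(tick_t now, tick_t deadline)
{
    return (tick_delta(now, deadline) >= 0);
}
```

    For example, tick_delta(16, (tick_t)0xFFFFFFF0) yields 32 even
    though the earlier timestamp sits just below the rollover point.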
      You can basically store the timeouts as absolute tick_t's without
      having to translate them to the hardware timer resolution.  Just do
      everything at tick_t's boot-time-fixed resolution.

    * It turns out that timer events which require reloading are almost
      always synchronized with the callback related to the timer event.
      So instead of forcing the callback procedure to 'read' the exact
      timestamp (which requires translation from the hardware timer
      source), we simply add a sysclock_t (tick_t in your case) argument
      to the callout procedure so the current time is already available to
      it.

      We only have to calculate the translation of the hardware timer when
      the actual hardware interrupt occurs, or when a newly installed
      (one-shot or periodic reload) timeout has a smaller count than
      timeouts already queued.  Considering other overheads, even a
      significant number of mathematical operations to do the translation
      are a drop in the bucket if you only have to do them once per timer
      interrupt.  This almost completely removed all extraneous
      multiplications and divisions from the critical code paths.

    Consider the deadline vs relative timestamp formats, and the APIs that
    allow either or both, very carefully.  Normalizing everything to a
    deadline in lower-level APIs reaps *HUGE* benefits.  But also consider
    very carefully any intent to remove coarse timestamp APIs.  I
    personally believe that BOTH fine-grained and coarse-timestamp APIs
    are needed.

    You could extend your tick_t abstraction to support both coarse and
    fine-grained APIs, as well as relative and absolute timestamps, by
    eating more bits out of tick_t (maybe don't use negative numbers in
    that case), but I have to caution, again, that any such representation
    should be translated to a deadline, and that low-level fine-grained
    APIs should remain separate from coarse-grained APIs.  This is the old
    'how long do I want to msleep() for' problem... it can be fine-grained
    (microseconds) or coarse-grained (minutes, even).
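    The pass-the-current-time-to-the-callback idea can be sketched like
    this.  The structure and function names are hypothetical (the real
    SYSTIMER structures differ); the point is that the hardware timer is
    read and translated ONCE per interrupt, and that value is handed to
    every expired callback rather than being re-read by each one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t sysclock_t;

/* Hypothetical callout record, kept in a list sorted by deadline. */
struct systimer {
    struct systimer *next;
    sysclock_t       deadline;       /* absolute, boot-time-fixed freq */
    sysclock_t       period;         /* 0 for one-shot */
    void           (*func)(struct systimer *, sysclock_t now);
};

/*
 * Dispatcher: 'now' was translated from the hardware timer exactly once
 * (at the hardware interrupt) and is passed straight to each callback.
 */
static void
systimer_dispatch(struct systimer **headp, sysclock_t now)
{
    struct systimer *st;

    while ((st = *headp) != NULL &&
           (sysclock_t)((uint32_t)now - (uint32_t)st->deadline) >= 0) {
        *headp = st->next;
        st->func(st, now);              /* current time passed in */
        if (st->period != 0) {
            st->deadline += st->period; /* drift-free periodic reload */
            /* re-insertion into the sorted queue omitted here */
        }
    }
}

/* tiny demo callback for the usage example below */
static int fired;
static void
demo_cb(struct systimer *st, sysclock_t now)
{
    (void)st; (void)now;
    fired++;
}
```

    With two queued one-shots at deadlines 50 and 200, dispatching at
    time 100 fires only the first; the second stays queued, and neither
    callback ever touched the hardware timer.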
    It probably isn't useful to represent frequencies greater than a few
    megahertz, but the limitation is not so much the highest frequency you
    want to represent as the maximum amount of time you want to represent
    in the data type.  A 32-bit integer with a 10-second limitation could
    represent intervals down to around 4.7 ns (2^31 rather than 2^32, so
    you can detect deadlines which have passed).  Since CPUs are now
    improving speed laterally (more cores) rather than in raw per-core
    processing power, and since a thread switch still takes at least
    500 ns even on the fastest CPU, I don't think this is an issue.

						-Matt
						Matthew Dillon