From: Matthew Dillon <dillon@apollo.backplane.com>
To: John Baldwin <jhb@freebsd.org>
Cc: Robert Watson, freebsd-arch@freebsd.org
Date: Wed, 29 Nov 2006 14:43:48 -0800 (PST)
Message-Id: <200611292243.kATMhmaY048753@apollo.backplane.com>
References: <10814.1164829546@critter.freebsd.dk> <200611291650.51782.jhb@freebsd.org>
Subject: Re: a proposed callout API

:Hmm, I guess that depends on what you consider tick_t to be.  I was thinking
:of it as an abstract type for a deadline, and that absolute and relative are
:sort of like subclasses of that.  Doing it that way allows you to defer on
:absolute times rather than requiring whole new APIs.  I assume you mean
:passing a different flag to msleep(), cv_*(), sema_wait(), lockmgr(), etc.
:that all take 'int timo'?
:If you allow it to be encoded into the tick_t,
:then adding support for it just requires a new function to generate a tick_t
:object and the consuming code has to learn how to handle it, but all the
:in-between stuff doesn't care and doesn't have to know.
:...
:John Baldwin

    We have something called sysclock_t that is very similar.  These are
    the issues I encountered while implementing it:

    * I would recommend that the low-level interfaces always operate on a
      deadline, because it simplifies the code enormously and makes it
      easy to detect negative-time events (which have to be turned into
      NOPs, minimally-timed events, or looped events instead of sleeps).

    * If you intend to support negative numbers to mean relative times,
      the translation to a deadline should be done at the highest level.
      Lower levels should ONLY operate on deadlines.  For example, if you
      change msleep() over to use your tick_t, then allowing a negative
      number to indicate a relative time should be handled by msleep() and
      NOT by APIs at a deeper level than msleep().

      This is particularly important, I think, because you do not want to
      get into the business of having to check what kind of timeout you
      are dealing with in every single API layer as you push down into the
      system timer, and you don't want to get into the business of having
      to re-read the current timestamp from hardware over and over again
      (see more about this later).

    * The maximum amount of time that can be represented by tick_t is an
      issue.  I decided that rather than try to represent any amount of
      time, I would stick to a 10-second limitation for our sysclock_t,
      and only those APIs that required fine-grained timing (the SYSTIMER
      API) would ever actually *USE* sysclock_t.  This created a very
      explicit, well-defined, and extremely visible 'wall' between
      fine-grained timing APIs and coarse-grained timing APIs.  I believe
      it made the related source code far more readable as well.
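    The "translate at the highest level" point above can be sketched
    roughly as follows.  This is an illustrative fragment, not actual
    FreeBSD or DragonFly code; the names (tick_t, timeout_to_deadline)
    and the "negative means relative" convention are assumptions taken
    from the discussion:

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t tick_t;         /* assumed 32-bit fine-grained time */

/*
 * 'now' is the current absolute time, read once by the top-level API
 * (e.g. a hypothetical msleep() wrapper).  A negative 'timo' means
 * "relative"; it is converted to an absolute deadline here and nowhere
 * else, so every layer below only ever sees deadlines.
 */
static tick_t
timeout_to_deadline(tick_t now, tick_t timo)
{
    if (timo < 0)
        return (now + (-timo));     /* relative -> absolute deadline */
    return (timo);                  /* already an absolute deadline */
}
```

    A top-level sleep API would call this once and pass only the
    resulting deadline downward; no lower layer ever needs to re-check
    which kind of timeout it was handed.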
      This also allowed me to use a 32-bit integer to represent the
      fine-grained time, which saved a great deal of memory for data
      storage all over the system.

    * I wanted an absolute fine-grained timestamp that was visible to
      upper layers but could be represented in 32 bits.  The fact that the
      API was explicitly defined to cover only 10 seconds worth of
      (absolute or relative) time made manipulation of the timestamp
      extremely well defined.  I also required that rollover work as
      expected (that you could subtract a rolled-over absolute timestamp
      from a prior timestamp and still get the correct delta time, within
      the 10-second limitation).  Having the APIs explicitly assume a 2's
      complement rollover made all the coding easy.  This worked extremely
      well in all respects.

    * The frequency didn't matter, as long as the requirements of the
      timestamp were met (i.e. 10 seconds within its data type).  This
      made manipulation of the timestamp very easy.  In particular,
      certain common frequencies (like 1 Hz) could simply be cached in
      globals.

    * All 'active' storage of the timestamp was encapsulated and handled
      by the API (in our case, one-shot and periodic SYSTIMERs).  This
      theoretically allowed timebase changes, but the more I consider the
      problem the more I believe that a *BETTER* way to handle timebase
      changes is to simply use a frequency fixed at boot time and
      translate at the hardware interface.  It turns out that if you use a
      single abstracted timebase type with a boot-time-fixed frequency in
      your APIs, the only place you actually have to translate it to the
      hardware timer is when you are actually reading or writing the
      hardware timer (more on this in the next point).

      In particular, take this example again with msleep(.... ticks).  In
      any system there will be many threads sleeping with a timeout, so
      the timer queue would look something like this (abstracted, this is
      not a data structure):

      [NEXT TIMEOUT] -> [LATER TIMEOUT] -> [LATER TIMEOUT] -> ...
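    The 2's-complement rollover arithmetic mentioned above can be
    sketched as follows (illustrative only, not the actual sysclock_t
    implementation): deltas are computed in unsigned arithmetic and
    reinterpreted as signed, so they stay correct across 2^32 wrap as
    long as the real difference fits inside the 10-second window.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t tick_t;

/*
 * Signed delta between two absolute timestamps.  The unsigned
 * subtraction wraps modulo 2^32, so the result is correct even when
 * 'later' has rolled over past 'earlier'.
 */
static tick_t
tick_delta(tick_t later, tick_t earlier)
{
    return ((tick_t)((uint32_t)later - (uint32_t)earlier));
}

/* A deadline has passed when the signed delta is non-negative. */
static int
tick_expired(tick_t now, tick_t deadline)
{
    return (tick_delta(now, deadline) >= 0);
}
```

    For example, tick_delta(16, (tick_t)0xFFFFFFF0) yields 32 even
    though the earlier timestamp sits just below the rollover point.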
      You can basically store the timeouts as absolute tick_t's without
      having to translate them to the hardware timer resolution.  Just do
      everything at tick_t's boot-time-fixed resolution.

    * It turns out that timer events which require reloading are almost
      always synchronized with the callback related to the timer event.
      So instead of forcing the callback procedure to 'read' the exact
      timestamp (which requires translation from the hardware timer
      source), we simply add a sysclock_t (tick_t in your case) argument
      to the callout procedure so the current time is already available to
      it.

      We only have to calculate the translation of the hardware timer when
      the actual hardware interrupt occurs, or when a newly installed
      (one-shot or periodic reload) timeout has a smaller count than
      timeouts already queued.  Considering other overheads, even a
      significant number of mathematical operations to do the translation
      are a drop in the bucket if you only have to do them once per timer
      interrupt.  This almost completely removed all extraneous
      multiplications and divisions from the critical code paths.

    Consider the deadline vs relative timestamp formats, and the APIs that
    allow either or both, very carefully.  Normalizing everything to a
    deadline in lower-level APIs reaps *HUGE* benefits.  But also consider
    very carefully any intent to remove coarse timestamp APIs.  I
    personally believe that BOTH fine-grained and coarse-timestamp APIs
    are needed.

    You could extend your tick_t abstraction to support both coarse and
    fine-grained APIs, as well as relative and absolute timestamps, by
    eating more bits out of tick_t (maybe don't use negative numbers in
    that case), but I have to caution, again, that any such representation
    should be translated to a deadline, and that low-level fine-grained
    APIs should remain separate from coarse-grained APIs.  This is the old
    'how long do I want to msleep() for' problem... it can be fine-grained
    (microseconds) or coarse-grained (minutes, even).
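    The pass-the-current-time-to-the-callback idea can be sketched like
    this.  The structure and function names are hypothetical (the real
    SYSTIMER structures differ); the point is that the hardware timer is
    read and translated ONCE per interrupt, and that value is handed to
    every expired callback rather than being re-read by each one:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int32_t sysclock_t;

/* Hypothetical callout record, kept in a list sorted by deadline. */
struct systimer {
    struct systimer *next;
    sysclock_t       deadline;       /* absolute, boot-time-fixed freq */
    sysclock_t       period;         /* 0 for one-shot */
    void           (*func)(struct systimer *, sysclock_t now);
};

/*
 * Dispatcher: 'now' was translated from the hardware timer exactly once
 * (at the hardware interrupt) and is passed straight to each callback.
 */
static void
systimer_dispatch(struct systimer **headp, sysclock_t now)
{
    struct systimer *st;

    while ((st = *headp) != NULL &&
           (sysclock_t)((uint32_t)now - (uint32_t)st->deadline) >= 0) {
        *headp = st->next;
        st->func(st, now);              /* current time passed in */
        if (st->period != 0) {
            st->deadline += st->period; /* drift-free periodic reload */
            /* re-insertion into the sorted queue omitted here */
        }
    }
}

/* tiny demo callback for the usage example below */
static int fired;
static void
demo_cb(struct systimer *st, sysclock_t now)
{
    (void)st; (void)now;
    fired++;
}
```

    With two queued one-shots at deadlines 50 and 200, dispatching at
    time 100 fires only the first; the second stays queued, and neither
    callback ever touched the hardware timer.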
    It probably isn't useful to represent frequencies greater than a few
    megahertz, but the limitation is not so much the highest frequency you
    want to represent as the maximum amount of time you want to represent
    in the data type.  A 32-bit integer with a 10-second limitation could
    represent intervals down to around 4.7 ns (2^31 rather than 2^32, so
    you can detect deadlines which have passed).  Since CPUs are now
    improving speed laterally (more cores) rather than in raw per-core
    processing power, and since a thread switch still takes at least
    500 ns even on the fastest CPU, I don't think this is an issue.

						-Matt
						Matthew Dillon