From: Oliver Pinter
To: Davide Italiano
Cc: freebsd-current, Luigi Rizzo, freebsd-arch@freebsd.org
Date: Fri, 14 Dec 2012 15:21:55 +0100
Subject: Re: [RFC/RFT] calloutng

Hi!

-       return tticks;
+       getbinuptime(&pbt);
+       bt.sec = data / 1000;
+       bt.frac = (data % 1000) * (uint64_t)1844674407309000LL;
+       bintime_add(&bt, &pbt);
+       return bt;
 }

What is this 1844674407309000LL constant?

@@ -275,7 +288,7 @@
        do {
                th = timehands;
                gen = th->th_generation;
-               bintime2timeval(&th->th_offset, tvp);
+               Bintime2timeval(&th->th_offset, tvp);
        } while (gen == 0 || gen != th->th_generation);
 }

Is the capital B possibly a typo?

On 12/14/12, Davide Italiano wrote:
> On Fri, Dec 14, 2012 at 1:57 PM, Davide Italiano wrote:
>> On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo wrote:
>>>
>>> On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano wrote:
>>>>
>>>> Hi.
>>>> This patch takes callout(9) and redesigns the KPI and the
>>>> implementation. The main objective of this work is making the
>>>> subsystem tickless. In the last several years, this possibility has
>>>> been discussed widely (http://markmail.org/message/q3xmr2ttlzpqkmae),
>>>> but until now no one has really implemented it.
>>>> If you want a complete history of what has been done in the last
>>>> months you can check the calloutng project repository
>>>> http://svnweb.freebsd.org/base/projects/calloutng/
>>>> For lazy people, here's a summary:
>>>
>>>
>>> thanks for the work and the detailed summary.
>>> Perhaps it would be useful if you could provide a few high level
>>> details on the use and performance of the new scheme, such as:
>>>
>>> - is the old callout KPI still available ? (i am asking because it would
>>> help maintaining third party kernel modules that are expected to
>>> work on different FreeBSD releases)
>>>
>>
>> Obviously the old KPI is still available. callout(9) is a very popular
>> interface and I don't think removing the old interface is a good idea,
>> because it could make some vendors unhappy when their code doesn't
>> build anymore on FreeBSD.
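
For reference, on the constant asked about at the top of this message:
struct bintime keeps whole seconds in sec and a 64-bit binary fraction of
a second (units of 2^-64 s) in frac, so one millisecond corresponds to
about 2^64 / 1000 = 18446744073709551 units. The constant in the quoted
hunk is roughly ten times smaller than that, which makes me suspect a
dropped digit, but that is only my reading and not something the patch
author has confirmed. A minimal user-space sketch of the arithmetic
follows; struct bintime_sketch, FRAC_PER_MS, ms_to_bintime() and
frac_to_ms() are names made up for illustration, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space stand-in for the kernel's struct bintime: whole seconds
 * plus a 64-bit binary fraction (units of 2^-64 s). Hypothetical name.
 */
struct bintime_sketch {
        int64_t         sec;
        uint64_t        frac;
};

/* floor((2^64 - 1) / 1000): 2^-64 s units per millisecond, rounded down. */
#define FRAC_PER_MS     (UINT64_MAX / 1000)

/* Split a millisecond count into seconds and a binary fraction. */
static struct bintime_sketch
ms_to_bintime(uint64_t ms)
{
        struct bintime_sketch bt;

        bt.sec = (int64_t)(ms / 1000);
        /* (ms % 1000) <= 999, so the product stays below 2^64. */
        bt.frac = (ms % 1000) * FRAC_PER_MS;
        return (bt);
}

/* Convert a binary fraction back to whole milliseconds (rounds down). */
static uint64_t
frac_to_ms(uint64_t frac)
{
        uint64_t ms = frac / FRAC_PER_MS;

        assert(ms <= 1000);
        return (ms);
}
```

With this scaling, 500 ms survives a round trip through frac_to_ms(),
whereas 500 multiplied by the constant from the patch comes back as
roughly 50 ms.
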
>>
>>> - do you have numbers on what is the fastest rate at which callouts
>>> can be fired (e.g. say you have a callout which increments a
>>> counter and schedules the next callout in (struct bintime){0,1} ) ?
>>>
>
> Right now, all the services rely on the old interface. This means they
> cannot be fired before 1 tick has elapsed, e.g. considering hz = 1000
> on most of the machines, 1 millisecond.
> Now that nanosleep() relies on the new interface, we measured 4-5
> microseconds latency for the processing before the callout is actually
> fired. I can't say if we can still lower this value, but I cannot
> imagine, for now, a consumer that actually requests a shorter timeout.
>
>>>
>>> - is there a possibility that if callout requests are too close to each
>>> other (e.g. the above test) the thread dispatching callouts will
>>> run forever ? if so, is there a way to make such thread yield
>>> after a while ?
>>>
>
> Most of the processing is still done in a SWI thread, "at a later
> time", so I don't think this is a problem.
>
>>> - since you mentioned nanosleep() poll() and select() have been
>>> ported to the new callout, is there a way to guarantee that users
>>> using these functions with a very short timeout are actually
>>> descheduled as opposed to "interval too short, don't bother" ?
>>>
>>> - do you have numbers on how many calls per second we can
>>> have for a process that does
>>> for (;;) { nanosleep(min_value_that_causes_descheduling);
>>>
>
> I don't follow you here.
>
>>> I also have some comments on the diff:
>>> - can you provide a diff -p ?
>>>
>>> - for several functions the only change is the name of an argument
>>> from "busy" to "us". Can you elaborate the reason for the change,
>>> and whether "us" means microseconds or the pronoun ?)
>>>
>>
>> Please see r242905 by mav@.
>>
>>> Finally, a more substantial comment:
>>> - a lot of functions which formerly had only a "timo" argument
>>> now have "timo, bt, precision, flags".
>>> Take seltdwait() as an example.
>>>
>>
>> seltdwait() is not part of the public KPI. It has been modified to
>> avoid code duplication. Having seltdwait() and seltdwait_bt(), i.e.
>> two separate functions, even though we could share most of the code is
>> not a clever approach, IMHO.
>> As I said before, seltdwait() is not exposed, so we can modify its
>> arguments without breaking anything.
>>
>>> It seems that you have been undecided between two approaches:
>>> for some of these functions you have preserved the original function
>>> that deals with ticks and introduced a new one that deals with the
>>> bintime,
>>> whereas in other cases you have modified the original function to add
>>> "bt, precision, flags".
>>>
>>
>> I'm not. All the functions which are part of the public KPI (e.g.
>> condvar(9), sleepq(9), *) are still available. *_flags variants have
>> been introduced so that consumers can take advantage of the new
>> 'precision tolerance mechanism' implemented. Also, *_bt variants have
>> been introduced. I don't see any "indecision" between the two
>> approaches.
>> Please note that now the callout backend deals with bintime, so every
>> time callout_reset_on() is called, the 'tick' argument passed is
>> silently converted to bintime.
>>
>>> I would suggest a more uniform approach, namely:
>>> - preserve all the existing functions (T) that take a timeout in
>>> ticks;
>>> - add a new set of corresponding functions (BT) that take
>>> bt, precision, flags _instead_ of the ticks
>>> - the functions (T) immediately make the conversion from ticks to
>>> bintime(s), using macros or inline
>>> - optionally, convert kernel components to the new (BT) functions
>>> where this makes sense (e.g. we can exploit the finer-granularity
>>> of the new calls, etc.)
>>>
>>
>
> This is the strategy we followed.
>
>>
>>
>>> cheers
>>> luigi
>>>
>>>> 1) callout(9) is no longer constrained to the resolution a periodic
>>>> "hz" clock can give.
>>>> In order to do that, the eventtimers(4) subsystem
>>>> is used as backend.
>>>> 2) Contrary to what was discussed in the past, we kept the callwheel
>>>> as the underlying data structure for keeping track of the outstanding
>>>> timeouts. This choice has a couple of advantages, in particular we can
>>>> still benefit from the O(1) average complexity of the wheel for
>>>> all the operations. Also, we thought the code duplication that would
>>>> arise from the use of a two-staged backend for callout (e.g. use the
>>>> wheel for coarse resolution events and another data structure, such
>>>> as a heap, for high resolution events) is unacceptable. In fact, since
>>>> callout gained the ability to migrate from one CPU to another, having
>>>> a double backend would mean doubling the code for the migration path.
>>>> 3) A way to dispatch interrupts from hardware interrupt context has
>>>> been implemented, using a special callout flag. This has limited
>>>> applicability, but avoids the dispatching of a SWI thread for handling
>>>> specific callouts, avoiding the wake up of another CPU for processing
>>>> and a (relatively useless) context switch.
>>>> 4) Since the new callout mechanism deals with bintime and no longer
>>>> with ticks, time is specified as absolute and not relative anymore. In
>>>> order to get the current time binuptime() or getbinuptime() is used,
>>>> and a sysctl is introduced to selectively choose the function to use,
>>>> based on a precision threshold.
>>>> 5) A mechanism for specifying precision tolerance has been
>>>> implemented. The callout processing mechanism has been adapted and the
>>>> callout data structure augmented so that the codepath can take
>>>> advantage of it and aggregate events which overlap in time.
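
To illustrate point 5 above, here is a rough sketch of how a precision
tolerance lets events be aggregated. It is not the calloutng code: it
assumes that a request may fire anywhere in [time, time + pr] (my
assumption about how the tolerance is interpreted), uses plain nanosecond
counters instead of struct bintime, and works on a sorted array rather
than the callwheel. struct timer_req and count_wakeups() are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical timer request: it may fire anywhere in
 * [time, time + pr]. Overflow of time + pr is ignored in this sketch.
 */
struct timer_req {
        uint64_t        time;   /* earliest acceptable firing time, ns */
        uint64_t        pr;     /* precision tolerance, absolute, ns */
};

/*
 * Count the wakeups needed when requests whose windows overlap are
 * served together. reqs[] must be sorted by ascending time.
 */
static size_t
count_wakeups(const struct timer_req *reqs, size_t n)
{
        uint64_t deadline = 0;  /* latest instant that suits the group */
        size_t i, wakeups = 0;

        for (i = 0; i < n; i++) {
                uint64_t end = reqs[i].time + reqs[i].pr;

                assert(i == 0 || reqs[i - 1].time <= reqs[i].time);
                if (wakeups == 0 || reqs[i].time > deadline) {
                        /* No overlap with the current group: new wakeup. */
                        wakeups++;
                        deadline = end;
                } else if (end < deadline) {
                        /* Joins the group and tightens its deadline. */
                        deadline = end;
                }
        }
        return (wakeups);
}
```

A request with pr = 0 can only share a wakeup with requests whose windows
contain exactly its own time, so the number of wakeups can only drop as
consumers allow more tolerance.
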
>>>>
>>>>
>>>> The new proposed KPI for callout is the following:
>>>> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ...,
>>>> int flags)
>>>> where the 'time' argument represents the time at which the callout
>>>> should fire, 'pr' represents the precision tolerance expressed as an
>>>> absolute value, and 'flags' can be used to specify new features, i.e.
>>>> for now, the possibility to run the callout from fast interrupt
>>>> context.
>>>> The old KPI has been extended by introducing the callout_reset_flags()
>>>> function, which is the same as callout_reset*(), but takes an
>>>> additional argument 'int flags' that can be used in the same fashion
>>>> as the 'flags' argument of the new KPI. Using the 'flags', consumers
>>>> can also specify relative precision tolerance in terms of a
>>>> power-of-two portion of the timeout passed as ticks.
>>>> Using this strategy, the new precision mechanism can be used for the
>>>> existing services without major modifications.
>>>>
>>>> Some consumers have been ported to the new KPI, in particular
>>>> nanosleep(), poll(), select(), because they take immediate advantage
>>>> of the arbitrary precision offered by the new infrastructure.
>>>> For some statistics about the outcome of the conversion to the new
>>>> service, please refer to the end of this e-mail:
>>>> http://lists.freebsd.org/pipermail/freebsd-arch/2012-July/012756.html
>>>> We didn't measure any significant performance regressions with
>>>> hwpmc(4), using some benchmark programs:
>>>> http://people.freebsd.org/~davide/poll_test/poll_test.c
>>>> http://people.freebsd.org/~mav/testsleep.c
>>>> http://people.freebsd.org/~mav/testidle.c
>>>>
>>>> We tested the code on amd64, MIPS and arm. Any kind of testing or
>>>> comment would be really appreciated.
>>>> The full diff of the work against
>>>> HEAD can be found at: http://people.freebsd.org/~davide/calloutng.diff
>>>> If no one has objections, we plan to merge the repository to HEAD in a
>>>> week or so.
>>>>
>>>> Thanks,
>>>>
>>>> Davide
>>>> _______________________________________________
>>>> freebsd-current@freebsd.org mailing list
>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>>>> To unsubscribe, send any mail to
>>>> "freebsd-current-unsubscribe@freebsd.org"
>>>
>>> --
>>> -----------------------------------------+-------------------------------
>>> Prof. Luigi RIZZO, rizzo@iet.unipi.it . Dip. di Ing. dell'Informazione
>>> http://www.iet.unipi.it/~luigi/ . Universita` di Pisa
>>> TEL +39-050-2211611 . via Diotisalvi 2
>>> Mobile +39-338-6809875 . 56122 PISA (Italy)
>>> -----------------------------------------+-------------------------------