From: Davide Italiano
To: Luigi Rizzo
Cc: freebsd-current, freebsd-arch@freebsd.org
Date: Fri, 14 Dec 2012 14:13:05 +0100
Subject: Re: [RFC/RFT] calloutng
List-Id: Discussions about the use of FreeBSD-current

On Fri, Dec 14, 2012 at 1:57 PM, Davide Italiano wrote:
> On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo wrote:
>>
>> On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano
>> wrote:
>>>
>>> Hi.
>>> This patch takes callout(9) and redesigns the KPI and the
>>> implementation. The main objective of this work is making the
>>> subsystem tickless. In the last several years, this possibility has
>>> been discussed widely (http://markmail.org/message/q3xmr2ttlzpqkmae),
>>> but until now no one really implemented it.
>>> If you want a complete history of what has been done in the last
>>> months you can check the calloutng project repository
>>> http://svnweb.freebsd.org/base/projects/calloutng/
>>> For lazy people, here's a summary:
>>
>>
>> thanks for the work and the detailed summary.
>> Perhaps it would be useful if you could provide a few high level
>> details on the use and performance of the new scheme, such as:
>>
>> - is the old callout KPI still available ? (i am asking because it would
>>   help maintaining third party kernel modules that are expected to
>>   work on different FreeBSD releases)
>>
>
> Obviously the old KPI is still available. callout(9) is a very popular
> interface and I don't think removing the old interface is a good idea,
> because it could make some vendors unhappy when their code no longer
> builds on FreeBSD.
>
>> - do you have numbers on what is the fastest rate at which callouts
>>   can be fired (e.g. say you have a callout which increments a
>>   counter and schedules the next callout in (struct bintime){0,1} ) ?
>>

Right now, all the services rely on the old interface. This means they
cannot be fired before 1 tick has elapsed, e.g. 1 millisecond with
hz = 1000, as on most machines. Now that nanosleep() relies on
the new interface, we measured 4-5 microseconds of processing latency
before the callout is actually fired.
I can't say whether we can lower this value further, but for now I cannot
imagine a consumer that actually requests a shorter timeout.

>>
>> - is there a possibility that if callout requests are too close to each
>>   other (e.g. the above test) the thread dispatching callouts will
>>   run forever ? if so, is there a way to make such thread yield
>>   after a while ?
>>

Most of the processing is still done in a SWI thread, "at a later time",
so I don't think this is a problem.

>> - since you mentioned nanosleep() poll() and select() have been
>>   ported to the new callout, is there a way to guarantee that user
>>   using these functions with a very short timeout are actually
>>   descheduled as opposed to "interval too short, don't bother" ?
>>
>> - do you have numbers on how many calls per second we can
>>   have for a process that does
>>       for (;;) { nanosleep(min_value_that_causes_descheduling); }
>>

I don't follow you here.

>> I also have some comments on the diff:
>> - can you provide a diff -p ?
>>
>> - for several functions the only change is the name of an argument
>>   from "busy" to "us". Can you elaborate the reason for the change,
>>   and whether "us" means microseconds or the pronoun ?)
>>
>
> Please see r242905 by mav@.
>
>> Finally, a more substantial comment:
>> - a lot of functions which formerly had only a "timo" argument
>>   now have "timo, bt, precision, flags". Take seltdwait() as an example.
>>
>
> seltdwait() is not part of the public KPI. It has been modified to
> avoid code duplication. Having two separate functions, seltdwait() and
> seltdwait_bt(), even though they could share most of the code, is
> not a clever approach, IMHO.
> As I said, seltdwait() is not exposed, so we can modify its
> arguments without breaking anything.
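
For the nanosleep() loop asked about above, a minimal userland sketch of
such a measurement might look like the following. It is only an
illustration under stated assumptions: the helper names now_ns() and
count_nanosleeps() are made up here, only POSIX clock_gettime() and
nanosleep() are used, and the result says nothing about whether the
thread was actually descheduled.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds (hypothetical helper). */
static int64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ((int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec);
}

/*
 * Call nanosleep() with a request of sleep_ns nanoseconds in a loop for
 * roughly run_ns nanoseconds and return how many calls completed.
 * sleep_ns must be below one second, since it is stored in tv_nsec.
 */
static long
count_nanosleeps(long sleep_ns, int64_t run_ns)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = sleep_ns };
	int64_t end = now_ns() + run_ns;
	long calls = 0;

	while (now_ns() < end) {
		(void)nanosleep(&req, NULL);
		calls++;
	}
	return (calls);
}
```

Calling count_nanosleeps(1000, 1000000000LL) would give the number of
1-microsecond sleep requests completed in about one second; dividing the
run time by that count gives an upper bound on the per-call latency.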
>
>> It seems that you have been undecided between two approaches:
>> for some of these functions you have preserved the original function
>> that deals with ticks and introduced a new one that deals with the
>> bintime,
>> whereas in other cases you have modified the original function to add
>> "bt, precision, flags".
>>
>
> I'm not. All the functions which are part of the public KPI (e.g.
> condvar(9), sleepq(9), *) are still available. *_flags variants have
> been introduced so that consumers can take advantage of the new
> 'precision tolerance mechanism'. Also, *_bt variants have
> been introduced. I don't see any indecision between the two
> approaches.
> Please note that the callout backend now deals with bintime, so every
> time callout_reset_on() is called, the 'tick' argument passed is
> silently converted to bintime.
>
>> I would suggest a more uniform approach, namely:
>> - preserve all the existing functions (T) that take a timeout in ticks;
>> - add a new set of corresponding functions (BT) that take
>>   bt, precision, flags _instead_ of the ticks
>> - the functions (T) make immediately the conversion from ticks to
>>   bintime(s), using macros or inline
>> - optionally, convert kernel components to the new (BT) functions
>>   where this makes sense (e.g. we can exploit the finer-granularity
>>   of the new calls, etc.)
>>
> This is the strategy we followed.
>
>
>> cheers
>> luigi
>>
>>> 1) callout(9) is no longer constrained to the resolution a periodic
>>> "hz" clock can give. In order to do that, the eventtimers(4) subsystem
>>> is used as backend.
>>> 2) Contrary to what was discussed in the past, we kept the callwheel
>>> as the underlying data structure for keeping track of the outstanding
>>> timeouts. This choice has a couple of advantages, in particular we can
>>> still benefit from the O(1) average complexity of the wheel for
>>> all the operations.
>>> Also, we thought the code duplication that would
>>> arise from the use of a two-staged backend for callout (e.g. use the wheel
>>> for coarse resolution events and another data structure, such as a
>>> heap, for high resolution events) is unacceptable. In fact, since
>>> callout gained the ability to migrate from one CPU to another, having a
>>> double backend would mean doubling the code for the migration path.
>>> 3) A way to dispatch interrupts from hardware interrupt context has
>>> been implemented, using a special callout flag. This has limited
>>> applicability, but avoids dispatching a SWI thread for handling
>>> specific callouts, which saves the wake up of another CPU for processing
>>> and a (relatively useless) context switch.
>>> 4) Since the new callout mechanism deals with bintime and no longer
>>> with ticks, time is specified as absolute and not relative anymore. In
>>> order to get the current time binuptime() or getbinuptime() is used, and a
>>> sysctl is introduced to selectively choose the function to use, based
>>> on a precision threshold.
>>> 5) A mechanism for specifying precision tolerance has been
>>> implemented. The callout processing mechanism has been adapted and the
>>> callout data structure augmented so that the codepath can take
>>> advantage of it and aggregate events which overlap in time.
>>>
>>>
>>> The new proposed KPI for callout is the following:
>>> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ..., int
>>> flags)
>>> where the 'time' argument represents the time at which the callout should
>>> fire, 'pr' represents the precision tolerance expressed as an absolute
>>> value, and 'flags' can be used to specify new features, i.e.
>>> for now, the possibility to run the callout from fast interrupt
>>> context.
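
To make the absolute 'time' plus 'pr' tolerance idea above more
concrete, here is a small userland model. It is a sketch, not the code
from the patch: struct bt is a stand-in for the kernel's struct bintime
(seconds plus a 64-bit binary fraction), and ticks_to_bt(), bt_add(),
bt_cmp(), struct bt_event and can_aggregate() are hypothetical names. It
also assumes that a callout may fire anywhere in [time, time + pr] and
that ticks >= 0 and hz > 0.

```c
#include <stdint.h>

/* Stand-in for struct bintime: frac counts units of 2^-64 second. */
struct bt {
	int64_t  sec;
	uint64_t frac;
};

/* Relative bintime for 'ticks' ticks of a clock running at 'hz' Hz. */
static struct bt
ticks_to_bt(int64_t ticks, int hz)
{
	struct bt r;

	r.sec = ticks / hz;
	/* (ticks % hz) / hz seconds, scaled by 2^64 and rounded down. */
	r.frac = (uint64_t)(ticks % hz) * (UINT64_MAX / (uint64_t)hz);
	return (r);
}

/* a + b, propagating the carry out of the fraction into the seconds. */
static struct bt
bt_add(struct bt a, struct bt b)
{
	struct bt r;

	r.frac = a.frac + b.frac;	/* wraps modulo 2^64 */
	r.sec = a.sec + b.sec + (r.frac < a.frac);
	return (r);
}

/* Three-way comparison: -1, 0 or 1. */
static int
bt_cmp(struct bt a, struct bt b)
{
	if (a.sec != b.sec)
		return (a.sec < b.sec ? -1 : 1);
	if (a.frac != b.frac)
		return (a.frac < b.frac ? -1 : 1);
	return (0);
}

/* An event with an absolute firing time and an absolute tolerance. */
struct bt_event {
	struct bt time;
	struct bt pr;
};

/* Two events can share one wakeup if their allowed windows intersect. */
static int
can_aggregate(struct bt_event a, struct bt_event b)
{
	struct bt a_end = bt_add(a.time, a.pr);
	struct bt b_end = bt_add(b.time, b.pr);

	return (bt_cmp(a.time, b_end) <= 0 && bt_cmp(b.time, a_end) <= 0);
}
```

With this model, a legacy request of N ticks becomes the absolute time
bt_add(now, ticks_to_bt(N, hz)), and two events whose windows intersect
can be served by a single timer interrupt instead of two.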
>>> The old KPI has been extended by introducing the callout_reset_flags()
>>> function, which is the same as callout_reset*(), but takes an
>>> additional argument 'int flags' that can be used in the same fashion
>>> as the 'flags' argument for the new KPI. Using 'flags', consumers
>>> can also specify a relative precision tolerance in terms of a power-of-two
>>> portion of the timeout passed as ticks.
>>> Using this strategy, the new precision mechanism can be used for the
>>> existing services without major modifications.
>>>
>>> Some consumers have been ported to the new KPI, in particular
>>> nanosleep(), poll(), select(), because they take immediate advantage
>>> of the arbitrary precision offered by the new infrastructure.
>>> For some statistics about the outcome of the conversion to the new
>>> service, please refer to the end of this e-mail:
>>> http://lists.freebsd.org/pipermail/freebsd-arch/2012-July/012756.html
>>> We didn't measure any significant performance regressions with
>>> hwpmc(4), using some benchmark programs:
>>> http://people.freebsd.org/~davide/poll_test/poll_test.c
>>> http://people.freebsd.org/~mav/testsleep.c
>>> http://people.freebsd.org/~mav/testidle.c
>>>
>>> We tested the code on amd64, MIPS and arm. Any kind of testing or
>>> comment would be really appreciated. The full diff of the work against
>>> HEAD can be found at: http://people.freebsd.org/~davide/calloutng.diff
>>> If no one has objections, we plan to merge the repository to HEAD in a
>>> week or so.
>>>
>>> Thanks,
>>>
>>> Davide
>>> _______________________________________________
>>> freebsd-current@freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>>> To unsubscribe, send any mail to "freebsd-current-unsubscribe@freebsd.org"
>>
>>
>>
>>
>> --
>> -----------------------------------------+-------------------------------
>>  Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
>>  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>>  TEL      +39-050-2211611               . via Diotisalvi 2
>>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
>> -----------------------------------------+-------------------------------
>>
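
As an illustration of the "power-of-two portion of the timeout" wording
in the quoted announcement, the relative tolerance can be thought of as
a right shift of the timeout. The helper below is hypothetical and does
not reproduce how the patch encodes the value in 'flags'; it only shows
the arithmetic.

```c
/*
 * Hypothetical helper, not the kernel's encoding: tolerance in ticks for
 * a timeout of 'timo' ticks when the caller accepts a 1/2^shift portion
 * of the timeout as slack. Returns 0 for non-positive timeouts or for
 * shifts as wide as the type.
 */
static int
rel_precision_ticks(int timo, unsigned shift)
{
	if (timo <= 0 || shift >= sizeof(int) * 8)
		return (0);
	return (timo >> shift);
}
```

For example, a 1000-tick timeout with shift 3 allows 1000 / 8 = 125
ticks of slack, which is the kind of window the aggregation logic could
use to coalesce nearby events.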