Date: Fri, 14 Dec 2012 15:42:18 +0100
From: Davide Italiano <davide@freebsd.org>
To: Oliver Pinter <oliver.pntr@gmail.com>
Cc: freebsd-current <freebsd-current@freebsd.org>, freebsd-arch@freebsd.org
Subject: Re: [RFC/RFT] calloutng
Message-ID: <CACYV=-F5bqVOqjV8AMe4%2BbE4SKstsGv4G_JqLvWf5S0CmKHRVA@mail.gmail.com>
In-Reply-To: <CAPjTQNGL_7LnffWB5bbEgW0b6ekOrVzH6QQ6e2=fCFW4%2BmF6FA@mail.gmail.com>
References: <CACYV=-F7_imU-JsPfeOZEyEPGKO2PVm1w1W3VdsH3jGiDvnmBg@mail.gmail.com>
 <CA%2BhQ2%2BgyhRHkB9Y%2BeGADvbjvJxSNSjYC%2BTQX8-0mf9LUD1V2HA@mail.gmail.com>
 <CACYV=-G9sG1Oo%2Bgz3kXmdeK85P7%2BZZg1CnAPLvwCuAbNftmv6A@mail.gmail.com>
 <CACYV=-EQ=G3JZOQ-9ExGT9spbEGtH5bJOrrgN2oeE2Qh3_rKag@mail.gmail.com>
 <CAPjTQNGL_7LnffWB5bbEgW0b6ekOrVzH6QQ6e2=fCFW4%2BmF6FA@mail.gmail.com>
On Fri, Dec 14, 2012 at 3:21 PM, Oliver Pinter <oliver.pntr@gmail.com> wrote:
> Hi!
>
> -	return tticks;
> +	getbinuptime(&pbt);
> +	bt.sec = data / 1000;
> +	bt.frac = (data % 1000) * (uint64_t)1844674407309000LL;
> +	bintime_add(&bt, &pbt);
> +	return bt;
>  }
>
> What is this 1844674407309000LL constant?
>
> @@ -275,7 +288,7 @@
>  	do {
>  		th = timehands;
>  		gen = th->th_generation;
> -		bintime2timeval(&th->th_offset, tvp);
> +		Bintime2timeval(&th->th_offset, tvp);
>  	} while (gen == 0 || gen != th->th_generation);
>  }
>
> Capital B is there possible a typo?
>

Hi Oliver,
thanks for reporting. Yes, both are typos. The constant is
/* 18446744073709 = int(2^64 / 1000000) */
used to convert from timeval to bintime.

> On 12/14/12, Davide Italiano <davide@freebsd.org> wrote:
>> On Fri, Dec 14, 2012 at 1:57 PM, Davide Italiano <davide@freebsd.org>
>> wrote:
>>> On Fri, Dec 14, 2012 at 7:41 AM, Luigi Rizzo <rizzo@iet.unipi.it> wrote:
>>>>
>>>> On Fri, Dec 14, 2012 at 12:12 AM, Davide Italiano <davide@freebsd.org>
>>>> wrote:
>>>>>
>>>>> Hi.
>>>>> This patch takes callout(9) and redesigns the KPI and the
>>>>> implementation. The main objective of this work is making the
>>>>> subsystem tickless. In the last several years, this possibility has
>>>>> been discussed widely (http://markmail.org/message/q3xmr2ttlzpqkmae),
>>>>> but until now no one really implemented that.
>>>>> If you want a complete history of what has been done in the last
>>>>> months you can check the calloutng project repository
>>>>> http://svnweb.freebsd.org/base/projects/calloutng/
>>>>> For lazy people, here's a summary:
>>>>
>>>>
>>>> thanks for the work and the detailed summary.
>>>> Perhaps it would be useful if you could provide a few high level
>>>> details on the use and performance of the new scheme, such as:
>>>>
>>>> - is the old callout KPI still available ?
>>>>   (i am asking because it would
>>>>   help maintaining third party kernel modules that are expected to
>>>>   work on different FreeBSD releases)
>>>>
>>>
>>> Obviously the old KPI is still available. callout(9) is a very popular
>>> interface and I don't think removing the old interface is a good idea,
>>> because it could make some vendor unhappy when its code doesn't build
>>> anymore on FreeBSD.
>>>
>>>> - do you have numbers on what is the fastest rate at which callouts
>>>>   can be fired (e.g. say you have a callout which increments a
>>>>   counter and schedules the next callout in (struct bintime){0,1} ) ?
>>>>
>>
>> Right now, all the services rely on the old interface. This means they
>> cannot be fired before 1 tick has elapsed, e.g. considering hz = 1000
>> on most of the machines, 1 millisecond.
>> Now that nanosleep() relies on the new interface, we measured 4-5
>> microseconds latency for the processing before the callout is actually
>> fired. I can't say if we can still lower this value, but I cannot
>> imagine, for now, a consumer that actually requests a shorter timeout.
>>
>>>>
>>>> - is there a possibility that if callout requests are too close to each
>>>>   other (e.g. the above test) the thread dispatching callouts will
>>>>   run forever ? if so, is there a way to make such thread yield
>>>>   after a while ?
>>>>
>>
>> Most of the processing is still done in a SWI thread, "at a later
>> time", so I don't think this is a problem.
>>
>>>> - since you mentioned nanosleep() poll() and select() have been
>>>>   ported to the new callout, is there a way to guarantee that user
>>>>   using these functions with a very short timeout are actually
>>>>   descheduled as opposed to "interval too short, don't bother" ?
>>>>
>>>> - do you have numbers on how many calls per second we can
>>>>   have for a process that does
>>>>      for (;;) { nanosleep(min_value_that_causes_descheduling);
>>>>
>>
>> I don't follow you here.
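For reference, the measurement asked about in the last question above could look roughly like the userland sketch below. It is only one reading of the question, not something taken from the patch: the helper names `now_ns()` and `avg_nanosleep_cost_ns()` are made up for illustration, and the sleep length passed in is arbitrary, since the thread does not say which value actually causes descheduling.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Current value of the monotonic clock, in nanoseconds. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/*
 * Hypothetical helper: issue `calls` back-to-back nanosleep() requests of
 * `sleep_ns` nanoseconds each and return the average wall-clock cost of
 * one call, in nanoseconds. `calls` must be greater than zero.
 */
static uint64_t
avg_nanosleep_cost_ns(unsigned calls, long sleep_ns)
{
	struct timespec req = { .tv_sec = 0, .tv_nsec = sleep_ns };
	uint64_t start = now_ns();

	for (unsigned i = 0; i < calls; i++)
		nanosleep(&req, NULL);
	return (now_ns() - start) / calls;
}
```

Dividing 1000000000 by the returned average gives the calls per second the question refers to.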
>>
>>>> I also have some comments on the diff:
>>>> - can you provide a diff -p ?
>>>>
>>>> - for several functions the only change is the name of an argument
>>>>   from "busy" to "us". Can you elaborate the reason for the change,
>>>>   and whether "us" means microseconds or the pronoun ?)
>>>>
>>>
>>> Please see r242905 by mav@.
>>>
>>>> Finally, a more substantial comment:
>>>> - a lot of functions which formerly had only a "timo" argument
>>>>   now have "timo, bt, precision, flags". Take seltdwait() as an example.
>>>>
>>>
>>> seltdwait() is not part of the public KPI. It has been modified to
>>> avoid code duplication. Having seltdwait() and seltdwait_bt(), i.e.
>>> two separate functions, even though we could share most of the code is
>>> not a clever approach, IMHO.
>>> As I said before, seltdwait() is not exposed so we can modify its
>>> arguments without breaking anything.
>>>
>>>> It seems that you have been undecided between two approaches:
>>>> for some of these functions you have preserved the original function
>>>> that deals with ticks and introduced a new one that deals with the
>>>> bintime,
>>>> whereas in other cases you have modified the original function to add
>>>> "bt, precision, flags".
>>>>
>>>
>>> I'm not. All the functions which are part of the public KPI (e.g.
>>> condvar(9), sleepq(9), *) are still available. *_flags variants have
>>> been introduced so that consumers can take advantage of the new
>>> 'precision tolerance mechanism' implemented. Also, *_bt variants have
>>> been introduced. I don't see any "undecision" between the two
>>> approaches.
>>> Please note that now the callout backend deals with bintime, so every
>>> time callout_reset_on() is called, the 'tick' argument passed is
>>> silently converted to bintime.
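A minimal userland sketch of what such a tick-to-bintime conversion might look like follows. The conversion code in the patch is not quoted in this thread, so everything here is an assumption: `struct bintime_sketch` is a stand-in for the kernel's `struct bintime` (whole seconds plus a 64-bit binary fraction in units of 2^-64 s), and `ticks_to_bintime()` is a hypothetical name. It only produces a relative interval; a real backend working with absolute times would still have to add the current uptime.

```c
#include <stdint.h>

/* Stand-in for the kernel's struct bintime: seconds plus a 2^-64 s fraction. */
struct bintime_sketch {
	int64_t  sec;
	uint64_t frac;
};

/*
 * Hypothetical helper: convert a relative timeout in ticks to a bintime
 * interval, given the tick rate hz (hz must be non-zero).
 */
static struct bintime_sketch
ticks_to_bintime(uint64_t ticks, uint32_t hz)
{
	struct bintime_sketch bt;
	/* One tick as a binary fraction of a second, about 2^64 / hz. */
	uint64_t frac_per_tick = UINT64_MAX / hz;

	bt.sec = (int64_t)(ticks / hz);
	/* (ticks % hz) < hz, so this product always fits in 64 bits. */
	bt.frac = (ticks % hz) * frac_per_tick;
	return bt;
}
```

With hz = 1000, 1500 ticks come out as 1 second plus a fraction just under 2^63, i.e. about half a second.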
>>>
>>>> I would suggest a more uniform approach, namely:
>>>> - preserve all the existing functions (T) that take a timeout in
>>>>   ticks;
>>>> - add a new set of corresponding functions (BT) that take
>>>>   bt, precision, flags _instead_ of the ticks
>>>> - the functions (T) make immediately the conversion from ticks to
>>>>   bintime(s), using macros or inline
>>>> - optionally, convert kernel components to the new (BT) functions
>>>>   where this makes sense (e.g. we can exploit the finer-granularity
>>>>   of the new calls, etc.)
>>>>
>>>
>>
>> This is the strategy we followed.
>>
>>>
>>>> cheers
>>>> luigi
>>>>
>>>>> 1) callout(9) is no longer constrained to the resolution a periodic
>>>>> "hz" clock can give. In order to do that, the eventtimers(4) subsystem
>>>>> is used as backend.
>>>>> 2) Contrary to what was discussed in the past, we maintained the callwheel
>>>>> as underlying data structure for keeping track of the outstanding
>>>>> timeouts. This choice has a couple of advantages, in particular we can
>>>>> still take benefits from the O(1) average complexity of the wheel for
>>>>> all the operations. Also, we thought the code duplication that would
>>>>> arise from the use of a two-staged backend for callout (e.g. use wheel
>>>>> for coarse resolution event and another data structure, such as a
>>>>> heap for high resolution events), is unacceptable. In fact, since
>>>>> callout gained the ability to migrate from a cpu to another, having a
>>>>> double backend would mean doubling the code for the migration path.
>>>>> 3) A way to dispatch interrupts from hardware interrupt context has
>>>>> been implemented, using special callout flag.
>>>>> This has limited
>>>>> applicability, but avoids the dispatching of a SWI thread for handling
>>>>> specific callouts, avoiding the wake up of another CPU for processing
>>>>> and a (relatively useless) context switch.
>>>>> 4) Since the new callout mechanism deals with bintime and not anymore
>>>>> with ticks, time is specified as absolute and not relative anymore. In
>>>>> order to get current time binuptime() or getbinuptime() is used, and a
>>>>> sysctl is introduced to selectively choose the function to use, based
>>>>> on a precision threshold.
>>>>> 5) A mechanism for specifying precision tolerance has been
>>>>> implemented. The callout processing mechanism has been adapted and the
>>>>> callout data structure augmented so that the codepath can take
>>>>> advantage and aggregate events which overlap in time.
>>>>>
>>>>>
>>>>> The new proposed KPI for callout is the following:
>>>>> callout_reset_bt_on(..., struct bintime time, struct bintime pr, ...,
>>>>>     int flags)
>>>>> where the 'time' argument represents the time at which the callout should
>>>>> fire, 'pr' represents the precision tolerance expressed as an absolute
>>>>> value, and 'flags', which could be used to specify new features, i.e.
>>>>> for now, the possibility to run the callout from fast interrupt
>>>>> context.
>>>>> The old KPI has been extended introducing the callout_reset_flags()
>>>>> function, which is the same as callout_reset*(), but takes an
>>>>> additional argument 'int flags' that can be used in the same fashion
>>>>> as the 'flags' argument for the new KPI. Using the 'flags' consumers
>>>>> can also specify relative precision tolerance in terms of power-of-two
>>>>> portion of the timeout passed as ticks.
>>>>> Using this strategy, the new precision mechanism can be used for the
>>>>> existing services without major modifications.
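The two ideas in the summary above, a tolerance given as a power-of-two portion of the timeout and the aggregation of events whose tolerance windows overlap, can be illustrated with the sketch below. It is not the patch's code: the function names are hypothetical, times are plain 64-bit integers in an arbitrary unit rather than `struct bintime`, and the sketch assumes a callout may fire anywhere in the window [time, time + pr], which the message does not spell out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: tolerance expressed as a power-of-two portion of the
 * timeout, i.e. timeout / 2^shift. A shift of 64 or more means no tolerance.
 */
static uint64_t
rel_precision(uint64_t timeout, unsigned shift)
{
	return shift < 64 ? timeout >> shift : 0;
}

/*
 * Hypothetical helper: two events with windows [a_start, a_end] and
 * [b_start, b_end] can be served by a single wakeup if the windows share
 * at least one instant.
 */
static bool
windows_overlap(uint64_t a_start, uint64_t a_end,
    uint64_t b_start, uint64_t b_end)
{
	return a_start <= b_end && b_start <= a_end;
}
```

For example, a timeout of 1000 units with shift 2 gets a tolerance of 250 units, so an event due at 100 with window [100, 125] could be handled together with one due at 120.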
>>>>>
>>>>> Some consumers have been ported to the new KPI, in particular
>>>>> nanosleep(), poll(), select(), because they take immediate advantage
>>>>> from the arbitrary precision offered by the new infrastructure.
>>>>> For some statistics about the outcome of the conversion to the new
>>>>> service, please refer to the end of this e-mail:
>>>>> http://lists.freebsd.org/pipermail/freebsd-arch/2012-July/012756.html
>>>>> We didn't measure any significant performance regressions with
>>>>> hwpmc(4), using some benchmark programs:
>>>>> http://people.freebsd.org/~davide/poll_test/poll_test.c
>>>>> http://people.freebsd.org/~mav/testsleep.c
>>>>> http://people.freebsd.org/~mav/testidle.c
>>>>>
>>>>> We tested the code on amd64, MIPS and arm. Any kind of testing or
>>>>> comment would be really appreciated. The full diff of the work against
>>>>> HEAD can be found at: http://people.freebsd.org/~davide/calloutng.diff
>>>>> If no one has objections, we plan to merge the repository to HEAD in a
>>>>> week or so.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Davide
>>>>> _______________________________________________
>>>>> freebsd-current@freebsd.org mailing list
>>>>> http://lists.freebsd.org/mailman/listinfo/freebsd-current
>>>>> To unsubscribe, send any mail to
>>>>> "freebsd-current-unsubscribe@freebsd.org"
>>>>
>>>>
>>>> --
>>>> -----------------------------------------+-------------------------------
>>>>  Prof. Luigi RIZZO, rizzo@iet.unipi.it  . Dip. di Ing. dell'Informazione
>>>>  http://www.iet.unipi.it/~luigi/        . Universita` di Pisa
>>>>  TEL      +39-050-2211611               . via Diotisalvi 2
>>>>  Mobile   +39-338-6809875               . 56122 PISA (Italy)
>>>> -----------------------------------------+-------------------------------
>>>>
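On the constant discussed at the top of this message, the arithmetic can be checked with a small userland sketch. The bintime fraction counts units of 2^-64 s, so one microsecond is floor(2^64 / 10^6) = 18446744073709 units, and one millisecond is roughly 1000 times that, 18446744073709000. The literal quoted from the patch, 1844674407309000, is one digit shorter (a "7" is missing), which is presumably the typo being acknowledged; that reading is an assumption, as the corrected line is not shown in the thread. The names `bintime_sketch`, `FRAC_PER_USEC`, `FRAC_PER_MSEC` and `ms_to_bintime()` are made up for illustration.

```c
#include <stdint.h>

/* Stand-in for the kernel's struct bintime: seconds plus a 2^-64 s fraction. */
struct bintime_sketch {
	int64_t  sec;
	uint64_t frac;
};

/* floor(2^64 / 10^6): one microsecond as a binary fraction of a second. */
#define FRAC_PER_USEC	UINT64_C(18446744073709)
/* Assumed intended multiplier for milliseconds: 1000 microseconds. */
#define FRAC_PER_MSEC	(FRAC_PER_USEC * 1000)

/* Hypothetical helper: convert a millisecond count to a bintime interval. */
static struct bintime_sketch
ms_to_bintime(uint64_t ms)
{
	struct bintime_sketch bt;

	bt.sec = (int64_t)(ms / 1000);
	/* (ms % 1000) <= 999, so the product stays below 2^64. */
	bt.frac = (ms % 1000) * FRAC_PER_MSEC;
	return bt;
}
```

With these values, 1500 ms becomes 1 second plus a fraction slightly below 2^63, the small shortfall coming from the truncated constant.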