From owner-freebsd-hackers@FreeBSD.ORG  Fri Jul 13 12:22:20 2012
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 7101F106566C;
	Fri, 13 Jul 2012 12:22:20 +0000 (UTC)
	(envelope-from davide.italiano@gmail.com)
Received: from mail-pb0-f54.google.com (mail-pb0-f54.google.com
	[209.85.160.54])
	by mx1.freebsd.org (Postfix) with ESMTP id 3CB318FC12;
	Fri, 13 Jul 2012 12:22:20 +0000 (UTC)
Received: by pbbro2 with SMTP id ro2so5959272pbb.13
	for <multiple recipients>; Fri, 13 Jul 2012 05:22:14 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
	h=mime-version:in-reply-to:references:date:message-id:subject:from:to
	:cc:content-type;
	bh=C7TcbJCHRDfoUyF3X26Ab8WxG2/TzmAUoXaDH1zIjUc=;
	b=ZzhcRXUp0Cv2WyNABJpHfg6U+TsmzG3srujxHxdkG3U7s8S7H4woBdKmL1H5Rwo5/n
	gB/ITqZxwt6h5Gcl+cAaY83HhWt7AGSF3sjYlXa3/uzhYGwWpfBQEjpem3kFuA1ZdSXz
	wD58Rx2yPTNC0ERWiWBuEbZpBSK/SJ7+WEnjWawraE+1bM05liK/wQxo1cGXYICxHm7w
	6ShXXwvY9J9sKU+Db1EwjoOlPDW/PKqqnUP0tYmN3ZApro8RWpfCsYQk1ya60ORkuskq
	WM90jutZr9qMDh6MPIQSBdiiz1ljk0DMusGYAs8ZG8nU5EsrLFs5GZ0wHJRPUZvHa59L
	BK4A==
MIME-Version: 1.0
Received: by 10.68.191.8 with SMTP id gu8mr3216977pbc.158.1342182134024; Fri,
	13 Jul 2012 05:22:14 -0700 (PDT)
Received: by 10.66.82.201 with HTTP; Fri, 13 Jul 2012 05:22:13 -0700 (PDT)
In-Reply-To: <201207121125.01228.jhb@freebsd.org>
References: <1342036332.8313.8.camel@albrecht-desktop>
	<201207121026.37849.jhb@freebsd.org>
	<CACYV=-G8Hip5-8G6Cyjw+GmrW3TvZMxZrKvEv7cUXuRXWyHwJw@mail.gmail.com>
	<201207121125.01228.jhb@freebsd.org>
Date: Fri, 13 Jul 2012 14:22:13 +0200
Message-ID: <CACYV=-HryymtawfW9g4LKQy0D2f1v-iyPah5JSfAFCzebhfnew@mail.gmail.com>
From: Davide Italiano <davide.italiano@gmail.com>
To: John Baldwin <jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1
Cc: Ian Lepore <freebsd@damnhippie.dyndns.org>,
	Paul Albrecht <albrecht@glccom.com>, freebsd-hackers@freebsd.org
Subject: Re: kqueue periodic timer confusion
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 13 Jul 2012 12:22:20 -0000

On Thu, Jul 12, 2012 at 5:25 PM, John Baldwin <jhb@freebsd.org> wrote:
> On Thursday, July 12, 2012 11:08:47 am Davide Italiano wrote:
>> On Thu, Jul 12, 2012 at 4:26 PM, John Baldwin <jhb@freebsd.org> wrote:
>> > On Thursday, July 12, 2012 9:57:16 am Ian Lepore wrote:
>> >> On Thu, 2012-07-12 at 08:34 -0400, John Baldwin wrote:
>> >> > On Wednesday, July 11, 2012 5:00:47 pm Ian Lepore wrote:
>> >> > > On Wed, 2012-07-11 at 14:52 -0500, Paul Albrecht wrote:
>> >> > > > Hi,
>> >> > > >
>> >> > > > Sorry about this repost but I'm confused about the responses I received
>> >> > > > in my last post so I'm looking for some clarification.
>> >> > > >
>> >> > > > Specifically, I though I could use the kqueue timer as essentially a
>> >> > > > "drop in" replacement for linuxfd_create/read, but was surprised that
>> >> > > > the accuracy of the kqueue timer is much less than what I need for my
>> >> > > > application.
>> >> > > >
>> >> > > > So my confusion at this point is whether this is consider to be a bug or
>> >> > > > "feature"?
>> >> > > >
>> >> > > > Here's some test code if you want to verify the problem:
>> >> > > >
>> >> > > > #include <stdio.h>
>> >> > > > #include <stdlib.h>
>> >> > > > #include <string.h>
>> >> > > > #include <unistd.h>
>> >> > > > #include <errno.h>
>> >> > > > #include <sys/types.h>
>> >> > > > #include <sys/event.h>
>> >> > > > #include <sys/time.h>
>> >> > > >
>> >> > > > int
>> >> > > > main(void)
>> >> > > > {
>> >> > > >         int i,msec;
>> >> > > >         int kq,nev;
>> >> > > >         struct kevent inqueue;
>> >> > > >         struct kevent outqueue;
>> >> > > >         struct timeval start,end;
>> >> > > >
>> >> > > >         if ((kq = kqueue()) == -1) {
>> >> > > >                 fprintf(stderr, "kqueue error!? errno = %s",
>> >> > strerror(errno));
>> >> > > >                 exit(EXIT_FAILURE);
>> >> > > >         }
>> >> > > >         EV_SET(&inqueue, 1, EVFILT_TIMER, EV_ADD | EV_ENABLE, 0, 20, 0);
>> >> > > >
>> >> > > >         gettimeofday(&start, 0);
>> >> > > >         for (i = 0; i < 50; i++) {
>> >> > > >                 if ((nev = kevent(kq, &inqueue, 1, &outqueue, 1, NULL)) ==
>> >> > -1) {
>> >> > > >                         fprintf(stderr, "kevent error!? errno = %s",
>> >> > strerror(errno));
>> >> > > >                         exit(EXIT_FAILURE);
>> >> > > >                 } else if (outqueue.flags & EV_ERROR) {
>> >> > > >                         fprintf(stderr, "EV_ERROR: %s\n",
>> >> > strerror(outqueue.data));
>> >> > > >                         exit(EXIT_FAILURE);
>> >> > > >                 }
>> >> > > >         }
>> >> > > >         gettimeofday(&end, 0);
>> >> > > >
>> >> > > >         msec = ((end.tv_sec - start.tv_sec) * 1000) + (((1000000 +
>> >> > end.tv_usec - start.tv_usec) / 1000) - 1000);
>> >> > > >
>> >> > > >         printf("msec = %d\n", msec);
>> >> > > >
>> >> > > >         close(kq);
>> >> > > >         return EXIT_SUCCESS;
>> >> > > > }
>> >> > > >
>> >> > > >
>> >> > >
>> >> > > What you are seeing is "just the way FreeBSD currently works."
>> >> > >
>> >> > > Sleeping (in most all of its various forms, and I've just looked at the
>> >> > > kevent code to verify this is true there) is handled by converting the
>> >> > > amount of time to sleep (usually specified in a timeval or timespec
>> >> > > struct) to a count of timer ticks, using an internal routine called
>> >> > > tvtohz() in kern/kern_time.c.  That routine rounds up by one tick to
>> >> > > account for the current tick.  Whether that's a good idea or not (it
>> >> > > probably was once, and probably not anymore) it's how things currently
>> >> > > work, and could explain the fairly consistant +1ms you're seeing.
>> >> >
>> >> > This is all true, but mostly irrelevant for his case.  EVFILT_TIMER
>> >> > installs a periodic callout that executes KNOTE() and then resets itself (via
>> >> > callout_reset()) each time it runs.  This should generally be closer to
>> >> > regulary spaced intervals than something that does:
>> >> >
>> >>
>> >> In what way is it irrelevant?  That is, what did I miss?  It appears to
>> >> me that the next callout is scheduled by calling timertoticks() passing
>> >> a count of milliseconds, that count is converted to a struct timeval and
>> >> passed to tvtohz() which is where the +1 adjustment happens.  If you ask
>> >> for 20ms and each tick is 1ms, then you'd get regular spacing of 21ms.
>> >> There is some time, likely a small number of microseconds, that you've
>> >> consumed of the current tick, and that's what the +1 in tvtohz() is
>> >> supposed to account for according to the comments.
>> >>
>> >> The tvtohz() routine both rounds up in the usual way (value+tick-1)/tick
>> >> and then adds one tick on top of that.  That seems not quite right to
>> >> me, except that it is a way to g'tee that you don't return early, and
>> >> that is the one promise made by sleep routines on any OS; those magical
>> >> "at least" words always appear in the docs.
>> >>
>> >> Actually what I'm missing (that I know of) is how the scheduler works.
>> >> Maybe the +1 adjustment to account for the fraction of the current tick
>> >> you've already consumed is the right thing to do, even when that
>> >> fraction is 1uS or less of a 1mS tick.  That would depend on scheduler
>> >> behavior that I know nothing about.
>> >
>> > Ohhhhh.  My bad, sorry.  You are correct.  It is a bug to use +1 in this
>> > case.  That is, the +1 makes sense when you are computing a one-time delta
>> > for things like nanosleep().  It is incorrect when computing a periodic
>> > delta such as for computing the interval for an itimer (setitimer) or
>> > EVFILT_TIMER().
>> >
>> > Hah, setitimer()'s callout (realitexpire) uses tvtohz - 1:
>> >
>> > sys/kern/kern_time.c:
>> >
>> > /*
>> >  * Real interval timer expired:
>> >  * send process whose timer expired an alarm signal.
>> >  * If time is not set up to reload, then just return.
>> >  * Else compute next time timer should go off which is > current time.
>> >  * This is where delay in processing this timeout causes multiple
>> >  * SIGALRM calls to be compressed into one.
>> >  * tvtohz() always adds 1 to allow for the time until the next clock
>> >  * interrupt being strictly less than 1 clock tick, but we don't want
>> >  * that here since we want to appear to be in sync with the clock
>> >  * interrupt even when we're delayed.
>> >  */
>> > void
>> > realitexpire(void *arg)
>> > {
>> >         struct proc *p;
>> >         struct timeval ctv, ntv;
>> >
>> >         p = (struct proc *)arg;
>> >         PROC_LOCK(p);
>> >         kern_psignal(p, SIGALRM);
>> >         if (!timevalisset(&p->p_realtimer.it_interval)) {
>> >                 timevalclear(&p->p_realtimer.it_value);
>> >                 if (p->p_flag & P_WEXIT)
>> >                         wakeup(&p->p_itcallout);
>> >                 PROC_UNLOCK(p);
>> >                 return;
>> >         }
>> >         for (;;) {
>> >                 timevaladd(&p->p_realtimer.it_value,
>> >                     &p->p_realtimer.it_interval);
>> >                 getmicrouptime(&ctv);
>> >                 if (timevalcmp(&p->p_realtimer.it_value, &ctv, >)) {
>> >                         ntv = p->p_realtimer.it_value;
>> >                         timevalsub(&ntv, &ctv);
>> >                         callout_reset(&p->p_itcallout, tvtohz(&ntv) - 1,
>> >                             realitexpire, p);
>> >                         PROC_UNLOCK(p);
>> >                         return;
>> >                 }
>> >         }
>> >         /*NOTREACHED*/
>> > }
>> >
>> > Paul, try this patch for sys/kern/kern_event.c.  It uses the same approach as
>> > seitimer() above:
>> >
>> > Index: kern_event.c
>> > ===================================================================
>> > --- kern_event.c        (revision 238365)
>> > +++ kern_event.c        (working copy)
>> > @@ -538,7 +538,7 @@ filt_timerexpire(void *knx)
>> >
>> >         if ((kn->kn_flags & EV_ONESHOT) != EV_ONESHOT) {
>> >                 calloutp = (struct callout *)kn->kn_hook;
>> > -               callout_reset_curcpu(calloutp, timertoticks(kn->kn_sdata),
>> > +               callout_reset_curcpu(calloutp, timertoticks(kn->kn_sdata) - 1,
>> >                     filt_timerexpire, kn);
>> >         }
>> >  }
>> >
>> > --
>> > John Baldwin
>> > _______________________________________________
>> > freebsd-hackers@freebsd.org mailing list
>> > http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>> > To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
>>
>> John,
>> I don't think it's good to decrease by a unit the 'ticks' you pass to
>> callout_reset_* KPI.
>> If this have to be fixed it should be fixed at the callout level and
>> not at the consumer level. In other words, subsystems that makes use
>> of callout_reset_* should not deal with the inherent limitations of
>> callout precision, as it is right now.
>
> Given that you are reworking callout already, it would seem a bit odd
> to rework callout a separate time just to fix this bug.  A simple fix
> like this (which follows the same pattern we already use for setitimer())
> is something we can easily merge back to 8 and 9.  Reworking callout in
> a different way than you are already doing, OTOH, is not something we can
> merge trivially.
>
> --
> John Baldwin

I understand what you mean. Indeed I hadn't thought about the
difficulties of merging that work back. Only a thing: if I'd were you
before committing I'd add a comment explaining the reasons of the
negative correction in the timeout value passed.

Davide