From owner-freebsd-arch@FreeBSD.ORG  Wed Feb 29 19:41:20 2012
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 8B9301065675
	for <arch@freebsd.org>; Wed, 29 Feb 2012 19:41:20 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 323EB8FC08
	for <arch@freebsd.org>; Wed, 29 Feb 2012 19:41:19 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id 6FCC17300B; Wed, 29 Feb 2012 20:40:42 +0100 (CET)
Date: Wed, 29 Feb 2012 20:40:42 +0100
From: Luigi Rizzo <rizzo@iet.unipi.it>
To: arch@freebsd.org
Message-ID: <20120229194042.GA10921@onelab2.iet.unipi.it>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="EVF5PPMfhYS0aIcm"
Content-Disposition: inline
User-Agent: Mutt/1.4.2.3i
X-Content-Filtered-By: Mailman/MimeDel 2.1.5
Cc: 
Subject: select/poll/usleep precision on FreeBSD vs Linux vs OSX
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Feb 2012 19:41:20 -0000


--EVF5PPMfhYS0aIcm
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

I have always been annoyed by the fact that FreeBSD rounds timeouts
in select/usleep/poll in very conservative ways, so I decided to
test how other systems behave in this respect. Attached is a simple
program that you should be able to compile and run on various OSes
to see what happens.
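
For readers without the attachment, here is a minimal sketch of how one
such measurement could be taken. It is not the attached program: the
helper names (uptime_ns, measure_select_usec) and the choice of
clock_gettime(CLOCK_MONOTONIC) are assumptions made for illustration,
and older systems without clock_gettime() would need gettimeofday()
instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static long long uptime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: block in select() with no descriptors for
 * timeout_usec and return the time that actually elapsed, in
 * microseconds, or -1 if select() failed (e.g. EINTR).
 */
long long measure_select_usec(long timeout_usec)
{
	struct timeval tv;
	long long start, end;

	assert(timeout_usec >= 0);
	/* Build a fresh timeval each call: some systems modify it. */
	tv.tv_sec = timeout_usec / 1000000;
	tv.tv_usec = timeout_usec % 1000000;

	start = uptime_ns();
	if (select(0, NULL, NULL, NULL, &tv) < 0)
		return -1;
	end = uptime_ns();
	return (end - start) / 1000;
}
```

Calling measure_select_usec() once per requested timeout and printing
the returned value gives one column of a table like the one that
follows; the poll and usleep columns would use the same timing wrapper
around a different blocking call.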

Here are the results (HZ=1000 on the system under test; FreeBSD
has had the same behaviour since at least 4.11):

                |          Actual timeout (usec)         |
                |         select         | poll  | usleep|
        timeout | FBSD  | Linux | OSX    | FBSD  | FBSD  |
        usec    | 9.0   | Vbox  | 10.6   | 9.0   | 9.0   |
        --------+-------+-------+--------+-------+-------+
              1 |  2000 |    99 |      6 |     0 |  2000 |
             10 |  2000 |   109 |     15 |     0 |  2000 |
             50 |  2000 |   149 |     66 |     0 |  2000 |
            100 |  2000 |   196 |    133 |     0 |  2000 |
            500 |  2000 |   597 |    617 |     0 |  2000 |
           1000 |  2000 |  1103 |   1136 |  2000 |  2000 |
           1001 |  3000 |  1103 |   1136 |  2000 |  3000 | <---
           1500 |  3000 |  1608 |   1631 |  2000 |  3000 | <---
           2000 |  3000 |  2096 |   2127 |  3000 |  3000 |
           2001 |  4000 |       |        |  3000 |  4000 | <---
           3001 |  5000 |       |        |  4000 |  5000 | <---


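Note how the rounding affects the actual timeouts as soon as the
request goes past a multiple of 1/HZ (the rows marked <---). poll
takes its timeout in milliseconds, which is why requests below
1000 usec show up as 0 in that column.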
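
The FBSD columns of the table above are consistent with a simple rule:
round the request up to a whole number of ticks, then add one more
tick. The sketch below models that rule so the numbers can be checked
by hand. It is a model fitted to the measurements, not the kernel
source; the function names are made up, and the poll variant assumes
the test program truncated microseconds to whole milliseconds.

```c
#include <assert.h>

/*
 * Model of the select/usleep columns: ceil(timeout / tick) + 1 ticks.
 * ASSUMPTION: this reproduces the measured values; it is not a copy
 * of tvtohz().
 */
long model_sleep_usec(long timeout_usec, long hz)
{
	long tick_usec, ticks;

	assert(hz > 0 && 1000000 % hz == 0);
	tick_usec = 1000000 / hz;
	/* Round up to whole ticks, then add the extra tick. */
	ticks = (timeout_usec + tick_usec - 1) / tick_usec + 1;
	return ticks * tick_usec;
}

/*
 * Model of the poll column: the timeout is given in milliseconds,
 * so the sub-millisecond part is lost and 0 ms returns at once.
 */
long model_poll_usec(long timeout_usec, long hz)
{
	long ms = timeout_usec / 1000;

	if (ms == 0)
		return 0;
	return model_sleep_usec(ms * 1000, hz);
}
```

With hz = 1000, a request of 1000 usec is exactly 1 tick and becomes
2 ticks, while 1001 usec rounds up to 2 ticks and becomes 3, which is
the jump marked in the table.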

I know that until we have some hi-res interrupt source there is no
hope of doing better than 1/HZ granularity. However, we are doing
much worse by adding up to 2 extra ticks. This makes apps less
responsive than they could be, and gives us no way to
"yield until the next tick".

So what I would like to do is add a sysctl (disabled by
default) that enables a better approximation of the desired delay.

I see in the kernel that all three syscalls loop around a blocking
function (tsleep or seltdwait), and check the "actual" elapsed
time by calling getmicrouptime() or getnanouptime() around the
sleeping function. So the actual timeout passed to tsleep does
not really matter (as long as it is greater than 0).
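
The same idea can be tried in user space. The sketch below is only an
analogy to the kernel loops, with invented names (uptime_ns,
sleep_checked_ns): it naps in fixed chunks and re-reads the clock after
every wakeup, so the total delay is decided by the clock check and not
by the length of any single nap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: monotonic clock reading in nanoseconds. */
static long long uptime_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Sleep until at least timeout_ns has elapsed, napping chunk_ns at a
 * time. Returns the elapsed time in nanoseconds and stores the number
 * of naps in *naps. The overshoot is bounded by roughly one chunk
 * plus scheduling latency.
 */
long long sleep_checked_ns(long long timeout_ns, long long chunk_ns,
    int *naps)
{
	struct timespec nap;
	long long start, end, now;

	assert(chunk_ns > 0 && naps != NULL);
	nap.tv_sec = chunk_ns / 1000000000LL;
	nap.tv_nsec = chunk_ns % 1000000000LL;

	*naps = 0;
	start = uptime_ns();
	end = start + timeout_ns;
	for (;;) {
		now = uptime_ns();
		if (end - now <= 0)	/* deadline reached */
			break;
		/* An early wakeup is harmless: the clock is re-checked. */
		nanosleep(&nap, NULL);
		(*naps)++;
	}
	return now - start;
}
```

For example, sleep_checked_ns(2000000, 100000, &n) returns no earlier
than 2 ms after the call, whatever the chunk size, as long as the
chunk is greater than zero.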

The only concern is that getmicrouptime()/getnanouptime() are documented
as "less precise, but faster to obtain". The question is how precise is
"less precise": do we have some way to get an upper bound for the
precision of the timers used in get*time(), so we can use that value
in the equation instead of the extra 1/HZ that tvtohz() adds
after rounding timeout*HZ up to a whole number of ticks?


For reference, below is the core of usleep and select/poll
(from kern_time.c and sys_generic.c)

    usleep:
	getnanouptime(now);
	end = now + timeout;
	for (;;) {
		getnanouptime(now);	// re-read the clock on every pass
		delta = end - now;
		if (delta <= 0)		// deadline reached
			break;
		tsleep(..., tvtohz(delta));
	}

    select/poll:
	itimerfix(timeout);	// force at least 1/HZ
	getmicrouptime(now);
	end = now + timeout;
	for (;;) {
		delta = end - now;
		seltdwait(..., tvtohz(delta));
		getmicrouptime(now);	// re-read the clock after waking
		if (some_fd_is_ready() || now >= end)
			break;
	}

---

cheers
luigi

--EVF5PPMfhYS0aIcm--