From owner-freebsd-arch@FreeBSD.ORG  Thu Mar  1 04:45:30 2012
Date: Thu, 1 Mar 2012 15:45:11 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Bruce Evans <brde@optusnet.com.au>
In-Reply-To: <20120301132806.O2255@besplex.bde.org>
Message-ID: <20120301143042.F2406@besplex.bde.org>
References: <20120229194042.GA10921@onelab2.iet.unipi.it>
	<20120301071145.O879@besplex.bde.org>
	<20120301012315.GB14508@onelab2.iet.unipi.it>
	<20120301132806.O2255@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: arch@freebsd.org
Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX

On Thu, 1 Mar 2012, Bruce Evans wrote:

> ...
> Bakul Shah confirmed that Linux now reprograms the timer.  It has to,
> for a tickless kernel.  FreeBSD reprograms timers too.  I think you
> can set HZ large and only get timeout interrupts at that frequency if
> there are active timeouts that need them.  Timeout granularity is still
> 1/HZ.

I tried this in -current and in a 2008 -current with hz=10000.  It worked
mediocrely:
- the 2008 version gave lapic cpuN: timer interrupts on all CPUs at
   frequency of almost exactly 10 kHz.  This is the behaviour before
   FreeBSD reprogrammed timers (except the frequency is often off by
   as much as 10% due to calibration bugs).  There were many anomalies
   in the results from the test program (like select() adding 199 usec
   and usleep() adding 999 usec).
- current gives cpu0: timer interrupts at a frequency of almost
   exactly 10115 Hz, but only when I watch it using systat over the
   network (10000 is Hz and the other 115 is presumably for reprogramming).
   The other CPU gets many fewer interrupts.  When I stop watching, the
   rates drop towards 9900 for cpu0 and 120 for cpu1.  I hoped that there
   would be only about 50 timer interrupts on the mostly-idle machine.
- timeout granularity according to the test program was better than
   expected.  In almost all cases, the timeout was xx99 us.  E.g., 1
   becomes 200 after rounding up and adding 1 tick, and the result is 199
   (since there was 1 us of overhead and no jitter).  1000 became 1099
   since rounding up didn't increase it.  This is almost better than
   the OtherOS results (since it has no jitter).  I can probably easily
   beat OtherOS by setting hz to 100000.  But I think no jitter is too
   good to be good.
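
The rounding described in the last item can be sketched in C as below.
This is only an illustration of "round up to whole ticks, then add 1
tick" as inferred from the test results; effective_timeout_us() is a
hypothetical helper name, not a FreeBSD interface, and it assumes that
hz divides 1000000 evenly.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a FreeBSD API): upper bound, in microseconds,
 * on the sleep that results when a request is rounded up to whole ticks
 * and 1 extra tick is added.
 */
long
effective_timeout_us(long request_us, long hz)
{
	long tick_us, ticks;

	/* Assumption: hz divides one second evenly, as 1000 and 10000 do. */
	assert(request_us > 0 && hz > 0 && 1000000 % hz == 0);

	tick_us = 1000000 / hz;				/* tick length in us */
	ticks = (request_us + tick_us - 1) / tick_us;	/* round up */
	return ((ticks + 1) * tick_us);			/* add 1 tick */
}
```

With hz = 10000 this gives 200 for a request of 1 and 1100 for a
request of 1000, which is within 1 us of the observed 199 and 1099.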

This makes a design bug in poll() very clear.  poll() has a timeout
granularity of 1 ms, so you can't even ask for timeouts of less than
that.  Above 1 ms, the extra 99 or 199 us is good enough, and the default
of an extra 999 or 1999 us is not too bad.
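
To make the API difference concrete, here is a minimal sketch of the
same sleep requested through both calls.  The helper names
(us_to_poll_ms(), sleep_select_us(), sleep_poll_us()) are made up for
this example; only select() and poll() themselves are real interfaces.
select() takes a struct timeval, so it can at least express 1 us, while
poll() takes an int number of milliseconds, so anything below 1 ms has
to be rounded up before the kernel even sees it.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* poll() takes whole milliseconds, so round a us request up to ms. */
int
us_to_poll_ms(long us)
{
	assert(us >= 0);
	return ((int)((us + 999) / 1000));
}

/* Sleep using select() with no descriptors; timeval carries usec. */
int
sleep_select_us(long us)
{
	struct timeval tv;

	tv.tv_sec = us / 1000000;
	tv.tv_usec = us % 1000000;
	return (select(0, NULL, NULL, NULL, &tv));	/* 0 on timeout */
}

/* Sleep using poll() with no descriptors; at least 1 ms if us > 0. */
int
sleep_poll_us(long us)
{
	return (poll(NULL, 0, us_to_poll_ms(us)));	/* 0 on timeout */
}
```

A request of 1 us goes to select() as tv_usec = 1 but to poll() as a
timeout of 1 (ms), before any rounding to ticks is applied.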

A tickless kernel should have the equivalent of HZ = 0 on idle machines
and the equivalent of HZ = huge when something uses lots of timeouts.
The latter gives some security problems.  You don't want to reprogram
timers every 500 nsec when some untrusted application asks for timeouts
of 1000 nsec even if the system can support it.  When APIs are fixed to
catch up with 1988's timespecs, it will be possible to ask for timeouts
of 1 nsec and never get them but waste a lot of cycles.  Scheduling is
not good enough to disfavour CPU hogs that do things on the nanoseconds
scale.

I just remembered that precise timeouts are just what is needed for
hiding from schedulers.  stathz was supposed to be significantly
aperiodic and larger than hz so that CPU hogs couldn't use timeouts
(based on hz) to hide from schedulers (based on stathz).  This was
never fully implemented in FreeBSD, and was broken many years ago.
In FreeBSD, stathz was normally 128 and aperiodic, and just a little
larger than hz which was normally 100.  But someone broke hz to
default to 1000.  CPU hogs can now not so easily hide from schedulers
by getting timeouts every millisecond and running for about 6 or 7
milliseconds, then sleeping for 2 or 1 millisecond to miss scheduler
ticks.  With larger hz, the hogs get more control.  E.g., HZ = 10000
lets them sleep for only 200 or 100 usec every 7.81 msec to miss
scheduler ticks.  Reprogramming of timers in -current probably gives
significant jitter to timeout boundaries.  This can be handled by
sleeping for a slightly wider interval.  Also, fine-grained timeouts
allow simpler implementations of this: just wake up every
tick, and if you are close to a scheduler tick (which you can predict
since they are periodic), then go back to sleep for 1 timeout tick.
Since timeout ticks are short relative to scheduler ticks, you get
control again soon and then don't have to sleep again for many
timeout ticks.  No one cares about this because CPUs are now free :-).
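
The arithmetic behind those numbers can be written out as a small C
function.  This is only a worked calculation under the assumptions in
the paragraph above (strictly periodic scheduler ticks at stathz, and a
process that sleeps a fixed number of timeout ticks per scheduler
period); run_percent() is a name invented for this sketch.

```c
#include <assert.h>

/*
 * Percentage of each statclock period that a process can spend running
 * if it sleeps for sleep_ticks timeout ticks (of 1/hz each) per period.
 * Illustrative arithmetic only; assumes periodic statclock ticks.
 */
double
run_percent(int hz, int stathz, int sleep_ticks)
{
	double stat_period_us, sleep_us;

	assert(hz > 0 && stathz > 0 && sleep_ticks >= 0);

	stat_period_us = 1e6 / stathz;		/* 7812.5 us for 128 */
	sleep_us = sleep_ticks * (1e6 / hz);	/* time given up */
	assert(sleep_us <= stat_period_us);
	return (100.0 * (stat_period_us - sleep_us) / stat_period_us);
}
```

With stathz = 128, sleeping 2 ticks per period at hz = 1000 leaves
about 74% of the CPU to the hog, while the same 2 ticks at hz = 10000
leaves about 97%.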

-current has related fixes and complications in new timer code.  Even
without malicious CPU hogs, basing statclock and hardclock on the
same lapic timer made them too synchronous with each other.  The
quick fix was to use the i8254 again.  This gave a small amount
of asynchronicity which was apparently enough to fix the non-
malicious case.  I didn't like this, and tried to generate some
fake asynchronicity from a single lapic timer.  I think it is
possible to fake it well enough for the non-malicious case.  No
one followed up on this.  I haven't followed later developments.

Bruce