Date:      Sun, 24 Jun 2012 16:36:24 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alexander Motin <mav@FreeBSD.org>
Cc:        Davide Italiano <davide@FreeBSD.org>, src-committers@FreeBSD.org, svn-src-all@FreeBSD.org, svn-src-head@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>, Marius Strobl <marius@alchemy.franken.de>, Konstantin Belousov <kostikbel@gmail.com>
Subject:   Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys
Message-ID:  <20120624142958.C850@besplex.bde.org>
In-Reply-To: <4FE6209B.7050809@FreeBSD.org>
References:  <201206220713.q5M7DVH0063098@svn.freebsd.org> <20120622073455.GE69382@alchemy.franken.de> <20120622074817.GA2337@deviant.kiev.zoral.com.ua> <20120623131757.GB46065@alchemy.franken.de> <20120623140556.GU2337@deviant.kiev.zoral.com.ua> <20120624005418.W2417@besplex.bde.org> <4FE6209B.7050809@FreeBSD.org>

On Sat, 23 Jun 2012, Alexander Motin wrote:

> On 06/23/12 18:26, Bruce Evans wrote:
>> On Sat, 23 Jun 2012, Konstantin Belousov wrote:
>>> On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:
>>>> So apart from introducing code to constantly synchronize the
>>>> TICK counters, using the timecounters on the host busses also
>>>> seems to be the only viable solution for userland. The latter
>>>> should be doable but is long-winded as besides duplicating
>>>> portions of the corresponding device drivers in userland, it
>>>> probably also means to get some additional infrastructure
>>>> like being able to memory map registers for devices on the
>>>> nexus(4) level in place ...
>> 
>> There is little point in optimizations to avoid syscalls for hardware.
>> On x86, a syscall takes 100-400 nsec extra, so if the hardware takes
>> 500-2000 nsec then reducing the total time by 100-400 nsec is not
>> very useful.
>
> Just out of curiosity I've run my own binuptime() micro-benchmarks:
> - on Core i5-650:
>  TSC		  11ns
>  HPET	   	 433ns
>  ACPI-fast	 515ns
>  i8254	  	3736ns

The TSC is surprisingly fast and the others are depressingly slow,
although about the fastest I've seen for bus-based timecounters.

On Athlon64, rdtsc() takes 6.5 cycles, but I thought all P-state
invariant TSCs took > 40 cycles.  rdtsc() takes 65 cycles on FreeBSD
x86 cluster machines (core2 Xeon), except on freefall (P4(?) Xeon).

I hardly believe 11ns.  That's 44 cycles at 4GHz.  IIRC, the Athlon64
at 2.2GHz took 29nsec for binuptime() last time I measured it (long
ago, when it still had the statistics counter pessimization).

> - on dual-socket Xeon E5645:
>  TSC	    	  15ns
>  HPET	   	 580ns
>  ACPI-fast	1118ns
>  i8254	  	3911ns
>
> I think it could be useful to have that small benchmark in base kernel.

I think kib put one in src/tools for userland.  I mostly use a userland
one.  Except for the TSC, the overhead for the kernel parts can be
estimated accurately from userland, since it is so large.

This is more normal slowness for ACPI-[!]fast.  freefall still uses
ACPI-fast and it takes a minimum of 1396 and an average of 1729nsec
from userland (load average 1.3).  Other x86 cluster machines now use
TSC-[s]low, and it takes a minimum of 481 and an average of 533nsec
(now the swing from 481 to 533 is given by its gratuitous impreciseness
and not by system load).

BTW, the i8254 timecounter can be made about 3/2 times faster if anyone
cared, by reading only the low 8 bits of the timer.  This would require
running clock interrupts at >= 4kHz so that the top 8 bits are rarely
needed (great for a tickless kernel :-), or maybe by using a fuzzier
timer to determine when the top bits are needed.  At ~2500ns, it would
be only slightly slower than the slowest ACPI-fast, and faster than
ACPI-safe.
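
The arithmetic behind the 3/2: a full 16-bit read is a latch command
plus 2 data reads (3 ISA accesses), while a low-byte read is a latch
plus 1 read (2 accesses).  The catch is that software must then supply
the high bits itself.  A simulation of that bookkeeping (the hardware
and names here are stand-ins, not sys/x86/isa/clock.c):

```c
#include <stdint.h>

/*
 * Simulated i8254: a 16-bit down-counter.  Software maintains the high
 * bits by accumulating 8-bit deltas, which only works if it samples
 * before the low byte can wrap twice, i.e. at >= 1193182 / 256 ~= 4.7
 * kHz for the real i8254 -- the ">= 4kHz clock interrupts" requirement
 * above.
 */
static uint16_t hw_count = 0xffff;	/* stand-in for the hardware */

static uint8_t
read_low8(void)
{
	return (hw_count & 0xff);
}

static uint32_t sw_total;		/* software-extended count */
static uint8_t sw_last;

static void
sim_init(void)
{
	sw_total = 0;
	sw_last = read_low8();
}

/* Accumulate the 8-bit delta of the down-counter (old - new, mod 256). */
static uint32_t
sample(void)
{
	uint8_t now = read_low8();

	sw_total += (uint8_t)(sw_last - now);
	sw_last = now;
	return (sw_total);
}
```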

OTOH, I have measured i8254 timer reads taking 138000ns (on UP with
interrupts disabled) on a system where they normally take only 4000ns.
Apparently the ISA bus waits for other bus activity (DMA?) for that
long.  Does this happen for other buses?  Extra bridges for ISA can't
help.

>> ...
>> The new timeout code to support tickless kernels looks like it will give
>> large pessimizations unless the timecounter is fast.  Instead of using
>> the tick counter (1 atomic increment on every clock tick) and some
>> getbinuptime() calls in places like select(), it uses the hardware
>> timecounter via binuptime() in most places (since without a tick counter
>> and without clock interrupts updating the timehands periodically, it takes
>> a hardware timecounter read to determine the time).  So callout_reset()
>> might start taking thousands of nsec per call, depending on how slow
>> the timecounter is.  The fix is probably to use a fuzzy time for long
>> timeouts and to discourage use of short timeouts and/or to turn them
>> into long or fuzzy timeouts so that they are not very useful.
>
> The new timeout code is still in active development and optimization has
> not been the first priority yet. My idea was to use the much faster
> getbinuptime() for periods above, let's say, 100ms.

You would need to run non-tickless with a clock interrupt frequency
of >= 10Hz to keep getbinuptime() working.  Seems like a bad thing to
aim for.  Better not to use bintimes at all.  I would try using
pseudo-ticks (where the tick counter is advanced on every
not-very-periodic clock interrupt and at some other times when you
know that clock interrupts have been stopped, and maybe at other
interesting places (all interrupts and all syscalls?)).  Only call
binuptime() every few thousand pseudo-ticks to prevent long-term drift.
Timeouts would become longer and fuzzier than now, but that is a feature
(it inhibits using them for busy-waiting).  You know when you scheduled
clock interrupts and can advance the tick counter to represent the
interval between clock interrupts fairly accurately (say to within 10%).
The fuzziness comes mainly from not scheduling clock interrupts very
often, so that for example when something asks for a sleep of 1 tick
now, it might take 100 times longer because there isn't a clock interrupt
for 100 times longer.  You also should schedule clock interrupts just
because something asks for a short timeout.
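
A sketch of the pseudo-tick bookkeeping (simulation only; TICK_NS and
the names are mine, and the occasional binuptime() resync against
long-term drift is omitted):

```c
#include <stdint.h>

#define TICK_NS 10000000ULL	/* nominal 10 ms tick, i.e. HZ = 100 */

static uint64_t ticks;		/* pseudo-tick counter */
static uint64_t last_irq_ns;	/* time of the previous advance */

/*
 * Called from each (aperiodic) clock interrupt with a coarse idea of
 * "now".  The counter advances by however many nominal ticks elapsed,
 * so it stays roughly proportional to real time without periodic
 * interrupts; the sub-tick remainder is carried so no time is lost.
 */
static void
pseudo_tick_advance(uint64_t now_ns)
{
	uint64_t whole = (now_ns - last_irq_ns) / TICK_NS;

	ticks += whole;
	last_irq_ns += whole * TICK_NS;
}
```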

> The legacy ticks-oriented callout_reset() functions are by default not
> supposed to provide sub-tick resolution, and with some assumptions they
> could use getbinuptime().  For the new interfaces it depends on the
> caller how it gets the present time.

Even 1 tick is too short.  Using binuptime() encourages asking for much
shorter intervals.  Even for long sleeps, many places try to micro-sleep
for the residual time after waking up early.  E.g., nanosleep(), select()
and poll().  These places can also ask for an initial sleep with a
resolution of nsec, usec or msec, respectively.  If the timeout code
actually honors these requests, then it would generate lots of clock
interrupts and even more overhead by allowing more timeouts to
actually expire.  OTOH, supporting nano-sleeps allows nanosleep() to
actually approach its name.


> I understand that an integer tick counter is as fast as anything can ever
> be. But sorry, a 32-bit counter doesn't fit the present goals. To have more
> we need

On the contrary, it becomes more adequate than with periodic ticks, since
you need to reduce the tick frequency, so 32 bit works for longer.

> some artificial atomicity -- exactly what getbinuptime() implements.

Why would you need any atomicity?  Timeouts become fuzzier (because you
can't afford to generate clock interrupts to keep them as short as
possible, and want to generate even fewer clock interrupts than now).
Who cares if a non-atomic comparison results in more fuzziness?  OTOH,
if a timeout actually expires, it would be good to maintain the invariant
that it never expires early; some sort of clock that is known to never
run fast (relative to all earlier times measured on it) is needed to
ensure this, and some atomicity is required for that.  I think most
timeouts never expire (because most are for emergency conditions), so
checking the time accurately only when a timeout expires according
to a fuzzy clock may be efficient enough.  The problem is the initial
time read for converting a relative time to an absolute expiry time --
unless that is accurate, the expiry time is fuzzy.
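
One way to keep that invariant cheaply (a sketch of the idea, not
existing code; the two "clocks" here are plain variables standing in
for the tick counter and binuptime()):

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t fuzzy_now;	/* cheap, low-resolution, may lag */
static uint64_t precise_now;	/* expensive, never runs fast */

/*
 * Test a timeout against the fuzzy clock first; only when that says
 * the timeout may be due, confirm with the accurate clock.  Lateness
 * from a lagging fuzzy clock is accepted fuzziness; firing early is
 * prevented by the accurate confirmation.
 */
static bool
timeout_fired(uint64_t expiry)
{
	if (fuzzy_now < expiry)		/* cheap: clearly not due yet */
		return (false);
	return (precise_now >= expiry);	/* accurate: never fire early */
}
```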

> What I 
> would like to see there is tc_tick removal to make tc_windup() called for 
> every hardclock tick.

That would break it.  People can set HZ to 10KHz or more (I once tried
1MHz with lapic_timer, and it worked more or less correctly).  Calling
it that often would make the timehands cycle too fast, and tc_tick is
used to prevent this.  You could "fix" this by increasing the number
of timehands from 10 to 1000[0..] according to HZ.
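
The constraint comes from the lock-free generation protocol the
timehands use: a reader that gets lapped by the writer fails the
generation recheck, so the ring must be deep enough, relative to the
windup rate, that readers are essentially never lapped.  A minimal
single-threaded sketch (not the sys/kern/kern_tc.c code; memory
barriers and the real bintime fields are omitted):

```c
#include <stdint.h>

#define NTIMEHANDS 10

struct timehands {
	uint32_t	gen;	/* 0 while the slot is being rewritten */
	uint64_t	offset;	/* stand-in for the real timehands state */
};

static struct timehands ths[NTIMEHANDS];
static int th_idx;

/* Writer (the tc_windup() role): rewrite the next slot, then publish. */
static void
windup(uint64_t new_offset)
{
	struct timehands *th = &ths[(th_idx + 1) % NTIMEHANDS];

	th->gen = 0;			/* mark inconsistent */
	th->offset = new_offset;
	th->gen = ths[th_idx].gen + 1;	/* publish */
	th_idx = (th_idx + 1) % NTIMEHANDS;
}

/* Reader: snapshot, then retry if the slot changed underneath us. */
static uint64_t
read_time(void)
{
	struct timehands *th;
	uint32_t gen;
	uint64_t off;

	do {
		th = &ths[th_idx];
		gen = th->gen;
		off = th->offset;
	} while (gen == 0 || gen != th->gen);
	return (off);
}
```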

> Having new tick-irrelevant callout interfaces we 
> probably won't so much need to increase HZ too high any more, while this 
> simplification would make ticks and getbinuptime() precision equal, solving 
> some of your valid arguments against the last.

People increase HZ to do bogus polling.  A periodic timer for this is
at least as efficient as possible.  The periodic timer for this is
hung off hardclock() (hardclock() calls hardclock_device_poll() on every
tick iff DEVICE_POLLING is configured).  This avoids the overhead of
re-arming the periodic timer on every clock tick.  It would
be a good pessimization to use the standard callout interface for this.

A tickless kernel should go the other way and not call tc_ticktock()
every clock tick.  It should do the tc_tick counting itself (since
it uses virtual ticks or at least ticks of a highly variable and fuzzy
length and only it should have a vague idea about the tick lengths).
It needs to call tc_windup() often enough.  Only the timecounter
code really knows how often this is, but this must have some interaction
with the tickless code since lots of ticks (clock interrupts) may be
needed to keep timecounters working.  The main problem is that hardware
timecounters may wrap if you don't call them often enough.  This problem
is largest for the i8254 timecounter when the clock interrupt source is
also the i8254.  Then:
- clock interrupts must be scheduled at least as often as the i8254 wraps
   (at least every 54.9ms, so HZ must be > 18 with periodic ticks).
   Else the time must be recovered from somewhere else and the
   timecounter reinitialized before any timecounter interface is used
   again (same as after resume, except you might have to do this every
   54.9ms)
- the timecounter must be read at least as often as its hardware wraps.
   This is normally accomplished by reading it on every clock interrupt.
   tc_tick must be 1 for this.  A bug in the current initialization of
   tc_tick is now clear: suppose someone uses the i8254 timecounter and
   has HZ > 1000.  Then tc_tick > 1 and the timecounter is broken.  The
   i8254 can easily support much larger HZ than this.  I used to try
   to keep pcaudio maintained to test corner cases in timers.  pcaudio
   set the i8254 interrupt frequency to 62.5 kHz (?), though it only
   called hardclock at a frequency of HZ (normally 100).  The timecounter
   read in tc_windup() is now inadequate for avoiding wrap.  So the
   pcaudio low-level code virtualized the timecounter hardware by updating
   counts on every i8254 interrupt.  Only the low-level code sees every
   interrupt.  Other timecounter hardware that wraps too often should
   probably be handled similarly, but clock interrupts (possibly from
   another source) would be needed to keep it going.  I don't know of any
   other x86 hardware timecounters that wrap too fast.  The 32 bits of
   the TSC that are used wrap after ~1 second at 4GHz.
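
The wrap bounds above all follow from 2^N / f; a sanity check of the
54.9 ms and ~1 second figures (wrap_seconds() is my name):

```c
#include <stdint.h>

/*
 * An N-bit counter running at f Hz wraps every 2^N / f seconds;
 * tc_windup() (or the low-level virtualization described above) must
 * run more often than that.
 */
static double
wrap_seconds(unsigned bits, double freq_hz)
{
	return ((double)(1ULL << bits) / freq_hz);
}
```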

Apart from the problems with the i8254, there are few or no reasons to
call tc_windup() very often.  Every 1 second is probably enough.  I
always use HZ=100 and every 1/100 second is enough.   The default
is at most every 1/1000 second.  This seems to be mainly to keep the
i8254 working up to the default HZ although it is broken above that.
ntpd only updates things every 64 seconds or so.  By calling tc_windup()
very often, you mainly see each of these updates take effect after only
1-10 msec.  The most interesting case is after a leap second is inserted.
Now it is good to see the leap second immediately, and even a delay of
1 msec may be too long.

Bruce


