Date: Sun, 24 Jun 2012 16:36:24 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin <mav@FreeBSD.org>
Cc: Davide Italiano <davide@FreeBSD.org>, src-committers@FreeBSD.org,
    svn-src-all@FreeBSD.org, svn-src-head@FreeBSD.org,
    Bruce Evans <brde@optusnet.com.au>,
    Marius Strobl <marius@alchemy.franken.de>,
    Konstantin Belousov <kostikbel@gmail.com>
Subject: Re: svn commit: r237434 - in head/lib/libc: amd64/sys gen i386/sys include sys
Message-ID: <20120624142958.C850@besplex.bde.org>
In-Reply-To: <4FE6209B.7050809@FreeBSD.org>
References: <201206220713.q5M7DVH0063098@svn.freebsd.org>
    <20120622073455.GE69382@alchemy.franken.de>
    <20120622074817.GA2337@deviant.kiev.zoral.com.ua>
    <20120623131757.GB46065@alchemy.franken.de>
    <20120623140556.GU2337@deviant.kiev.zoral.com.ua>
    <20120624005418.W2417@besplex.bde.org>
    <4FE6209B.7050809@FreeBSD.org>
On Sat, 23 Jun 2012, Alexander Motin wrote:

> On 06/23/12 18:26, Bruce Evans wrote:
>> On Sat, 23 Jun 2012, Konstantin Belousov wrote:
>>> On Sat, Jun 23, 2012 at 03:17:57PM +0200, Marius Strobl wrote:
>>>> So apart from introducing code to constantly synchronize the
>>>> TICK counters, using the timecounters on the host busses also
>>>> seems to be the only viable solution for userland. The latter
>>>> should be doable but is long-winded as besides duplicating
>>>> portions of the corresponding device drivers in userland, it
>>>> probably also means to get some additional infrastructure
>>>> like being able to memory map registers for devices on the
>>>> nexus(4) level in place ...
>>
>> There is little point in optimizations to avoid syscalls for hardware.
>> On x86, a syscall takes 100-400 nsec extra, so if the hardware takes
>> 500-2000 nsec then reducing the total time by 100-400 nsec is not
>> very useful.
>
> Just out of curiosity I've run my own binuptime() micro-benchmarks:
> - on Core i5-650:
>   TSC          11ns
>   HPET        433ns
>   ACPI-fast   515ns
>   i8254      3736ns

The TSC is surprisingly fast and the others are depressingly slow,
although about the fastest I've seen for bus-based timecounters. On
Athlon64, rdtsc() takes 6.5 cycles, but I thought all P-state invariant
TSCs took > 40 cycles. rdtsc() takes 65 cycles on FreeBSD x86 cluster
machines (core2 Xeon), except on freefall (P4(?) Xeon). I hardly believe
11ns. That's 44 cycles at 4GHz. IIRC, the Athlon64 at 2.2GHz took 29nsec
for binuptime() last time I measured it (long ago, when it still had the
statistics counter pessimization).

> - on dual-socket Xeon E5645:
>   TSC          15ns
>   HPET        580ns
>   ACPI-fast  1118ns
>   i8254      3911ns
>
> I think it could be useful to have that small benchmark in base kernel.

I think kib put one in src/tools for userland. I mostly use a userland
one. Except for the TSC, the overhead for the kernel parts can be
estimated accurately from userland, since it is so large.
This is more normal slowness for ACPI-[!]fast. freefall still uses
ACPI-fast and it takes a minimum of 1396 and an average of 1729nsec from
userland (load average 1.3). Other x86 cluster machines now use
TSC-[s]low, and it takes a minimum of 481 and an average of 533nsec (now
the swing from 481 to 533 is given by its gratuitous impreciseness and
not by system load).

BTW, the i8254 timecounter can be made about 3/2 times faster if anyone
cared, by reading only the low 8 bits of the timer. This would require
running clock interrupts at >= 4kHz so that the top 8 bits are rarely
needed (great for a tickless kernel :-), or maybe by using a fuzzier
timer to determine when the top bits are needed. At ~2500ns, it would be
only slightly slower than the slowest ACPI-fast, and faster than
ACPI-safe. OTOH, I have measured i8254 timer reads taking 138000ns (on
UP with interrupts disabled) on a system where they normally take only
4000ns. Apparently the ISA bus waits for other bus activity (DMA?) for
that long. Does this happen for other buses? Extra bridges for ISA can't
help.

>> ...
>> The new timeout code to support tickless kernels looks like it will
>> give large pessimizations unless the timecounter is fast. Instead of
>> using the tick counter (1 atomic increment on every clock tick) and
>> some getbinuptime() calls in places like select(), it uses the
>> hardware timecounter via binuptime() in most places (since without a
>> tick counter and without clock interrupts updating the timehands
>> periodically, it takes a hardware timecounter read to determine the
>> time). So callout_reset() might start taking thousands of nsec per
>> call, depending on how slow the timecounter is. The fix is probably
>> to use a fuzzy time for long timeouts and to discourage use of short
>> timeouts and/or to turn them into long or fuzzy timeouts so that
>> they are not very useful.
> The new timeout code is still in active development and optimization
> was not the first priority yet. My idea was to use much faster
> getbinuptime() for periods above let's say 100ms.

You would need to run non-tickless with a clock interrupt frequency of
>= 10Hz to keep getbinuptime() working. Seems like a bad thing to aim
for. Better not use bintimes at all. I would try using pseudo-ticks
(where the tick counter is advanced on every not-very-periodic clock
interrupt and at some other times when you know that clock interrupts
have been stopped, and maybe at other interesting places (all interrupts
and all syscalls?)). Only call binuptime() every few thousand
pseudo-ticks to prevent long-term drift. Timeouts would become longer
and fuzzier than now, but that is a feature (it inhibits using them for
busy-waiting). You know when you scheduled clock interrupts and can
advance the tick counter to represent the interval between clock
interrupts fairly accurately (say to within 10%). The fuzziness comes
mainly from not scheduling clock interrupts very often, so that for
example when something asks for a sleep of 1 tick now, it might take 100
times longer because there isn't a clock interrupt for 100 times longer.
You also should schedule clock interrupts just because something asks
for a short timeout.

> Legacy ticks-oriented callout_reset() functions are by default not
> supposed to provide sub-tick resolution and with some assumptions
> could use getbinuptime(). For new interfaces it depends on the
> caller how it will get the present time.

Even 1 tick is too short. Using binuptime() encourages asking for much
shorter intervals. Even for long sleeps, many places try to micro-sleep
for the residual time after waking up early. E.g., nanosleep(), select()
and poll(). These places can also ask for an initial sleep with a
resolution of nsec, usec or msec, respectively.
If the timeout code actually honors these requests, then it would
generate lots of clock interrupts and even more overheads by allowing
more timeouts to actually expire. OTOH, supporting nano-sleeps allows
nanosleep() to actually approach its name.

> I understand that integer tick counter is as fast as nothing else can
> ever be. But sorry, 32bit counter doesn't fit present goals. To have
> more we need

On the contrary, it becomes more adequate than with periodic ticks,
since you need to reduce the tick frequency, so 32 bits work for longer.

> some artificial atomicity -- exactly what getbinuptime() implements.

Why would you need any atomicity? Timeouts become fuzzier (because you
can't afford to generate clock interrupts to keep them as short as
possible, and want to generate even fewer clock interrupts than now).
Who cares if a non-atomic comparison results in more fuzziness. OTOH, if
a timeout actually expires, it would be good to maintain the invariant
that it never expires early, and some sort of clock that is known to
never run fast (relative to all earlier times measured on it) is needed
to ensure this, and some atomicity is also required for this. I think
most timeouts never expire (because most are for emergency conditions),
so checking the time accurately only when a timeout expires according to
a fuzzy clock may be efficient enough. The problem is the initial time
read for converting a relative time to an absolute expiry time --
unless that is accurate, the expiry time is fuzzy.

> What I would like to see there is tc_tick removal to make tc_windup()
> called for every hardclock tick.

That would break it. People can set HZ to 10kHz or more (I once tried
1MHz with lapic_timer, and it worked more or less correctly). Calling it
that often would make the timehands cycle too fast, and tc_tick is used
to prevent this. You could "fix" this by increasing the number of
timehands from 10 to 1000[0..] according to HZ.
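The pseudo-tick scheme suggested above might look something like the
following sketch. Everything here (the names, the 1 ms nominal tick,
the resync interval, the fake hardware clock standing in for
binuptime()) is hypothetical, not taken from any FreeBSD code:

```c
/*
 * Hypothetical pseudo-tick accounting: advance a tick counter by the
 * (estimated) length of each aperiodic clock-interrupt interval, and
 * resynchronize against an accurate clock only occasionally to bound
 * long-term drift.
 */
#include <stdint.h>

#define PSEUDO_TICK_US	1000	/* nominal tick: 1 ms (assumed) */
#define RESYNC_TICKS	4096	/* accurate read only this often */

static uint64_t pseudo_ticks;		/* fuzzy monotonic tick counter */
static uint64_t ticks_since_sync;
static uint64_t fake_hw_us;		/* stand-in "hardware" time */

/* Stand-in for an accurate but expensive binuptime()-style read. */
static uint64_t
real_time_us(void)
{
	return (fake_hw_us);
}

/*
 * Called from each (aperiodic) clock interrupt with the length of the
 * interval that was actually scheduled, accurate to maybe 10%.  The
 * counter advances by roughly the right amount with no timecounter
 * hardware read; the expensive accurate read is amortized over
 * RESYNC_TICKS pseudo-ticks.
 */
static void
pseudo_tick_advance(uint64_t interval_us)
{
	uint64_t n;

	n = (interval_us + PSEUDO_TICK_US / 2) / PSEUDO_TICK_US;
	if (n == 0)
		n = 1;		/* never let the counter stall */
	pseudo_ticks += n;
	ticks_since_sync += n;
	if (ticks_since_sync >= RESYNC_TICKS) {
		pseudo_ticks = real_time_us() / PSEUDO_TICK_US;
		ticks_since_sync = 0;
	}
}
```

With 2 ms intervals, 1000 interrupts advance the counter by exactly
2000 pseudo-ticks without a single accurate read; the fuzziness comes
entirely from how honestly interval_us reflects the scheduled interval.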
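As for the "artificial atomicity": what getbinuptime() relies on is a
generation count around the timehands, so readers never lock and simply
retry if an update raced with them. A simplified single-writer sketch of
that protocol follows; the names and layout are mine, the memory
ordering is simplified, and the real code is the timehands handling in
kern_tc.c.

```c
/*
 * Simplified seqlock-style generation protocol, loosely modeled on the
 * timehands in kern_tc.c.  One writer, lock-free readers.
 */
#include <stdatomic.h>
#include <stdint.h>

struct snap {
	_Atomic unsigned gen;	/* 0 while an update is in progress */
	uint64_t sec;
	uint64_t frac;
};

static struct snap th;

/* Writer side: a single updater, as in tc_windup(). */
static void
snap_update(uint64_t sec, uint64_t frac)
{
	unsigned ogen;

	ogen = atomic_load_explicit(&th.gen, memory_order_relaxed);
	/* Mark the snapshot invalid so racing readers retry. */
	atomic_store_explicit(&th.gen, 0, memory_order_release);
	atomic_thread_fence(memory_order_release);
	th.sec = sec;
	th.frac = frac;
	if (++ogen == 0)	/* generation 0 is reserved */
		ogen = 1;
	atomic_store_explicit(&th.gen, ogen, memory_order_release);
}

/* Reader side: retry until a stable, nonzero generation was seen. */
static void
snap_read(uint64_t *sec, uint64_t *frac)
{
	unsigned gen;

	do {
		gen = atomic_load_explicit(&th.gen, memory_order_acquire);
		*sec = th.sec;
		*frac = th.frac;
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th.gen, memory_order_relaxed));
}
```

The real code avoids even the retry loop becoming unbounded by keeping
several timehands and cycling through them, which is exactly why
cycling them too fast (huge HZ) is dangerous.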
> Having new tick-irrelevant callout interfaces we probably won't so
> much need to increase HZ too high any more, while this simplification
> would make ticks and getbinuptime() precision equal, solving some of
> your valid arguments against the last.

People increase HZ to do bogus polling. A periodic timer for this is at
least as efficient as possible. The periodic timer for this is hung off
hardclock() (hardclock() calls hardclock_device_poll() unconditionally
iff DEVICE_POLLING is configured). This avoids the overhead of re-arming
the periodic timer on every clock tick. It would be a good pessimization
to use the standard callout interface for this.

A tickless kernel should go the other way and not call tc_ticktock()
every clock tick. It should do the tc_tick counting itself (since it
uses virtual ticks, or at least ticks of a highly variable and fuzzy
length, and only it should have a vague idea about the tick lengths). It
needs to call tc_windup() often enough. Only the timecounter code really
knows how often this is, but this must have some interaction with the
tickless code since lots of ticks (clock interrupts) may be needed to
keep timecounters working. The main problem is that hardware
timecounters may wrap if you don't call them often enough. This problem
is largest for the i8254 timecounter when the clock interrupt source is
also the i8254. Then:
- clock interrupts must be scheduled at least as often as the i8254
  wraps (at least every 54.9ms, so HZ must be > 18 with periodic
  ticks). Else the time must be recovered from somewhere else and the
  timecounter reinitialized before any timecounter interface is used
  again (same as after resume, except you might have to do this every
  54.9ms)
- the timecounter must be read at least as often as its hardware wraps.
  This is normally accomplished by reading it on every clock interrupt.
  tc_tick must be 1 for this.
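The numbers behind those constraints, plus the low-byte trick mentioned
earlier, follow from the i8254's standard 1193182 Hz input clock. The
tc_tick_for() policy below is my approximation of the current "at most
~1000 tc_windup() calls per second" behaviour in kern_tc.c, not a copy
of it:

```c
/*
 * Arithmetic behind the i8254 constraints.  Assumes the standard
 * 1193182 Hz input clock; tc_tick_for() is a hypothetical model of
 * the tc_tick initialization policy.
 */
#define I8254_FREQ	1193182		/* Hz */

/* Full 16-bit count wraps every 65536 / 1193182 s =~ 54.9 ms,
 * hence HZ > 18 with periodic ticks. */
static double
i8254_wrap_ms(void)
{
	return (65536.0 / I8254_FREQ * 1e3);
}

/* The low 8 bits alone wrap every 256 / 1193182 s =~ 215 us, so
 * 8-bit-only reads need clock interrupts at >= ~4661 Hz. */
static double
i8254_low_byte_min_hz(void)
{
	return (I8254_FREQ / 256.0);
}

/* tc_windup() runs every tc_tick hardclock ticks. */
static int
tc_tick_for(int hz)
{
	return (hz > 1000 ? (hz + 500) / 1000 : 1);
}

/*
 * When the i8254 both drives hardclock and is the timecounter, it is
 * programmed to wrap once per hardclock period, so it must be read on
 * every tick: any tc_tick > 1 misses wraps.
 */
static int
i8254_tc_broken(int hz)
{
	return (tc_tick_for(hz) > 1);
}
```

Note that the same check generalizes to any timecounter: the windup
period tc_tick/hz must stay below the hardware wrap period.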
A bug in the current initialization of tc_tick is now clear: suppose
someone uses the i8254 timecounter and has HZ > 1000. Then tc_tick > 1
and the timecounter is broken. The i8254 can easily support much larger
HZ than this. I used to try to keep pcaudio maintained to test corner
cases in timers. pcaudio set the i8254 interrupt frequency to 62.5 kHz
(?), though it only called hardclock at a frequency of HZ (normally
100). The timecounter read in tc_windup() is now inadequate for avoiding
wrap. So the pcaudio low-level code virtualized the timecounter hardware
by updating counts on every i8254 interrupt. Only the low-level code
sees every interrupt. Other timecounter hardware that wraps too often
should probably be handled similarly, but clock interrupts (possibly
from another source) would be needed to keep it going. I don't know of
any other x86 hardware timecounters that wrap too fast. The 32 bits of
the TSC that are used wrap after ~1 second at 4GHz.

Apart from the problems with the i8254, there are few or no reasons to
call tc_windup() very often. Every 1 second is probably enough. I always
use HZ=100 and every 1/100 second is enough. The default is at most
every 1/1000 second. This seems to be mainly to keep the i8254 working
up to the default HZ although it is broken above that. ntpd only updates
things every 64 seconds or so. By calling tc_windup() very often, you
mainly see each of these updates take effect after only 1-10 msec. The
most interesting case is after a leap second is inserted. Now it is good
to see the leap second immediately, and even a delay of 1 msec may be
too long.

Bruce