From owner-freebsd-arch@FreeBSD.ORG Thu Mar 1 00:33:52 2012 Return-Path: Delivered-To: arch@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 06A10106564A for ; Thu, 1 Mar 2012 00:33:52 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail06.syd.optusnet.com.au (mail06.syd.optusnet.com.au [211.29.132.187]) by mx1.freebsd.org (Postfix) with ESMTP id 36A208FC17 for ; Thu, 1 Mar 2012 00:33:50 +0000 (UTC) Received: from c211-30-171-136.carlnfd1.nsw.optusnet.com.au (c211-30-171-136.carlnfd1.nsw.optusnet.com.au [211.30.171.136]) by mail06.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id q210Xkbe009834 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Thu, 1 Mar 2012 11:33:48 +1100 Date: Thu, 1 Mar 2012 11:33:46 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Luigi Rizzo In-Reply-To: <20120229194042.GA10921@onelab2.iet.unipi.it> Message-ID: <20120301071145.O879@besplex.bde.org> References: <20120229194042.GA10921@onelab2.iet.unipi.it> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: arch@FreeBSD.org Subject: Re: select/poll/usleep precision on FreeBSD vs Linux vs OSX X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 01 Mar 2012 00:33:52 -0000 On Wed, 29 Feb 2012, Luigi Rizzo wrote: > I have always been annoyed by the fact that FreeBSD rounds timeouts > in select/usleep/poll in very conservative ways, so i decided to > try how other systems behave in this respect. Attached is a simple > program that you should be able to compile and run on various OS > and see what happens. Many are broken, indeed. The simple program isn't attached. > Here are the results (HZ=1000 on the system under test, and FreeBSD > has the same behaviour since at least 4.11): > > | Actual timeout > | select | poll | usleep| > timeout | FBSD | Linux | OSX | FBSD | FBSD | > usec | 9.0 | Vbox | 10.6 | 9.0 | 9.0 | > --------+-------+-------+--------+-------+-------+ > 1 2000 99 6 0 2000 Try HZ = 20 (possible, at the user's option, even with an i8254 timer) or lower (possible, at the user's option, with better timers). FreeBSD should then get timeouts of up to 2/HZ = 100000 us. Applications must deal with this range somehow (maybe by telling the user to configure HZ better). In ttcp, I found the timeouts unusable and resorted to an option to busy-wait. The rate-limiting timeouts in tools/netrate don't work at all for small HZ, and barely work for large HZ, since timeouts are similarly unusable. It is possible to easily improve this to a maximum of only 1/HZ = 50000 us, at some cost to efficiency, by waking up 1 tick early, checking if the timeout has expired, and sleeping for another tick if it hasn't. Waking up early is needed anyway for long timeouts, in case the timeouts interrupts are running a little slower than their estimated frequency -- being off by just 1 part per million will accumulate to an error of 86400 us after 1 day, and FreeBSD cluser machines used to be off by about 10%, giving an error of 2.4 hours for your appointment next day. The error will often be 100's of parts per million, giving an error of 10's of seconds per day. To handle the 10% error, timeouts must wake up 10% early. select(), poll(), and nanosleep() all check that the timeout expires when they wake up, but they don't set it to be wake up early, and their check for whether it has expired is even fuzzier than the timeout granularity (it uses the broken-as-designed getnanouptime() API to get the current time; timecounters are only updated every $(sysctl kern.timecounter.tick) ticks, by default to limit their update frequency to about 1000 Hz when HZ is configured to be much larger than 1000. This ensures an extra, unnecessary innaccuracy of up to 1000 us whenever the timeout wakes up a little early according the the retarded clock used to measure the current time. I think the error is fail-safe -- it may extend the timeout by as much as 1000 usec. > 10 2000 109 15 0 2000 > 50 2000 149 66 0 2000 > 100 2000 196 133 0 2000 > 500 2000 597 617 0 2000 You must have synced with timer interrupts to get the above. Timeouts in the current FreeBSD implementation should average the actual timeout rounded up to a multiple of 1/HZ seconds, plus 0.5/HZ seconds, and thus average 1.5/HZ = 1500 us for short timeouts. Someone apparently broke poll() on FreeBSD :-(. Linux and OSX must be using busy-waiting or expensive timer reprogramming for short timeouts to work. Linux-2.6.10 has 3491 references to udelay(). This seems to correspond to FreeBSD' DELAY(). The linux nanosleep() code is too complicated for me to easily see what it is doing for short timeouts, but I noticed that it isn't missing clock_nanosleep() like FreeBSD does, and presumably has a collaterally non-broken nanosleep(). (POSIX requires nanosleep() to sleep in real time, to be bug for bug compatible with old sleep(), but this is often not what is wanted. So POSIX invented clock_nanosleep() so as to be able to sleep on the monotonic clock and on any other clock of interest. FreeBSD doesn't know anything about this, and only has nanosleep(), which sleeps on a wrong clock (the monotonic one). > 1000 2000 1103 1136 2000 2000 > 1001 3000 1103 1136 2000 3000 <--- > 1500 3000 1608 1631 2000 3000 <--- > 2000 3000 2096 2127 3000 3000 > 2001 4000 3000 4000 <--- > 3001 5000 4000 5000 <--- > > Note how the rounding (poll has the timeout in milliseconds) affects > the actual timeouts when you are past multiples of 1/HZ. Also, timeouts that are just before a multiple of 1/HZ may be turned into 1/HZ + 1000 usec by the inacurracy of getnanouptime(). E.g., if the requested timeout is 999, HZ is 1000, and tc_tick is 1, 999 should be turned into 2 ticks (average 1500 usec). Then, if the timeout for the second tick is a little early according to the retarded clock, the total timeout will be extended by another tick, to 3 ticks. So all timeouts may be extended by up to 2 ticks, instead of only ones just larger than a multiple of 1/HZ being extended by the full 2 ticks. This can be made much worse by setting tc_tick to a large value. > I know that until we have some hi-res interrupt source there is no > hope to have better than 1/HZ granularity. However we are doing > much worse by adding up to 2 extra ticks. This makes apps less > responsive than they could be, and gives us no way to > "yield until the next tick". > > So what I would like to do is add a sysctl (disabled by > default) that enables a better approximation of the desired delay. It is possible to get a timeout every tick in userland using periodic itimer's. Maybe not for the initial timeout, but after that the timeouts repeat with the specified period (rounded up to the next tick boundary, but not up to the next + 1). > I see in the kernel that all three syscalls loop around a blocking > function (tsleep or seltdwait), and do check the "actual" elapsed > time by calling getmicrouptime() or getnanouptime() around the > sleeping function . So the actual timeout passed to tsleep does > not really matter (as long as it is greater than 0 ). > > The only concern is that getmicrouptime()/getnanouptime() are documented > as "less precise, but faster to obtain". The question is how precise is > "less precise": do we have some way to get an upper bound for the It is tc_tick/HZ seconds (usually 1/HZ, but if you set HZ to be > 1000 to reduce this problem, then tc_tick will bite you unless you change it down. Both may be acceptable on a real-timeish system that wants short timeouts at the cost of efficiency. But I don't like timeouts. When they are used a lot, they are a form of busy waiting. This is only acceptable if you have CPU to burn). > precision of the timers used in get*time(), so we can use that value > in the equation instead of the extra 1/HZ that tvtohz() puts in > after computing floor(timeout*HZ) ? It's always worse than 1/HZ if you use these broken-as-designed APIs. > For reference, below is the core of usleep and select/poll > (from kern_time.c and sys_generic.c) > > usleep: > getnanouptime(now) Should use nanotime(). This also fixes the clock id. But check whether this is really required by POSIX. It means that if someone steps the clock, then the sleep may be extended or truncated significantly. You also need to fix the timeout used here to sleep in real time, so that it doesn't wake up late by an amount of (negative of the step). Timeouts also sleep in monotonic time. Sleeping in monotonic time seems to be wrong in more cases than it is correct. Another broken area is suspend/resume. Suspending for an hour stops all timeouts by an hour. Then on resume, they aren't adusted, so they occur an hour later. There is still code in kern_timeout.c under APM_FIXUP_CALLTODO which was supposed to fix this problem for apm, but it is so poorly maintained that it never even compiled in any committed version, and there is no option for it. Problems in this area were supposed to have been fixed by using monotonic time more, but they seem to have actually been increased. > end = now + timeout; This is quite broken too: - it doesn't arrange to wake up early. Sutract 1 tick here for a quick fix for short timeouts. Maybe 10% for timeouts longer than 10 ticks. - it can overflow, giving undefined behaviour (in practice, just bad results later) - itimerfix() is supposed to be used to prevent such overflows, but this is quite broken: - note that itimerfix() is quite different from timevalfix(). Although its name is spelled with an 'i' and a 'fix', itimerfix() is useful generally and doesn't fix anything except for bogusly adjusting fractional ticks. It is also confusing because its name is spelled without a `val', since it was intended to be used only for itimers which used to use generic times (timevals) back before better representations of times existed. What it mosly does is validity checking for timevals. OTOH, timevalfix() does pure fixing (to handle carry after (possibly multiple) additions and subtractions). - timevalfix() used to limit tv_sec to 100 million seconds (else EINVAL), but someone broken it by removing this check - someone didn't update the man pages which document this limit. It is still documented in at least: - setitimer(2). This is its primary API - alarm(3) - ualarm(3). This also bogusly documents what a microsecond is Grepping for 100000000 and 100.*million also shows many bad descriptions of the limit on tv_nsec for APIs that take a timespec arg. This limit is described verbosely as "1000 million". Of course, "1 billion" cannot be used since it is ambiguous, and 1000000000 should not be used because it is hard to see the number of zeros in it, but millions aren't naturally associated with the nano prefix. I like to write such numbers in minimal floating point scientific notation (e.g., 1e9) or as powers of 10 (e.g., 10**9). 1e9 is better because it is shorter and doesn't need an ambiguous '^' operator or a less common Fortran '**' operator for exponentiation. Not sure if this is best for man pages. select() and kqueue() used to have the same limit, but it was never documented for them. Not sure about kqueue. - this limit isn't permitted by POSIX. - here we have timespecs. itimespecfix() exists too (although timespecfix() doesn't). itimespecfix() has the same semantics as itimerfix() (except of course it obviously acts on timespecs while itimerfix() unobviously acts on timevals). itimespecfix() never had the limit on tv_sec. But timespecfix() has the same bogus rounding up of fractional ticks as timerfix(). This is only done for fractional ticks below 1. This should be unnecessary provided everywhere else is careful to round up and usually to add 1. timespec*fix() never did the adding 1 part. I think they are just defending against sloppy conversions that produce 0 ticks from small but nonzero timeouts. A timeout of 0 ticks means to sleep forever. But relevant higher levels also defend against this, by silently changing 0 ticks to 1 tick. - back to bugs at the level of nanosleep(). It can't use itimerfix() since it deals with timespecs. It should call something like itimespecfix() (except that should be named timespeccheck()...). But IIRC, itimespecfix() didn't exist when nanosleep() was implemented. Also, itimespecfix() has wrong semantics for use here.. Its bogus rounding up is exactly what you don't want. I think it has other slightly mismatched semantics (perhaps a difference in error numbers), but the others are easy to fix up. So nanosleep() rolled its own checking, and got it wrong. The overflow seems easy to fix as a side affect of waking up early: - for preposterosterously long timeouts, wake up after 100 million seconds or similar, instead of 10% early. This delays the problem. 100 million seconds is a little over 3 years, so it won't expire in practice, and no one would care if it did. But this method (and the old limit) breaks down about 3 years before the time_t's roll over. So in 2036, you need to limit the timeout to only 2 years instead of 3, if you are still using 31-bit (sic) time_t's with a useless sign bit then. And in 2105, you need to limit the timeout to only 2 years instead of 3, if you are still using 32-bit unsigned time_t's then. Be careful with overflow even with this fix. Applications probing for kernel bugs will try using maximal tv_sec. Since POSIX doesn't allow rejecting these like the old 100 million second limit did, we must start with a long sleep and retain the original preposterous timeout so that we can return it as the unslept time. We have to be careful about overflow when adding the preposterous time to the current time. Large time_t's don't do anything to limit this overflow, since the appllication can ask for the maximum (2**31-1 or 2**63-1 in practice), and since the current time is surely >= 1, adding the current time to the preposterous time surely gives overflow. Example of a not-unreasonable POSIX application to probe for bugs in this area: set tv.tv_sec and tv.tv_nsec to the maximum possible ("infinity") arrange for a signal after 1 second nanosleep(&tv, &tv2); check that tv2 is about 1 second below the maximum possible > for (;;) { > getnanouptime(now); Our original `now' and thus `end' are retarded by up to tc_tick/HZ. This `now' is retarded too. This complicates the analysis and changes its results. > delta = end - now; So `delta' might not even be retarded. If `end' is normal but `now' is retarded, then `delta' is advanced. This case is fail-safe but not what you want (sleep again). If `end' is retarded but `now' is normal, then the retardation in `delta' is maximal (still less than tc_tick/HZ). This case is fail-unsafe (return up to tc_tick/HZ early). Other cases are in between these, with the retardations partially or completely cancelling. By making tc_tick large, the fail-unsafe case can be made to more than overcome the safety margin of about 1 1/HZ given by always adding 1. I only just noticed this detail. > if (delta <= 0) > break; > tsleep(..., tvtohz(delta) ) > } > > select/poll: > itimerfix(timeout) // force at least 1/HZ That's "bogusly force". It doesn't add 1 or do anything if the timeout is above 1/HZ, but both are done later. > getmicrouptime(now) > end = now + timeout; Same retardation and overflow bugs. > for (;;) { Missing showing getmicrouptime(now) here? > delta = end - now; > seltdwait(..., tvtohz(delta) ) tvtohz() does round up and add 1. Its interaction with the above is unclear. I think there is double rounding up in some cases. For example, even without the retardation, a delta that wants to be precisely 1 tick (much more than it should be due to the first rounding up) may be 1 us over 1 tick due to minor inaccuracies. Then tvtohz() will round up again. nanosleep() has the same problem. To handle this, we may need to subtract 1 more from the result of tvtohz(): - always subtract 10% - subtract 1 to compensate for tvtohz() always adding 1. See the periodic itimer code for this fixup. Periodic itimers need to be more careful for long timeouts too. - subtract 1 more in case delta is a little too large. Better look at delta and not always do this. - if the resulting timeout is <= 0, change it to 1. > getmicrouptime(now); > if (some_fd_is_ready() || now >= end) > break; > } Activity on the fd's is likely to give more wakeups than nanosleep() gets, since the latter only gets woken up for its own timeout and signals. For this and other reasons, large timeouts are probably smaller for select() than for nanosleep(). And for poll(), you just can't ask for a large timeout (the limit is (2**31-1) milliseconds ~= 24.8 days with 32-bit ints, as is the case on all supported arches). It remains to explain why the above results show that poll() but not select() is broken for small timeouts (they are turned into 0 us for poll() and 2000 us for select()). Well, the granularity for poll is 1 ms, so this looks like just an application bug, with timeouts of < 1 ms being rounded down to 0 before the kernel sees them. But how do the other OS's see it? This might be due to them taking a long time to handle null timeouts, and their times actually being reported correctly. I don't believe the times of 0 and 2000 us reported for FreeBSD. You can't do anything in 0 us, and 2000 us is too round a number. These round numbers might be due to using the broken as designed CLOCK_MONOTONIC_FAST_N_BROKEN clock ids. These are collateral with getnanouptime() etc. They are even more broken as designed, since provided you have non-slow timecounter hardware, the time for clock_gettime() is dominated by syscall overhead, so CLOCK_MONOTONIC_FAST_N_BROKEN is only a few percent faster than CLOCK_MONOTIC. Typical numbers are: - 12 (9?) cycles for the hardware part of a TSC timecounter on an old Athlon64. ~250 cycles for the total syscall overhead for an old version of FreeBSD UP on almost any x86. Possible savings from the "fast" method: about 5%. More like 10% due to extra software overhead for the timecounter. - 42 (?) cycles for the hardware part of a TSC timecounter on Phenom+ (synchronization across CPUs makes it much slower). Similarly for most modern multi-core CPUs. Intel CPUs were much slower than 12 cycles even for old single-core ones. ~350 cycles for the total syscall overhead for a current version of FreeBSD SMP on almost any x86. Possible savings from the "fast" method: about 15%. IIRC, SMP only costs 20-30 cycles, with the extra 100 being mainly from extra layers. - 1000-2000 nsec (up to ~8000 cycles) for an ACPI-FAST timecounter. These are actually ACPI-SLOW. Now the hardware overhead dominates. HPET is better, but has become common at the same time as the TSC became usable for SMP, so it is rarely useful as a timecounter. - up to 5000 nsec for an i8254 timecounter on a modern CPU. Getting slow with more briges between the CPU and the ISA bus. - up to 30000 nsec for an i8254 timecounter on a 486. OTOH, getnanouptime() takes about 12 cycles (very fuzzy estimate), so it is 2-3 times faster than nanouptime() using the fastest TSC hardware, and about 5-6 times faster than nanouptime() using slower TSC hardware. IIRC, itimer code doesn't do these checks of the time after wakeups at all. Not sure what kqueue does. I haven't really touched nanosleep(), but have some small fixes near the tvtohz() call for select() and poll(). % Index: sys_generic.c % =================================================================== % RCS file: /home/ncvs/src/sys/kern/sys_generic.c,v % retrieving revision 1.131 % diff -u -2 -r1.131 sys_generic.c % --- sys_generic.c 5 Apr 2004 21:03:35 -0000 1.131 % +++ sys_generic.c 13 Aug 2009 11:21:29 -0000 % @@ -806,9 +797,5 @@ % getmicrouptime(&rtv); % timevaladd(&atv, &rtv); % - } else { % - atv.tv_sec = 0; % - atv.tv_usec = 0; % } % - timo = 0; % TAILQ_INIT(&td->td_selq); % mtx_lock(&sellock); % @@ -824,5 +811,7 @@ % if (error || td->td_retval[0]) % goto done; % - if (atv.tv_sec || atv.tv_usec) { % + if (tvp == NULL) % + timo = 0; % + else { % getmicrouptime(&rtv); % if (timevalcmp(&rtv, &atv, >=)) Unrelated cleanups of initialization. % @@ -830,13 +819,10 @@ % ttv = atv; % timevalsub(&ttv, &rtv); % - timo = ttv.tv_sec > 24 * 60 * 60 ? % - 24 * 60 * 60 * hz : tvtohz(&ttv); % + timo = tvtohz(&ttv); The special case for timeouts of > 1 day defeats the careful overflow handling in tvtohz(). It is supposed to be for avoiding overflow, but tvtohz() avoids it already, while the above causes it whenever hz is large but not preposterously so, so that 24 * 60 * 60 * hz overflows. hz only needs to be 24586 for overflow. 25000 is almost reasonable for excessive polling, and I have tested lapic timer interrupts though not hz at 1MHz. However, reduction of the timeout to a value that will wake up 10% early is the first step of fixing the bugs discussed above. Reduction to 1 day accomplishes this for timeouts of >= 1.1 days provided it doesn't overflow. % } % % /* % - * An event of interest may occur while we do not hold % - * sellock, so check TDF_SELECT and the number of % - * collisions and rescan the file descriptors if % - * necessary. % + * An event of interest may have occurred while we did not hold % + * sellock. Check for this and rescan if necessary. % */ % mtx_lock_spin(&sched_lock); Unrelated. % @@ -985,4 +978,5 @@ % if (uap->timeout != INFTIM) { % atv.tv_sec = uap->timeout / 1000; % + /* XXX wrong if timeout < 0. */ % atv.tv_usec = (uap->timeout % 1000) * 1000; Since the '%' operator is broken for negative values in C, this gives a negative tv_usec when the timeout is negative. % if (itimerfix(&atv)) { itimerfix() then returns EINVAL, and the syscall fails. But a timeout of < 0 should be equivalent to a timeout of 0, as it is for select() and nanosleep(). This can be implemented either by fixing C, or by fixing the espression, or by just changing negative timeouts to 0. % @@ -992,9 +986,5 @@ % getmicrouptime(&rtv); % timevaladd(&atv, &rtv); % - } else { % - atv.tv_sec = 0; % - atv.tv_usec = 0; % } % - timo = 0; % TAILQ_INIT(&td->td_selq); % mtx_lock(&sellock); % @@ -1006,9 +996,11 @@ % mtx_unlock(&sellock); % % - error = pollscan(td, (struct pollfd *)bits, nfds); % + error = pollscan(td, bits, nfds); % mtx_lock(&sellock); % if (error || td->td_retval[0]) % goto done; % - if (atv.tv_sec || atv.tv_usec) { % + if (uap->timeout == INFTIM) % + timo = 0; % + else { % getmicrouptime(&rtv); % if (timevalcmp(&rtv, &atv, >=)) Unrelated cleanups. % @@ -1016,12 +1008,8 @@ % ttv = atv; % timevalsub(&ttv, &rtv); % - timo = ttv.tv_sec > 24 * 60 * 60 ? % - 24 * 60 * 60 * hz : tvtohz(&ttv); % + timo = tvtohz(&ttv); As for select(). % } % - /* % - * An event of interest may occur while we do not hold % - * sellock, so check TDF_SELECT and the number of collisions % - * and rescan the file descriptors if necessary. % - */ % + % + /* Rescan if necessary, as above. */ Don't repeat comments ad nauseum. There used to be large grammar errors in these comments. -current may have cleaned them up differently. % mtx_lock_spin(&sched_lock); % if ((td->td_flags & TDF_SELECT) == 0 || nselcoll != ncoll) { Bruce