From owner-freebsd-stable@FreeBSD.ORG Sun Sep 6 15:46:04 2009 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 0E146106566B for ; Sun, 6 Sep 2009 15:46:04 +0000 (UTC) (envelope-from luigi@onelab2.iet.unipi.it) Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238]) by mx1.freebsd.org (Postfix) with ESMTP id A90368FC0C for ; Sun, 6 Sep 2009 15:46:03 +0000 (UTC) Received: by onelab2.iet.unipi.it (Postfix, from userid 275) id A824B730DA; Sun, 6 Sep 2009 17:51:54 +0200 (CEST) Date: Sun, 6 Sep 2009 17:51:54 +0200 From: Luigi Rizzo To: stable@freebsd.org Message-ID: <20090906155154.GA8283@onelab2.iet.unipi.it> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.4.2.3i Cc: Subject: incorrect usleep/select delays with HZ > 2500 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Sun, 06 Sep 2009 15:46:04 -0000 (this problem seems to affect both current and -stable, so let's see if here i have better luck) I just noticed [Note 1,2] that when setting HZ > 2500 (even if it is an exact divisor of the APIC/CPU clock) there is a significant drift between the delays generated by usleep()/select() and those computed by gettimeofday(). In other words, the error grows with the amount of delay requested. To show the problem, try this function int f(int wait_time) { // wait_time in usec struct timeval start, end; gettimeofday(&start); usleep(w); // or try select gettimeofday(&end) timersub(&end, &start, &x); return = x.tv_usec + 1000000*x.tv_sec - wait_time; } for various HZ (kern.hz=NNNN in /boot/loader.conf) and wait times. Ideally, we would expect the timings to be in error by something between 0 and 1 (or 2) ticks, irrespective of the value of wait_time. In fact, this is what you see with HZ=1000, 2000 and 2500. But larger values of HZ (e.g. 4000, 5000, 10k, 40k) create a drift of 0.5% and above (i.e. with HZ=5000, a 1-second delay lasts 1.0064s and a 10s delay lasts 10.062s; with HZ=10k the error becomes 1% and at HZ=40k the error becomes even bigger. Note that with the fixes described below, even HZ=40k works perfectly well. Turns out that the error has three components (described with possible fixes): 1. CAUSE: Errors in the measurement of the TSC (and APIC) frequencies, see [Note 3] for more details. This is responsible for the drift. FIX: It can be removed by rounding the measurement to the closest nominal values (e.g. my APIC runs at 100 MHz; we can use a table of supported values). Otherwise, see [Note 4] PROBLEM: is this general enough ? 2. CAUSE: Use of approximate kernel time functions (getnanotime/getmicrotime) in nanosleep() and select(). This imposes an error of max(1tick, 1ms) in the computation of delays, irrespective of HZ values. BTW For reasons I don't understand this seems to affect nanosleep() more than select(). FIX: It can be reduced to just 1 tick making kern.timecounter.tick writable and letting the user set it to 1 if high precision is required. PROBLEM: none that i see. 3. CAUSE an error in tvtohz(), reported long ago in http://www.dragonflybsd.org/presentations/nanosleep/ which causes a systematic error of an extra tick in the computation of the sleep times. FIX: the above link also contains a proposed fix (which in fact reverts a bug introduced in some old commit on FreeBSD) PROBLEM: none that i see. Applying these three fixes i was able to run a kernel with HZ=40000 and see timing errors within 80-90us even with ten second delays. This would put us on par with Linux [Note 5]. This is a significant improvement over the current situation and the reason why I would like to explore the possibility of applying some of these fixes. I know there are open problems -- e.g. when the timer source used by gettimeofday() gets corrected by ntp or other things, hardclock() still ticks at the same rate so you'll see a drift if you don't apply corrections there as well. Similarly, if HZ is not an exact divisor of the clock source used by gettimeofday(), i suppose errors will accumulate as well. However fixing these other drift seems reasonably orthogonal at least to #2 and #3 above, and a lot more difficult so we could at least start from these simple fixes. Would anyone be interested in reproducing the experiment (test program attached -- run it with 'lat -p -i wait_time_in_microseconds') and try to explain me what changes the system's behaviour above HZ=2500 ? cheers luigi Notes: [Note 1] I have some interest in running machines with high HZ values because this gives better precision to dummynet and various other tasks with soft timing constraints. [Note 2] I have seen the same phenomenon on the following platform: RELENG_8/amd64 with AMD BE-2400 dual core cpu RELENG_7/i386 with AMD BE-2400 dual core cpu RELENG_7/i386 with Intel Centrino single core (Dell X1 Laptop) [Note 3] the TSC frequency is computed reading the tsc around a call to DELAY(1000000) and assuming that the i8254 runs at the nominal rate, 1.193182 MHz. From tests I have made, the measurement in init_TSC() returns a large error when HZ is large, whereas repeating the measurement at a later time returns a much more reliable value. As an example, see the following: Sep 6 14:21:59 lr kernel: TSC clock: at init_TSC 2323045616 Hz Sep 6 14:21:59 lr kernel: Features=0x178bfbff Sep 6 14:21:59 lr kernel: AMD Features=0xea500800 Sep 6 14:21:59 lr kernel: TSC: P-state invariant Sep 6 14:21:59 lr kernel: TSC clock: at cpu_startup_end 2323056910 Hz Sep 6 14:21:59 lr kernel: TSC clock: at acpi_timer_probe 2311254060 Hz Sep 6 14:21:59 lr kernel: TSC clock: at acpi_timer_probe_2 2311191310 Hz Sep 6 14:21:59 lr kernel: TSC clock: at pn_probe_start 2300822648 Hz Sep 6 14:21:59 lr kernel: TSC clock: at pn_attach_start 2300830946 Hz Sep 6 14:21:59 lr kernel: TSC clock: at pn_probe_start 2300840133 Hz Sep 6 14:21:59 lr kernel: TSC clock: at pn_attach_start 2300835253 Hz Sep 6 14:21:59 lr kernel: TSC clock: at lapic_setup_clock 2300868376 Hz The latter values are close to what is reported when HZ=1000. [Note 4] DELAY() uses the TSC when available, but perhaps for larger intervals (say above 100ms) it might be better to always use the 8254 which at least does not change frequency over time ? [Note 5] Linux has some high precision timers which one could expect to use for high precision delays. In fact, I ran the test program on a recent Linux 2.6.30 (using HZ=1000), and the usleep() version has an error between 80 and 90us irrespective of the delay. However the select() version is much worse, and loses approx 1ms per second. ------------------------------------------------------