From owner-freebsd-stable@FreeBSD.ORG Mon Sep 7 01:08:41 2009 Return-Path: Delivered-To: stable@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id EB32C1065676 for ; Mon, 7 Sep 2009 01:08:41 +0000 (UTC) (envelope-from peter@wemm.org) Received: from qw-out-2122.google.com (qw-out-2122.google.com [74.125.92.26]) by mx1.freebsd.org (Postfix) with ESMTP id ACF4D8FC13 for ; Mon, 7 Sep 2009 01:08:41 +0000 (UTC) Received: by qw-out-2122.google.com with SMTP id 3so547030qwe.7 for ; Sun, 06 Sep 2009 18:08:41 -0700 (PDT) MIME-Version: 1.0 Received: by 10.224.121.132 with SMTP id h4mr8772547qar.255.1252283789392; Sun, 06 Sep 2009 17:36:29 -0700 (PDT) In-Reply-To: <20090906155154.GA8283@onelab2.iet.unipi.it> References: <20090906155154.GA8283@onelab2.iet.unipi.it> Date: Sun, 6 Sep 2009 17:36:29 -0700 Message-ID: From: Peter Wemm To: Luigi Rizzo Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Cc: stable@freebsd.org Subject: Re: incorrect usleep/select delays with HZ > 2500 X-BeenThere: freebsd-stable@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Production branch of FreeBSD source code List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 07 Sep 2009 01:08:42 -0000 On Sun, Sep 6, 2009 at 8:51 AM, Luigi Rizzo wrote: > (this problem seems to affect both current and -stable, > so let's see if here i have better luck) > > I just noticed [Note 1,2] that when setting HZ > 2500 (even if it is > an exact divisor of the APIC/CPU clock) there is a significant > drift between the delays generated by usleep()/select() and those > computed by gettimeofday(). =A0In other words, the error grows with > the amount of delay requested. > > To show the problem, try this function > > =A0 =A0 =A0 =A0int f(int wait_time) { =A0// wait_time in usec > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0struct timeval start, end; > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0gettimeofday(&start); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0usleep(w); =A0 =A0 =A0// or try select > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0gettimeofday(&end) > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0timersub(&end, &start, &x); > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0return =3D x.tv_usec + 1000000*x.tv_sec - = wait_time; > =A0 =A0 =A0 =A0} > > for various HZ (kern.hz=3DNNNN in /boot/loader.conf) and wait times. > Ideally, we would expect the timings to be in error by something > between 0 and 1 (or 2) ticks, irrespective of the value of wait_time. > In fact, this is what you see with HZ=3D1000, 2000 and 2500. > But larger values of HZ (e.g. 4000, 5000, 10k, 40k) create > a drift of 0.5% and above (i.e. with HZ=3D5000, a 1-second delay > lasts 1.0064s and a 10s delay lasts 10.062s; with HZ=3D10k the > error becomes 1% and at HZ=3D40k the error becomes even bigger. Technically, it isn't even an error because the sleeps are defined as 'at least' the value specified. If you're looking for real-time-OS level performance, you probably need to look at one. > Note that with the fixes described below, even HZ=3D40k works perfectly w= ell. > > Turns out that the error has three components (described with > possible fixes): > > 1. =A0CAUSE: Errors in the measurement of the TSC (and APIC) frequencies, > =A0 =A0 =A0 =A0see [Note 3] for more details. This is responsible for the= drift. > =A0 =A0FIX: It can be removed by rounding the measurement to the closest > =A0 =A0 =A0 =A0nominal values (e.g. my APIC runs at 100 MHz; we can use a > =A0 =A0 =A0 =A0table of supported values). Otherwise, see [Note 4] > =A0 =A0PROBLEM: is this general enough ? > > 2. =A0CAUSE: Use of approximate kernel time functions (getnanotime/getmic= rotime) > =A0 =A0 =A0 =A0in nanosleep() and select(). This imposes an error of max(= 1tick, 1ms) > =A0 =A0 =A0 =A0in the computation of delays, irrespective of HZ values. > =A0 =A0 =A0 =A0BTW For reasons I don't understand this seems to affect > =A0 =A0 =A0 =A0nanosleep() more than select(). > =A0 =A0FIX: It can be reduced to just 1 tick making kern.timecounter.tick= writable > =A0 =A0 =A0 =A0and letting the user set it to 1 if high precision is requ= ired. > =A0 =A0PROBLEM: none that i see. > > 3. =A0CAUSE an error in tvtohz(), reported long ago in > =A0 =A0 =A0 =A0http://www.dragonflybsd.org/presentations/nanosleep/ > =A0 =A0 =A0 =A0which causes a systematic error of an extra tick in the co= mputation > =A0 =A0 =A0 =A0of the sleep times. > =A0 =A0FIX: the above link also contains a proposed fix (which in fact > =A0 =A0 =A0 =A0reverts a bug introduced in some old commit on FreeBSD) > =A0 =A0PROBLEM: none that i see. This change, as-is, is extremely dangerous. tsleep/msleep() use a value of 0 meaning 'forever'. Changing tvtohz() so that it can now return 0 for a non-zero tv is setting land-mines all over the place. There's something like 27 callers of tvtohz() in sys/kern alone, some of which are used to supply tsleep/msleep() timeouts. Note that the dragonflybsd patch above only adds the 'returns 0' check to one single caller. You either need to patch all callers of tvtohz() since you've change the semantics, or add a 'if (ticks =3D=3D 0) ticks =3D 1' check (or checks) in the appropriate places inside tvtohz(). If you don't do it, then you end up with callers of random functions with very small timeouts instead finding themselves sleeping forever. > Applying these three fixes i was able to run a kernel with HZ=3D40000 > and see timing errors within 80-90us even with ten second delays. > This would put us on par with Linux [Note 5]. > This is a significant improvement over the current situation > and the reason why I would like to explore the possibility of applying > some of these fixes. > > I know there are open problems -- e.g. when the timer source used > by gettimeofday() gets corrected by ntp or other things, hardclock() > still ticks at the same rate so you'll see a drift if you don't apply > corrections there as well. Similarly, if HZ is not an exact > divisor of the clock source used by gettimeofday(), i suppose > errors will accumulate as well. However fixing these other > drift seems reasonably orthogonal at least to #2 and #3 above, and > a lot more difficult so we could at least start from these simple > fixes. > > > Would anyone be interested in reproducing the experiment (test program > attached -- run it with 'lat -p -i wait_time_in_microseconds') > and try to explain me what changes the system's behaviour above HZ=3D2500= ? > > =A0 =A0 =A0 =A0cheers > =A0 =A0 =A0 =A0luigi > > Notes: > > [Note 1] I have some interest in running machines with high HZ values > =A0 =A0because this gives better precision to dummynet and various > =A0 =A0other tasks with soft timing constraints. > > [Note 2] I have seen the same phenomenon on the following platform: > =A0 =A0 =A0 =A0RELENG_8/amd64 with AMD BE-2400 dual core cpu > =A0 =A0 =A0 =A0RELENG_7/i386 with AMD BE-2400 dual core cpu > =A0 =A0 =A0 =A0RELENG_7/i386 with Intel Centrino single core (Dell X1 Lap= top) > > > [Note 3] the TSC frequency is computed reading the tsc around a > =A0 =A0 =A0 =A0call to DELAY(1000000) and assuming that the i8254 runs > =A0 =A0 =A0 =A0at the nominal rate, 1.193182 MHz. > =A0 =A0 =A0 =A0From tests I have made, the measurement in init_TSC() retu= rns > =A0 =A0 =A0 =A0a large error when HZ is large, whereas repeating the meas= urement > =A0 =A0 =A0 =A0at a later time returns a much more reliable value. > =A0 =A0 =A0 =A0As an example, see the following: > > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at init_TSC 2323045616 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: Features=3D0x178bfbff > =A0 =A0Sep =A06 14:21:59 lr kernel: AMD Features=3D0xea500800 > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC: P-state invariant > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at cpu_startup_end 2323056= 910 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at acpi_timer_probe 231125= 4060 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at acpi_timer_probe_2 2311= 191310 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at pn_probe_start 23008226= 48 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at pn_attach_start 2300830= 946 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at pn_probe_start 23008401= 33 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at pn_attach_start 2300835= 253 Hz > =A0 =A0Sep =A06 14:21:59 lr kernel: TSC clock: at lapic_setup_clock 23008= 68376 Hz > > =A0 =A0 =A0 =A0The latter values are close to what is reported when HZ=3D= 1000. > > [Note 4] DELAY() uses the TSC when available, but perhaps for > =A0 =A0 =A0 =A0larger intervals (say above 100ms) it might be better > =A0 =A0 =A0 =A0to always use the 8254 which at least does not change freq= uency > =A0 =A0 =A0 =A0over time ? > > [Note 5] Linux has some high precision timers which one could expect > =A0 =A0to use for high precision delays. In fact, I ran the test > =A0 =A0program on a recent Linux 2.6.30 (using HZ=3D1000), and the > =A0 =A0usleep() version has an error between 80 and 90us irrespective > =A0 =A0of the delay. However the select() version is much worse, > =A0 =A0and loses approx 1ms per second. > > ------------------------------------------------------ > _______________________________________________ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org" > --=20 Peter Wemm - peter@wemm.org; peter@FreeBSD.org; peter@yahoo-inc.com; KI6FJV "All of this is for nothing if we don't go to the stars" - JMS/B5 "If Java had true garbage collection, most programs would delete themselves upon execution." -- Robert Sewell