Date: Sun, 3 Jun 2012 05:28:45 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: Davide Italiano, svn-src-projects@FreeBSD.org,
    src-committers@FreeBSD.org, Bruce Evans
Subject: Re: svn commit: r236449 - projects/calloutng/sys/kern
Message-ID: <20120603022539.C2468@besplex.bde.org>
In-Reply-To: <4FCA2988.8090106@FreeBSD.org>
References: <201206021304.q52D4p2X090537@svn.freebsd.org>
    <20120602233307.S1957@besplex.bde.org> <4FCA2988.8090106@FreeBSD.org>

On Sat, 2 Jun 2012, Alexander Motin wrote:

> On 06/02/12 17:16, Bruce Evans wrote:
>> On Sat, 2 Jun 2012, Davide Italiano wrote:
> ...
>>> Modified: projects/calloutng/sys/kern/kern_timeout.c
>>> ==============================================================================
>>> --- projects/calloutng/sys/kern/kern_timeout.c	Sat Jun  2 12:26:14 2012	(r236448)
>>> +++ projects/calloutng/sys/kern/kern_timeout.c	Sat Jun  2 13:04:50 2012	(r236449)
>>> @@ -373,9 +373,9 @@ callout_tick(void)
>>>  	need_softclock = 0;
>>>  	cc = CC_SELF();
>>>  	mtx_lock_spin_flags(&cc->cc_lock, MTX_QUIET);
>>> -	binuptime(&now);
>>> +	getbinuptime(&now);
>>>  	/*
>>> -	 * Get binuptime() may be inaccurate and return time up to 1/HZ in the past.
>>> +	 * getbinuptime() may be inaccurate and return time up to 1/HZ in the past.
>>>  	 * In order to avoid the possible loss of one or more events look back 1/HZ
>>>  	 * in the past from the time we last checked.
>>>  	 */
>>
>> Up to tc_tick/hz, not up to 1/HZ.  tc_tick is the read-only sysctl
>> variable kern.timecounter.tick; it is set to make tc_tick/hz as close
>> to 1 msec as possible.  If someone asks for busy-waiting by setting
>> HZ much larger than 1000 and uses this to generate lots of timeouts,
>> they probably get this now, but get*time() cannot be used even to
>> distinguish times differing by the timeout granularity.  It is hard
>> to see how it could ever work for the above use (timeout scheduling
>> with a granularity of ~1/hz when you can only see differences of
>> ~tc_tick/hz, with tc_tick quite often 4-10, or 100-1000 to use or
>> test corner cases?).  With a tickless kernel, timeouts wouldn't have
>> a fixed granularity, but you need accurate measurements of times even
>> more.  One slow way to get them is to call binuptime() again in the
>> above.  Another, even worse way is to update the timecounters after
>> every timeout expires (the update has a much higher overhead than
>> binuptime(), so this will be very slow iff timeouts that expire are
>> actually used).
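For reference, the difference between the two calls is easiest to see
in sys/kern/kern_tc.c.  Roughly (simplified here; the do/while
generation loop just retries a read that races tc_windup()):

	void
	binuptime(struct bintime *bt)
	{
		struct timehands *th;
		u_int gen;

		do {
			th = timehands;
			gen = th->th_generation;
			/* Start from the snapshot made at the last windup... */
			*bt = th->th_offset;
			/* ...and add the hardware ticks elapsed since then. */
			bintime_addx(bt, th->th_scale * tc_delta(th));
		} while (gen == 0 || gen != th->th_generation);
	}

	void
	getbinuptime(struct bintime *bt)
	{
		struct timehands *th;
		u_int gen;

		do {
			th = timehands;
			gen = th->th_generation;
			/* Only the cached snapshot: resolution is tc_tick/hz. */
			*bt = th->th_offset;
		} while (gen == 0 || gen != th->th_generation);
	}

binuptime() pays for a timecounter hardware read on every call;
getbinuptime() never touches the hardware, so it can never see a time
newer than the last tc_windup().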
> I agree with the first part, but could you tell more about tc_windup()
> complexity?  A lot of time has passed since that code was written,
> CPUs have got faster, and I have a feeling that the cost of that math
> may have come down and may not be so significant now.

tc_windup() might take relatively less time on faster CPUs, but only
if it is not called more often.  With a tickless kernel, it should be
called less often.  It only needs to be called several times per
wraparound of the hardware timecounter (calling it every 1/hz seconds
with hz = 100 leaves a few orders of magnitude to spare, except with
an i8254 timecounter, where the margin is at most a factor of 5), and
perhaps at least once per second for ntp processing.

> Maybe at least tc_windup() could be refactored to separate the time
> update (making its cost closer to a single binuptime() call) from all
> the other fancy (complicated) things?  The new eventtimers(4)
> subsystem uses binuptime() a lot.  Since we are already reading the
> relatively slow timecounter hardware, it would be nice to get some
> benefit from it.

tc_windup() is hard to refactor.  It depends on not being called very
often for its time domain locking to work.  Note that it has no
explicit locking, and not even memory access ordering to ensure that
its generation count is ordered relative to the data that it protects.
I'm not sure how intentional the latter is, and it seems to be too
simple to work in all cases.  The writes to the generation count are:

	th->th_generation = 0;
	/* th is now dead, modulo races. */
	// update *th
	th->th_generation = ogen;
	/*
	 * th is now live, modulo races, but is only reachable via very
	 * old pointers.  See binuptime().  It takes blocking >= 9/hz
	 * seconds for the generation count to do anything.
	 */
	// irrelevant stuff
	timehands = th;
	/*
	 * th is now live, modulo races.  Now it doesn't take any
	 * blocking to get the races (just a too-new pointer).
	 */

This depends on write ordering.  At least amd64 and i386 have strict
write ordering (except for write-combining memory).  When I started
writing this, I thought that the time domain generation was much
stronger.  Just keeping readers 1 generation behind this writer would
give the writes 1-10 msec to become visible.  There are 10 generations
of timehands to handle 9-90 msec of other problems (mainly so that the
window above, in which the update is in progress, is rarely hit).

tc_windup() has large software overheads and complexities.  In many
cases the software overheads are much smaller than the hardware
overheads of a single timecounter read, but if you call it a lot it
will need more locking.  Even mutex locking is considered too
expensive for binuptime(), etc.

I think you don't need more than about 0.0001-0.001% of bintime()'s
normal accuracy for event timers (an error of 100000 parts per million
instead of 1-10 ppm).  Hardly anyone noticed when the lapic
timecounter was miscalibrated by 10% for several years on several
FreeBSD cluster machines.  This made all timeouts 10% too short or 10%
too long.  If a timeout is 10% too long, then there is no way to
recover, but if it is 10% too short, then some places in the kernel
that use timeouts, notably nanosleep(), recover by sleeping for what
they think is the remaining time.  That sleep will probably be 10%
short too, leaving 1% of the original timeout remaining.  Eventually
this converges to a timeout only slightly longer than the original
one.
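The arithmetic of that recovery converges geometrically.  A trivial
standalone simulation (the 10% figure and the loop are illustrative
only, not the actual nanosleep() code):

	#include <stdio.h>

	int
	main(void)
	{
		double want = 1.0;	/* requested timeout, in seconds */
		double left = want;	/* what the caller thinks remains */
		double total = 0.0;	/* sleep time actually delivered */
		int round;

		for (round = 0; left > 1e-9; round++) {
			double got = 0.9 * left;	/* each sleep 10% short */

			total += got;
			left -= got;	/* re-slept on the next round */
		}
		printf("delivered %.9f of %.9f in %d rounds\n",
		    total, want, round);
		return (0);
	}

After n rounds only 10^-n of the timeout is missing; in practice each
extra round also adds a little wakeup and rescheduling latency, which
is why the converged timeout ends up slightly longer than the original.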
But the most important uses of timeouts are in device drivers.  I
think few or no drivers know that timeouts may be off by 10% or try to
recover from this.  They just assume that the timeout they asked for
is the timeout they get.

I think the problems are that after a long sleep (or even any CPU
activity that doesn't call a timer function), you don't know what the
time is, and after a short sleep, you don't know whether the sleep was
short without determining the time accurately.  All interrupts may
have stopped, and even the timecounter and cpu_ticks() hardware may
have stopped, so the normal methods for determining the time might not
work at all.  But if the sleep was short and shallow, then cpu_ticks()
probably works across it, and if the sleep was long, then determining
the new time precisely after it is a tiny part of resuming everything.

BTW, AFAIK determining the time precisely, and resuming long timeouts,
are quite broken after long sleeps.  I don't care and haven't really
tested this, since I don't have any systems that can suspend properly
under FreeBSD.  But short timeouts can be handled reasonably after a
long sleep, either by completing them immediately after the sleep
(with some jitter to avoid thundering herds) or by extending them by
the length of the sleep.  I think the latter is what happens now.  It
is what breaks long timeouts.
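In sketch form, with invented names (this is not what the kernel
actually does, just the two policies side by side):

	#include <stdint.h>

	/* Hypothetical pending timeout; expiry is uptime in nanoseconds. */
	struct pending_timeout {
		uint64_t	expiry;
	};

	/*
	 * Policy 1: run timeouts that expired during the sleep almost
	 * immediately, spread over a small window so they don't all
	 * fire on the first tick after resume (the thundering herd).
	 */
	static void
	resume_fire_soon(struct pending_timeout *t, uint64_t now,
	    uint64_t window)
	{
		if (t->expiry <= now)
			t->expiry = now + t->expiry % (window + 1);
	}

	/*
	 * Policy 2: push every pending expiry out by the time slept.
	 * Short timeouts keep roughly their intended meaning, but a
	 * long timeout aimed at an absolute deadline now fires late by
	 * the whole sleep; that is the breakage for long timeouts
	 * described above.
	 */
	static void
	resume_extend(struct pending_timeout *t, uint64_t slept)
	{
		t->expiry += slept;
	}

Bruce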