Date: Tue, 5 Mar 2013 21:43:30 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alexander Motin
Cc: Davide Italiano, arch@FreeBSD.org
Subject: Re: tickless design guidelines
In-Reply-To: <5135B934.5060202@FreeBSD.org>
Message-ID: <20130305204500.T1059@besplex.bde.org>
References: <20130305080134.GC13187@onelab2.iet.unipi.it> <5135AFAD.70408@FreeBSD.org>
 <20130305090735.GB18221@onelab2.iet.unipi.it> <5135B934.5060202@FreeBSD.org>

On Tue, 5 Mar 2013, Alexander Motin wrote:

> On 05.03.2013 11:07, Luigi Rizzo wrote:
>> Also i wonder if it may make sense to add a feature so that whenever
>> we get an interrupt and a fast and suitable timecounter is available,
>> some system-wide bintime is updated.

bintime is intentionally updated as little as possible, since updating it
is an expensive operation that would need even more expensive locking if
it were done often.

Note that get*time() is fundamentally broken, since the timestamps that it
returns are incoherent with timestamps returned by non-get*time().

I used to sync get*time() with non-get*time() on every call to the latter.
E.g., on every call to bintime(), time_second, timehands->th_microtime and
timehands->th_nanotime are updated with the time just read by binuptime().
getbintime(), etc., use the possibly-updated values.  This makes
getbintime(), etc., more accurate if calls to them are mixed with calls to
bintime(), but this is a side effect.  The sync is done for correctness.
It ensures that the time is monotonic when read using any mixture of
get*time() and non-get*time() APIs.  Without it, file timestamps written
using the value of time_second or getnanotime() are noticeably broken when
compared with the real time read using gettimeofday().  But this is slow,
and phk didn't like it.  I only did it for the !SMP cases and dropped it
after a few years.  Your suggestion is similar.
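
The incoherence can even be seen from userland, since the get*time()
family is exported through the *_FAST clock ids.  The following test
program is only an illustration (it is not part of the old patch), and it
assumes CLOCK_MONOTONIC_FAST is served from the timehands like
getnanouptime() while CLOCK_MONOTONIC does a full timecounter read like
nanouptime():

#include <stdio.h>
#include <time.h>

int
main(void)
{
        struct timespec precise, fast;
        int i;

        for (i = 0; i < 1000000; i++) {
                clock_gettime(CLOCK_MONOTONIC, &precise);
                /* A "fast" timestamp taken later can still be earlier. */
                clock_gettime(CLOCK_MONOTONIC_FAST, &fast);
                if (fast.tv_sec < precise.tv_sec ||
                    (fast.tv_sec == precise.tv_sec &&
                    fast.tv_nsec < precise.tv_nsec)) {
                        printf("fast clock is behind the precise clock\n");
                        return (0);
                }
        }
        printf("no incoherence observed in this run\n");
        return (0);
}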
Updating the system-wide bintime requires knowing the current bintime, and
nothing much more or less than a very recent read of a hardware
timecounter gives that.  The difficulty is atomically updating the whole
timecounter state with values derived from that.

>> So getbinuptime() could just use this counter, becoming extremely
>> cheap (perhaps it is already, i am not sure) and in the long term,
>> as CPUs with fixed frequency TSC become ubiquitous,
>> we would get higher resolution as the interrupt load increases.

getbinuptime() is already extremely cheap, but imprecise and incoherent.
If you assume a fixed-frequency TSC, then you can just always use
bintime().  Unfortunately, the same CPUs that have a fixed-frequency TSC
also have slow-to-read TSCs.  The TSC read time increased from ~9 cycles
on AthlonXP and Athlon64 to 42 on Phenom, and on Core2 it is even slower.
9 cycles were almost in the noise, so getbintime() was only slightly
faster than bintime() on Athlon64.

> Each time timer interrupt fires, or CPU timers "switched" between idle
> and active modes present time is fetched. The only question is how to
> store it somewhere safely. It was difficult for 96/128-bit struct
> bintime. It is easier for 64-bit sbintime_t, but still not all archs are
> able to do it atomically. The mechanism used for getbinuptime() now may
> not scale well to the very high update rates.

getbinuptime() with high update rates would reduce to a bad version of
binuptime().  Actually, getbinuptime() would probably be unchanged.  The
updates would be in binuptime().  You could update only every 10 or 100
usec to keep the update rate down, but then you still don't get coherency.
Actually^2, timekeeping is unimportant for timeouts.  Thus you could use
sloppy non-atomic updates for everything and barely notice losing the
races for this.

To get semi-coherent and more accurate (when the races aren't lost) times
in sbinuptime(), just record a delta-time on every hardware timecounter
call.  Arrange to call hardware timecounters fairly often, and add the
delta time in sbinuptime().  Better not touch getbinuptime(), etc.  Let
them remain mostly race-free and incoherent.  (A sketch of this delta idea
follows the old patch below.)

Here is my old syncing code:

@ diff -c2 src/sys/kern/kern_tc.c~ src/sys/kern/kern_tc.c
@ *** src/sys/kern/kern_tc.c~	Fri Mar 5 17:03:07 2004
@ --- src/sys/kern/kern_tc.c	Fri Mar 5 17:03:08 2004
@ ***************
@ *** 172,175 ****
@ --- 224,254 ----
@   }
@
@ + #define SYNC_TIME
@ +
@ + #if defined(SYNC_TIME) && !defined(SMP)
@ + static void
@ + sync_time(long sec, long nsec, long usec)

nanotime() and microtime() call here with the time that they just read.
I don't bother syncing full bintimes, since this was mainly intended to
fix file timestamps.

@ + {
@ + 	register_t saveintr;
@ +
@ + 	saveintr = intr_disable();

Locking for !SMP only.

@ + 	if (time_second < sec)
@ + 		time_second = sec;
@ + 	if (timehands->th_microtime.tv_sec < sec ||
@ + 	    (timehands->th_microtime.tv_sec == sec &&
@ + 	    timehands->th_microtime.tv_usec < usec)) {
@ + 		timehands->th_microtime.tv_sec = sec;
@ + 		timehands->th_microtime.tv_usec = usec;
@ + 	}
@ + 	if (timehands->th_nanotime.tv_sec < sec ||
@ + 	    (timehands->th_nanotime.tv_sec == sec &&
@ + 	    timehands->th_nanotime.tv_nsec < nsec)) {
@ + 		timehands->th_nanotime.tv_sec = sec;
@ + 		timehands->th_nanotime.tv_nsec = nsec;
@ + 	}

The secondary values are now coherent for use in getnanotime(), etc.
sbintime() support could set the final value here instead of a delta.
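
As a rough sketch of that delta idea (hypothetical code meant to live in
kern_tc.c, where struct timehands is visible; note_tc_read(), sbt_extra
and sbinuptime_sloppy() are made-up names, and it assumes less than a
second passes between timehands windups so the delta fits in 32 bits):

/*
 * Remember how far the hardware timecounter had advanced past the
 * current timehands generation the last time anyone read it.  The
 * value is the high 32 bits of a 64-bit binary fraction of a second,
 * i.e., the fractional part of an sbintime_t.
 */
static volatile uint32_t sbt_extra;

/* Call wherever the hardware timecounter has just been read. */
static void
note_tc_read(struct timehands *th, u_int count)
{
        uint64_t frac;

        frac = th->th_scale *
            ((count - th->th_offset_count) & th->th_counter->tc_counter_mask);
        /* A plain 32-bit store; a lost race only loses a little accuracy. */
        sbt_extra = (uint32_t)(frac >> 32);
}

/* Semi-coherent, race-tolerant reader. */
static sbintime_t
sbinuptime_sloppy(void)
{

        return (bttosbt(timehands->th_offset) + (sbintime_t)sbt_extra);
}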
The reason to use a delta is that it fits in 32 bits, so it is easy to
access atomically and the write doesn't need much locking.  Oops, the
intr_disable() locking in sync_time() is inadequate even for !SMP -- the
updates may be non-monotonic since the lock is not around the whole
timecounter-read...update sequence.

@ + 	intr_restore(saveintr);
@ + }
@ + #endif /* SYNC_TIME && !SMP */
@ +
@   void
@   bintime(struct bintime *bt)
@ ***************
@ *** 189,192 ****
@ --- 268,274 ----
@   	bintime(&bt);
@   	bintime2timespec(&bt, tsp);
@ + #if defined(SYNC_TIME) && !defined(SMP)
@ + 	sync_time(tsp->tv_sec, tsp->tv_nsec, tsp->tv_nsec / 1000);
@ + #endif
@   }
@

Any call to bintime() syncs time_second and the timespec and timeval in
the timehands, but not the full bintime.  So, for example, getnanotime()
benefits from bintime() being called, but getbintime() remains incoherent
with bintime().

Bruce