Date: Sun, 3 Jun 2012 20:49:27 +1000 (EST)
From: Bruce Evans
To: Konstantin Belousov
Cc: Gianni, Alan Cox, Alexander Kabaev, Attilio Rao, Konstantin Belousov,
    freebsd-arch@FreeBSD.org
Subject: Re: Fwd: [RFC] Kernel shared variables

On Sun, 3 Jun 2012, Konstantin Belousov wrote:

> On Sun, Jun 03, 2012 at 07:28:09AM +1000, Bruce Evans wrote:
>> On Sat, 2 Jun 2012, Konstantin Belousov wrote:
>>> ...
>>> In fact, I think that if the whole goal is only fast clocks, then we
>>> do not need any additional system mechanisms, since we can easily export
>>> coefficients for rdtsc formula already. E.g. we can put it into elf auxv,
>>> which is ugly but bearable.
>>
>> How do you get the timehands offsets? These only need to be updated
>> every second or so, or when used, but how can the application know
>> when they need to be updated if this is not done automatically in the
>> kernel by writing to a shared page? I can only think of the
>> application arranging an alarm signal every second or so and updating
>> then. No good for libraries.
> What is timehands offsets ? Do you mean things like leap seconds ?

Yes. binuptime() is:

% void
% binuptime(struct bintime *bt)
% {
% 	struct timehands *th;
% 	u_int gen;
%
% 	do {
% 		th = timehands;
% 		gen = th->th_generation;
% 		*bt = th->th_offset;
% 		bintime_addx(bt, th->th_scale * tc_delta(th));
% 	} while (gen == 0 || gen != th->th_generation);
% }

Without the kernel providing th->th_offset, you have to do lots of ntp
handling for yourself (compatibly with the kernel) just to get an
accuracy of 1 second. Leap seconds don't affect CLOCK_MONOTONIC, but
they do affect CLOCK_REALTIME, which is the clock id used by
gettimeofday(). For the former, you only have to advance the offset for
yourself occasionally (compatibly with the kernel) and manage
(compatibly with the kernel, especially in the long term) ntp slewing
and other syscall/sysctl kernel activity that micro-adjusts th->th_scale.

> This is indeed problematic for auxv. For auxv it could be solved by
> providing offset for next recheck using syscalls, and making libc code to
> respect this offset. But, I do think that vdso in shared page
> is the right solution, not auxv.

timehands in a shared page is close to working. th_generation protects
things in the same way as in the kernel, modulo assumptions that writes
are ordered.
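To make the "writes are ordered" assumption concrete, here is a minimal
sketch of how a userland reader might take a consistent snapshot from a
shared page. It is not the kernel's code: struct shared_th, its field
names, and read_snapshot() are hypothetical, and C11 atomics are used
only to spell out the ordering that the in-kernel loop takes for
granted.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of one exported slot; NOT the kernel's timehands. */
struct shared_th {
	_Atomic uint32_t gen;      /* 0 while the writer is mid-update */
	_Atomic uint64_t off_sec;  /* offset, whole seconds */
	_Atomic uint64_t off_frac; /* offset, fraction in units of 2^-64 s */
};

struct snap {
	uint64_t sec;
	uint64_t frac;
};

/* Retry until the generation is nonzero and unchanged across the reads. */
static struct snap
read_snapshot(struct shared_th *th)
{
	struct snap s;
	uint32_t gen;

	do {
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		s.sec = atomic_load_explicit(&th->off_sec,
		    memory_order_relaxed);
		s.frac = atomic_load_explicit(&th->off_frac,
		    memory_order_relaxed);
		/* Keep the data reads ahead of the generation recheck. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
	return (s);
}
```

The writer side (not shown) would have to zero the generation, store the
new values, and then publish a new nonzero generation with release
ordering for this reader to be correct.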
>> rdtsc is also very unportable, even on CPUs that have it. But all other
>> x86 timecounter hardware is too slow if you want gettimeofday() to be fast
>> and as accurate as it is now.
> !rdtsc hardware is probably cannot be used at all due to need to provide
> usermode access to device registers. The mere presence of rdtsc does not
> means that usermode indeed can use it, it should be decided by kernel
> based on the current in-kernel time source. If rdtsc is not usable, the
> corresponding data should not be exported, or implementation should go
> directly into syscall or whatever.

But then applications would:
- use gettimeofday() more than they should ("it works on Linux"), even
  more than now, since then "it works on FreeBSD-x86" too
- just be slow when gettimeofday() is slow
- kludge around gettimeofday() being slow like they do now
- kludge around gettimeofday() being slow differently from how they do
  now (using more complications to probe whether it is slow).

I found some RedHat documentation for gettimeofday() in VDSO. It seems
to leave it to the sysadmin to "tune" gettimeofday() using a boot
parameter to configure gettimeofday() being accurate/slow,
less-accurate/less-slow, or inaccurate/fast. A per-process parameter
would be more correct and harder to use (add mounds of autoconfig and
runtime code in every program[mer] that cares to detect and use it).

> In fact, I would be very grateful if an expert in time-keeping provided
> concise description of the algorithm for translating rdtsc output into
> struct timeval, also enumerating required parameters.

See above. You just scale
tc_delta(th) == (uint32_t)(rdtsc() - rdtsc_offset) when th is for TSC,
using a carefully managed fixed point scale factor. The delta is
reduced to 32 bits so that the scaling can be efficient. The result is
a bintime fraction which is added to a bintime offset.
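The arithmetic just described can be sketched as follows. This is only
an illustration under stated assumptions: struct bt, scale_for_freq(),
counter_to_bt() and frac_to_usec() are hypothetical names, the counter
value is passed in rather than read with rdtsc, and the scale is taken
to be simply 2^64 / frequency with no ntp adjustment folded in.

```c
#include <stdint.h>

/* Stand-in for a bintime: seconds plus a fraction in units of 2^-64 s. */
struct bt {
	int64_t sec;
	uint64_t frac;
};

/* Fixed-point scale: approximately 2^64 / freq, without overflowing. */
static uint64_t
scale_for_freq(uint64_t freq_hz)
{
	return ((((uint64_t)1 << 63) / freq_hz) * 2);
}

/* Add the scaled 32-bit counter delta to a bintime offset. */
static struct bt
counter_to_bt(struct bt offset, uint64_t scale, uint64_t now,
    uint64_t counter_offset)
{
	uint32_t delta = (uint32_t)(now - counter_offset);
	/* Wraps silently if delta covers a second or more. */
	uint64_t add = scale * delta;
	uint64_t old = offset.frac;

	offset.frac += add;
	if (offset.frac < old)	/* carry out of the fraction */
		offset.sec++;
	return (offset);
}

/* Convert the fraction to microseconds, as for a timeval. */
static uint32_t
frac_to_usec(uint64_t frac)
{
	return ((uint32_t)(((uint64_t)1000000 * (uint32_t)(frac >> 32)) >> 32));
}
```

With a hypothetical 2^30 Hz counter, a delta of 2^29 ticks scales to a
fraction of exactly 2^63, i.e. half a second.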
Both offsets are even more carefully managed, and everything is
protected by th_generation, and for optimality there are multiple
timehands so that th_generation very rarely changes underneath you.
The resulting bintime is then converted to a timeval or timespec as
required. This gives uptimes. Another offset is added for real times.
Times in seconds are handled more directly; it is assumed that time_t
is atomic so that th_generation is not needed for protecting them.

The TSC frequency is limited to about 4 GHz, so the above tc_delta()
works for about 4 seconds after rdtsc_offset is updated. But the
bintime fraction only works for 1 second. If either of these wraps,
then the result is still later than the update time; however, it may
be earlier than a previous result. So the update must occur at least
once per second for the TSC. Otherwise, negative time differences
occur (the final result is in advance of th_offset since the bintime
fraction is >= 0, but will be before a previous final result if the
bintime fraction wraps). Negative time differences are worse than lost
"ticks" that cause all results to be in the past.

The updates are broken by at least stopping in ddb and perhaps by
suspend/resume. The correct fix is probably to update (or just zap)
the timecounter as the first step of resuming from ddb or sleep (this
must be done before any other timecounter call). Note that times going
backwards cannot be detected in binuptime(), etc., since to detect it
you would have to write the previous time, but that would require
pessimal locking that is intentionally left out.

Timecounter internals like th_offsets are currently private in
kern_tc.c. I don't like exposing them for this, or cloning them for
FFCLOCK.

Bruce