Date: Fri, 8 Jun 2012 22:43:38 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: freebsd-arch@freebsd.org
Subject: Re: Fast gettimeofday(2) and clock_gettime(2)
Message-ID: <20120608215043.Q2736@besplex.bde.org>
In-Reply-To: <20120608112850.GE85127@deviant.kiev.zoral.com.ua>
References: <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> <20120608155521.S1201@besplex.bde.org> <20120608112850.GE85127@deviant.kiev.zoral.com.ua>
On Fri, 8 Jun 2012, Konstantin Belousov wrote:

> On Fri, Jun 08, 2012 at 05:48:12PM +1000, Bruce Evans wrote:
>> On Thu, 7 Jun 2012, Konstantin Belousov wrote:
>>
>>> On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote:
>>>>
>>>> Thank you.  BTW, I think we should use atomic_load_acq_int() on both
>>>> accesses to th_gen (and the in-kernel binuptime should do the same).
>>>> I realize ...
>>
>> The atomic_load_acq_int() (or rmb()) would completely defeat one of
>> the main points in the design of binuptime(), which was to be lock-free
>> ...
>> by th_generation to be ordered relative to loads of th_generation.  An
>> acq barrier for th_generation works somewhat bogusly (on x86) by
>> supplying a barrier for the one variable that doesn't need it for
>> ordering.
> load_acq is not a lock, it is serialization.

By "lock-free", I meant "free of all types of locks and atomic ops,
including for example the x86 lock prefix, which is not a lock but is
often used to implement locks via its serialization properties".  Then
I wrote "barrier", but noted that this is acting strangely by turning
th_generation into a locking gate that locks other variables.  I think
this depends on th_generation being loaded first (in program order) in
binuptime() etc.  It would be more natural to put the read barrier
before the first read of another variable.

>> Second thoughts about whether x86 needs the acq barrier: stores to all
>> the variables in tc_windup() are ordered by x86 memory semantics.  This
>> gives them a good ordering relative to the stores to th_generation, or
>> at least can do this.  A similar ordering is then forced for the loads
>> in binuptime() etc, since x86 memory semantics ensure that each load
>> occurs after the corresponding store to the same address.  Maybe this
>> is enough, or can be made to be enough with a more careful ordering of
>> the stores.  This is MD and hard to understand.
> The ordering of loads relative to stores to the same address only
> happens on the same core.

So my first thoughts were better.

> On x86, loads cannot be reordered with other loads,
> but potentially this could happen on other arches.

I think you mean stores cannot be reordered with other stores.

>>> This is done.  On the other hand, I removed a store_rel from updating
>>> tk_current, since it is after enabling store to th_gen, and the order
>>> there does not matter.
>>
>> Sigh.  The extremes of some locking pessimizations on an Athlon64 i386
>> UP system are:
>>
>>   rdtsc                          takes  6.5 cycles
>>   rdtsc; movl mem,%ecx           takes  6.5 cycles
>>   xchgl mem,%ecx                 takes 32   cycles
>>   rdtsc; lfence; movl mem,%ecx   takes 34   cycles
>>   rdtsc; xchgl mem,%ecx          takes 38   cycles
>>   xchgl mem,%ecx; rdtsc          takes 40   cycles
>>   xchgl mem,%eax; rdtsc          takes 40   cycles
>>   rdtsc; xchgl mem,%eax          takes 44   cycles
>>   rdtsc; mfence; movl mem,%ecx   takes 52   cycles

All except the first 2 here are twice as high as they should be.

>> So the software overheads are 5-8 times larger than the hardware
>> overheads for a TSC timecounter, even when we only lock a single load.
>> Later CPUs

2.5-4 times.

>> have much slower rdtsc, taking 40+ cycles, so the software overheads
>> are relatively smaller, especially since they are mostly in parallel
>> with the slow rdtsc.  On core2 i386 SMP:
> I suspect that what you measured for fence overhead is actually the time
> to retire the whole queue of read (and/or write) requests accumulated so
> far in the pipeline, and not the overhead of a synchronous rdtsc read.

Yes, full serialization probably takes much longer.
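
(Aside for anyone who wants to reproduce numbers of this kind: a loop of
roughly the following shape gives per-sequence cycle counts.  This is
only a sketch; REPS, the wrapper name and the choice of test sequence
are illustrative, not the actual harness behind the figures quoted
here.)

/*
 * Sketch of a cycle-counting loop for instruction sequences like
 * "rdtsc; lfence; movl mem,%ecx".  Illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define	REPS	1000000			/* illustrative iteration count */

static volatile uint32_t mem;		/* target of the movl mem,%ecx load */

static inline uint64_t
rdtsc64(void)
{
	uint32_t lo, hi;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

int
main(void)
{
	uint64_t start, end;
	uint32_t lo, hi;
	int i;

	start = rdtsc64();
	for (i = 0; i < REPS; i++) {
		/* The sequence under test. */
		__asm __volatile(
		    "rdtsc\n\t"
		    "lfence\n\t"
		    "movl %2,%%ecx"
		    : "=a" (lo), "=d" (hi)
		    : "m" (mem)
		    : "ecx");
	}
	end = rdtsc64();
	(void)lo;
	(void)hi;

	/* Loop overhead is not subtracted; measure an empty loop too. */
	printf("%.1f cycles per iteration\n", (double)(end - start) / REPS);
	return (0);
}

Note that this measures the throughput of the sequence in a tight loop,
with loop overhead not subtracted, rather than the latency of a single
serialized execution.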
I don't know of any better serialization instruction than cpuid (if
rdtscp is not available).  More times for Athlon64:

  rdtsc                          takes  6.5 cycles
  lfence; rdtsc                  takes 17   cycles
  rdtsc; lfence; movl mem,%ecx   takes 17   cycles (correction of above)
  cpuid; rdtsc                   takes 63   cycles

>>   rdtsc                          takes 65 cycles (yes, 10x slower)
>>   rdtsc; movl mem,%ecx           takes 65 cycles
>>   xchgl mem,%ecx                 takes 25 cycles
>>   rdtsc; lfence; movl mem,%ecx   takes 73 cycles
>>   rdtsc; xchgl mem,%ecx          takes 74 cycles
>>   xchgl mem,%ecx; rdtsc          takes 74 cycles
>>   xchgl mem,%eax; rdtsc          takes 74 cycles
>>   rdtsc; xchgl mem,%eax          takes 74 cycles
>>   rdtsc; mfence; movl mem,%ecx   takes 69 cycles (yes, beats lfence)

These times (for core2) are correct.  Now with cpuid:

  rdtsc                          takes  6.5 cycles
  lfence; rdtsc                  takes 75   cycles
  rdtsc; lfence; movl mem,%ecx   takes 73   cycles (correct above)
  cpuid; rdtsc                   takes 298  cycles (gak!)

>> - fix old kernel bugs:
>> ...
> The goal of the patch is only to move the code from kernel into userspace,
> trying not to change algorithms.  The potential changes you describe
> above should be done both in kernel and in usermode after that.

I think being more efficient might expose more races.  With syscalls,
small and negative time differences can't be seen since the syscall
takes longer.  With kernel calls, small and negative time differences
shouldn't happen since the kernel shouldn't be silly enough to spin
calling a timecounter function.

>> Note that the large syscall overhead prevents seeing any small
>> negative time differences from userland in the same way as above.
>> But without that overhead, either in the kernel or in moronix
>> userland, small negative time differences might be visible,
>> depending on how small they are and on how efficient the functions
>> are.
> So far I did not see time going backward in a tight gettimeofday() loop.
> This indeed is one of my main worries.

Try taking out the shift.  I plan to try to get out-of-order loads
using cache misses.  Not sure how that would give out-of-order rdtsc's.

>> ...
>> different type of fuzziness.  This fuzziness shouldn't be fixed
>> by adding serialization instructions for it (though one for acq
>> may do this accidentally), since that would make it much slower.
>> rdtscp should rarely be used since it is serializing, so it gives
>> similar slowness.  Does it do any more than "cpuid; rdtsc"?
> rdtscp allows us to atomically get the current package tsc counter and
> obtain some reference to the current core identifier.  If we produce
> 'skew tables' to compensate for different tsc initial values and
> possible drift, then we could use the tsc counter on a wider range of
> hardware, by adjusting the value returned from rdtsc by the skew table
> repair addendum.  Rdtscp is atomic in this respect.

rdtscp would be too slow if it is as slow as the above for cpuid; rdtsc.
But at least early phenom docs say that both rdtsc and rdtscp take 41+6
cycles.  I read a bit more of its documentation.  It seems to be exactly
rdtsc with the serialization and core number load, and without the
register clobbering and extra overhead of the cpuid instruction.

I only noticed the other day, when someone fixed it, that the kernel
already has this skew adjustment in dtrace code.  The adjustment was
backwards...  The index to the skew table is curcpu.  There is a
sched_pin() in the initialization code, but none in the timer read code,
so I don't see how the latter can work right even if the adjustment is
forwards.  Unless the caller always does the sched_pin(), but that would
be slow and probably undocumented.

Bruce
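
P.S.  To make the rdtscp-plus-skew-table idea concrete, a minimal sketch
follows.  The names and the MAXCPU bound are made up, and it assumes the
kernel has loaded the TSC_AUX MSR with something usable as a cpu index
(which is exactly the part a plain rdtsc cannot give you without
sched_pin()); it is not the existing dtrace code.

#include <stdint.h>

#define	MAXCPU	64			/* arbitrary bound for the sketch */

/*
 * Per-cpu addend, filled in once at calibration time (under sched_pin()
 * or equivalent), as described above.
 */
static int64_t	tsc_skew[MAXCPU];

static inline uint64_t
rdtscp_adjusted(void)
{
	uint32_t lo, hi, aux;

	/*
	 * rdtscp returns the TSC in edx:eax and the TSC_AUX MSR in ecx.
	 * Assuming TSC_AUX holds a cpu id, the counter value and the
	 * index into the skew table come from the same core, so no
	 * pinning is needed around the read itself.  The modulo only
	 * keeps the sketch in bounds.
	 */
	__asm __volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	return (((uint64_t)hi << 32 | lo) + tsc_skew[aux % MAXCPU]);
}

Whether this wins over a pinned rdtsc read then comes down to how slow
rdtscp itself is, per the 41+6 cycle figure above.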