Date:      Fri, 8 Jun 2012 22:43:38 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: Fast gettimeofday(2) and clock_gettime(2)
Message-ID:  <20120608215043.Q2736@besplex.bde.org>
In-Reply-To: <20120608112850.GE85127@deviant.kiev.zoral.com.ua>
References:  <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> <20120608155521.S1201@besplex.bde.org> <20120608112850.GE85127@deviant.kiev.zoral.com.ua>

On Fri, 8 Jun 2012, Konstantin Belousov wrote:

> On Fri, Jun 08, 2012 at 05:48:12PM +1000, Bruce Evans wrote:
>> On Thu, 7 Jun 2012, Konstantin Belousov wrote:
>>
>>> On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote:
>>>> 
>>>> Thank you.  BTW, I think we should use atomic_load_acq_int() on both
>>>> accesses
>>>> to th_gen (and the in-kernel binuptime should do the same).  I realize
>>>> ...
>>
>> The atomic_load_acq_int() (or rmb()) would completely defeat one of
>> the main points in the design of binuptime(), which was to be lock-free
>> ...
>> by th_generation to be ordered relative to loads of th_generation.  An
>> acq barrier for th_generation works somewhat bogusly (on x86) by supplying
>> a barrier for the one variable that doesn't need it for ordering.
> load_acq is not a lock; it is serialization.

By "lock-free", I meant "free of all types of locks and atomic ops,
including for example the x86 lock prefix which is not a lock but is
often used to implement locks via its serialization properties".  Then
I wrote "barrier", but noted that this is acting strangely by turning
th_generation into a locking gate that locks other variables.  I think
this depends on th_generation being loaded first (in program order) in
binuptime() etc.  It would be more natural to put the read barrier before
the first read of another variable.
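
For concreteness, the retry loop being argued about looks roughly like the
sketch below (illustrative C11 atomics and made-up struct and field names,
not the kernel's actual binuptime() code; the barrier placement is exactly
the point in dispute):

#include <stdatomic.h>
#include <stdint.h>

struct th_sketch {
	atomic_uint	th_generation;	/* 0 while the writer updates */
	uint64_t	th_offset;	/* stand-ins for the real payload */
	uint64_t	th_scale;
};

static uint64_t
read_sketch(struct th_sketch *th, uint64_t delta)
{
	unsigned int gen;
	uint64_t t;

	do {
		/*
		 * The acquire load on both reads of th_generation is what
		 * John proposed; the objection above is that, as currently
		 * implemented on x86, it costs a locked instruction or
		 * fence that program-ordered plain loads would not need.
		 */
		gen = atomic_load_explicit(&th->th_generation,
		    memory_order_acquire);
		t = th->th_offset + th->th_scale * delta;
	} while (gen == 0 || gen != atomic_load_explicit(&th->th_generation,
	    memory_order_acquire));
	return (t);
}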

>> Second thoughts about whether x86 needs the acq barrier: stores to all
>> the variables in tc_windup() are ordered by x86 memory semantics.  This
>> gives them a good ordering relative to the stores to th_generation, or
>> at least can do this.  A similar ordering is then forced for the loads
>> in binuptime() etc, since x86 memory semantics ensure that each load
>> occurs after the corresponding store to the same address.  Maybe this
>> is enough, or can be made to be enough with a more careful ordering of
>> the stores.  This is MD and hard to understand.
> The ordering of loads with respect to stores to the same address only
> happens on the same core.

So my first thoughts were better.

> On x86, loads cannot be reordered with other loads,
> but potentially this could happen on other arches.

I think you mean stores cannot be reordered with other stores.

>>> This is done. On the other hand, I removed a store_rel from updating
>>> tk_current, since it comes after the store to th_gen that re-enables
>>> readers, and the order there does not matter.
>>
>> Sigh.  The extremes of some locking pessimizations on an Athlon64 i386
>> UP are:
>>
>>     rdtsc                                takes  6.5 cycles
>>     rdtsc; movl mem,%ecx                 takes  6.5 cycles
>>     xchgl mem,%ecx                       takes 32 cycles
>>     rdtsc; lfence; movl mem,%ecx         takes 34 cycles
>>     rdtsc; xchgl mem,%ecx                takes 38 cycles
>>     xchgl mem,%ecx; rdtsc                takes 40 cycles
>>     xchgl mem,%eax; rdtsc                takes 40 cycles
>>     rdtsc; xchgl mem,%eax                takes 44 cycles
>>     rdtsc; mfence; movl mem,%ecx         takes 52 cycles

All except the first 2 here are twice as high as they should be.

>> So the software overheads are 5-8 times larger than the hardware overheads
>> for a TSC timecounter, even when we only lock a single load.  Later CPUs

2.5-4 times.

>> have much slower rdtsc, taking 40+ cycles, so the software overheads are
>> relatively smaller, especially since they are mostly in parallel with
>> the slow rdtsc.  On core2 i386 SMP:
> I suspect that what you measured for fence overhead is actually the time
> to retire the whole queue of read (and/or write) requests accumulated so far
> in the pipeline, and not the overhead of a synchronous rdtsc read.

Yes, full serialization probably takes much longer.  I don't know of any
better serialization instruction than cpuid (if rdtscp is not available).
More times for Athlon64:

     rdtsc                                takes  6.5 cycles
     lfence; rdtsc                        takes 17 cycles
     rdtsc; lfence; movl mem,%ecx         takes 17 cycles (correction of above)
     cpuid; rdtsc                         takes 63 cycles

>>     rdtsc                                takes 65 cycles (yes, 10x slower)
>>     rdtsc; movl mem,%ecx                 takes 65 cycles
>>     xchgl mem,%ecx                       takes 25 cycles
>>     rdtsc; lfence; movl mem,%ecx         takes 73 cycles
>>     rdtsc; xchgl mem,%ecx                takes 74 cycles
>>     xchgl mem,%ecx; rdtsc                takes 74 cycles
>>     xchgl mem,%eax; rdtsc                takes 74 cycles
>>     rdtsc; xchgl mem,%eax                takes 74 cycles
>>     rdtsc; mfence; movl mem,%ecx         takes 69 cycles (yes, beats lfence)

These times (for core2) are correct.  Now with cpuid:

     rdtsc                                takes  6.5 cycles
     lfence; rdtsc                        takes 75 cycles
     rdtsc; lfence; movl mem,%ecx         takes 73 cycles (correct above)
     cpuid; rdtsc                         takes 298 cycles (gak!)
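
For what it's worth, numbers like the above can be produced with a harness
along these lines (an illustrative sketch using GCC-style inline asm, not
the actual program behind the tables; a cycle or two of loop overhead is
included in the result):

#include <stdint.h>
#include <stdio.h>

static inline uint64_t
rdtsc64(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32 | lo);
}

int
main(void)
{
	enum { N = 1000000 };
	static volatile int mem;
	uint32_t dlo, dhi;
	uint64_t start, end;
	int i;

	start = rdtsc64();
	for (i = 0; i < N; i++) {
		/* sequence under test: rdtsc; lfence; movl mem,%ecx */
		__asm__ __volatile__(
		    "rdtsc\n\t"
		    "lfence\n\t"
		    "movl %2, %%ecx"
		    : "=a" (dlo), "=d" (dhi)
		    : "m" (mem)
		    : "ecx");
	}
	end = rdtsc64();
	printf("%.1f cycles per iteration\n", (double)(end - start) / N);
	return (0);
}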

>> - fix old kernel bugs:
>> ...
> The goal of the patch is only to move the code from kernel into userspace,
> trying not to change algorithms. The potential changes you describe
> above should be done both in kernel and in usermode after that.

I think being more efficient might expose more races.  With syscalls,
small and negative time differences can't be seen since the syscall
takes longer.  With kernel calls, small and negative time differences
shouldn't happen since the kernel shouldn't be silly enough to spin
calling a timecounter function.

>>     Note that the large syscall overhead prevents seeing any small
>>     negative time differences from userland in the same way as above.
>>     But without that overhead, either in the kernel or in moronix
>>     userland, small negative time differences might be visible,
>>     depending on how small they are and on how efficient the functions
>>     are.
> So far I have not seen time going backward in a tight gettimeofday() loop.
> This indeed is one of my main worries.

Try taking out the shift.  I plan to try to get out-of-order loads using
cache misses.  I'm not sure how that would give out-of-order rdtsc's.
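
A sketch of such a tight loop, for anyone who wants to hammer on it from
userland (an illustrative test, not the one used in this thread):

#include <sys/time.h>
#include <stdio.h>

int
main(void)
{
	struct timeval prev, cur;
	long i;

	gettimeofday(&prev, NULL);
	for (i = 0; i < 100000000L; i++) {
		gettimeofday(&cur, NULL);
		if (timercmp(&cur, &prev, <)) {
			printf("backwards: %ld.%06ld -> %ld.%06ld\n",
			    (long)prev.tv_sec, (long)prev.tv_usec,
			    (long)cur.tv_sec, (long)cur.tv_usec);
			return (1);
		}
		prev = cur;
	}
	printf("no backwards step seen\n");
	return (0);
}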

>> ...
>>     different type of fuzziness.  This fuzziness shouldn't be fixed
>>     by adding serialization instructions for it (though one for acq
>>     may do this accidentally), since that would make it much slower.
>>     rdtscp should rarely be used since it is serializing so it gives
>>     similar slowness.  Does it do any more than "cpuid; rdtsc"?
> rdtscp allows atomically getting the current package tsc counter and
> obtaining some reference to the current core identifier. If we produce
> 'skew tables' to compensate for different tsc initial values and possible
> drift, then we could use the tsc counter on a wider range of hardware, by
> adjusting the value returned from rdtsc by the skew table's repair
> addendum. Rdtscp is atomic in this respect.

rdtscp would be too slow if it is as slow as the above for cpuid; rdtsc.
But at least early phenom docs say that both rdtsc and rdtscp take 41+6
cycles.  I read a bit more of its documentation.  It seems to be exactly
rdtsc with the serialization and core number load, and without the
register clobbering and extra overhead of the cpuid instruction.  I
only noticed the other day, when someone fixed it, that the kernel
already has this skew adjustment in dtrace code.  The adjustment was
backwards...  The index to the skew table is curcpu.  There is a
sched_pin() in the initialization code, but none in the timer read
code, so I don't see how the latter can work right even if the adjustment
is forwards.  Unless the caller always does the sched_pin(), but that
would be slow and probably undocumented.
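
For illustration, a skew-corrected read along the lines Konstantin describes
could look like this (the tsc_skew table and the convention that the OS has
loaded the cpu number into IA32_TSC_AUX are assumptions here, not existing
kernel interfaces):

#include <stdint.h>

extern int64_t tsc_skew[];	/* hypothetical per-cpu skew table */

static inline uint64_t
rdtscp_adjusted(void)
{
	uint32_t lo, hi, aux;

	/* rdtscp returns the counter in edx:eax and IA32_TSC_AUX in ecx. */
	__asm__ __volatile__("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	/*
	 * The counter and the cpu identifier come from the same
	 * instruction, so the adjustment is applied against the right
	 * cpu's entry even if the thread migrates immediately afterwards,
	 * without needing sched_pin() around the read.
	 */
	return (((uint64_t)hi << 32 | lo) + tsc_skew[aux]);
}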

Bruce


