Date:      Fri, 8 Jun 2012 17:48:12 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: Fast gettimeofday(2) and clock_gettime(2)
Message-ID:  <20120608155521.S1201@besplex.bde.org>
In-Reply-To: <20120607172839.GZ85127@deviant.kiev.zoral.com.ua>
References:  <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua>

On Thu, 7 Jun 2012, Konstantin Belousov wrote:

> On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote:
>> On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote:
>>> On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote:
>>>> In general this looks good but I see a few nits / races:
>>>>
>>>> 1) You don't follow the model of clearing tk_current to 0 while you
>>>>    are updating the structure that the in-kernel timecounter code
>>>>    uses.  This also means you have to avoid using a tk_current of 0
>>>>    and that userland has to keep spinning as long as tk_current is 0.
>>>>    Without this I believe userland can read a partially updated
>>>>    structure.
>>> I changed the code to be much more similar to the kern_tc.c. I (re)added
>>> the generation field, which is set to 0 upon kernel touching timehands.
>>
>> Thank you.  BTW, I think we should use atomic_load_acq_int() on both accesses
>> to th_gen (and the in-kernel binuptime should do the same).  I realize this
>> requires using rmb before the while condition in userland since we can't
>> use atomic_load_acq_int() here.  I think it should also use
>> atomic_store_rel_int() for both stores to th_gen during the tc_windup()
>> callback.

The atomic_load_acq_int() (or rmb()) would completely defeat one of
the main points in the design of binuptime(), which was to be lock-free
so as to be efficient (the atomic_store_rel_int() is rarely done so
fixing it doesn't affect efficiency, especially on x86 after kib's
recent changes removed the serialization from it).  However, I now think
the acq part of the load is needed even on x86.  x86 allows loads out of
order, except in the case where the load is from the same address as a
previous store.  So no explicit memory barrier is needed (on x86) for
loads of th_generation to be ordered relative to stores to th_generation.
But read barriers seem to be needed for loads of the variables protected
by th_generation to be ordered relative to loads of th_generation.  An
acq barrier for th_generation works somewhat bogusly (on x86) by supplying
a barrier for the one variable that doesn't need it for ordering.

The correct fix seems to be to use time-domain locking even more: set the
timehands pointer to the previous generation instead of the current one.
Modulo other bugs, this gives >= 1 msec for the previous generation to
stabilize.  Speculative loads would have to be more than 1 msec in the
past to cause problems.  But they can't be, since the thread must have
been preempted for its speculative load to live that long, and the
preemption would/should have issued a barrier instruction.  Except when
the speculative load reaches a register before the preemption -- that case
is handled by the generation count: since the timehands being used must
be more than 1 generation behind for its th_generation to change, the
memory barrier instruction for the preemption ensures that the change to
th_generation is seen, so the new timehands is loaded.
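
A hypothetical sketch of that publication step (the names are
illustrative -- tk_current as in jhb's description above, the rest
invented -- and the point is only the last line):

    th->th_generation = 0;                /* hands being rewritten */
    /* ... update th_offset, th_scale, th_offset_count, ... */
    atomic_store_rel_int(&th->th_generation, ogen + 1);
    timehands = th;                       /* kernel: newest hands */
    tk->tk_current = prev_hands_index;    /* userland: previous hands */

The kernel keeps using the newest hands; userland is pointed at the
previous ones, which have had a full tc_windup() period to stabilize.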

Second thoughts about whether x86 needs the acq barrier: stores to all
the variables in tc_windup() are ordered by x86 memory semantics.  This
gives them a good ordering relative to the stores to th_generation, or
at least can do this.  A similar ordering is then forced for the loads
in binuptime() etc, since x86 memory semantics ensure that each load
occurs after the corresponding store to the same address.  Maybe this
is enough, or can be made to be enough with a more careful ordering of
the stores.  This is MD and hard to understand.

> This is done. On the other hand, I removed a store_rel from updating
> tk_current, since it is after enabling store to th_gen, and the order
> there does not matter.

Sigh.  The extremes of some locking pessimizations on an Athlon64 i386
UP are:

     rdtsc                                takes  6.5 cycles
     rdtsc; movl mem,%ecx                 takes  6.5 cycles
     xchgl mem,%ecx                       takes 32 cycles
     rdtsc; lfence; movl mem,%ecx         takes 34 cycles
     rdtsc; xchgl mem,%ecx                takes 38 cycles
     xchgl mem,%ecx; rdtsc                takes 40 cycles
     xchgl mem,%eax; rdtsc                takes 40 cycles
     rdtsc; xchgl mem,%eax                takes 44 cycles
     rdtsc; mfence; movl mem,%ecx         takes 52 cycles

So the software overheads are 5-8 times larger than the hardware overheads
for a TSC timecounter, even when we only lock a single load.  Later CPUs
have much slower rdtsc, taking 40+ cycles, so the software overheads are
relatively smaller, especially since they are mostly in parallel with
the slow rdtsc.  On core2 i386 SMP:

     rdtsc                                takes 65 cycles (yes, 10x slower)
     rdtsc; movl mem,%ecx                 takes 65 cycles
     xchgl mem,%ecx                       takes 25 cycles
     rdtsc; lfence; movl mem,%ecx         takes 73 cycles
     rdtsc; xchgl mem,%ecx                takes 74 cycles
     xchgl mem,%ecx; rdtsc                takes 74 cycles
     xchgl mem,%eax; rdtsc                takes 74 cycles
     rdtsc; xchgl mem,%eax                takes 74 cycles
     rdtsc; mfence; movl mem,%ecx         takes 69 cycles (yes, beats lfence)
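
(A rough userland harness of the kind that produces numbers like these
is sketched below.  It is illustrative only -- loop overhead is not
subtracted, so the printed values are upper bounds -- and it assumes
gcc-style inline asm and an SSE2 CPU for lfence.)

    #include <stdint.h>
    #include <stdio.h>

    static volatile uint32_t mem;

    static inline uint64_t
    rdtsc(void)
    {
            uint32_t lo, hi;

            __asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
            return ((uint64_t)hi << 32 | lo);
    }

    int
    main(void)
    {
            uint64_t start, end;
            int i, n = 1000000;

            start = rdtsc();
            for (i = 0; i < n; i++) {
                    /* Sequence being timed: rdtsc; lfence; movl mem,%ecx. */
                    __asm__ __volatile__("rdtsc; lfence; movl %0,%%ecx"
                        : : "m" (mem) : "eax", "edx", "ecx");
            }
            end = rdtsc();
            printf("%.1f cycles per iteration (loop overhead included)\n",
                (double)(end - start) / n);
            return (0);
    }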

Note that the get*() APIs have identical locking issues, so if you fix
them by adding memory barriers they will become slower than the current
non-get*() APIs are without locking, so their existence will be more
bogus than before (except with very slow timecounter hardware).

> I also did some restructuring of the userspace, removing layers that
> Bruce did not like. Now top-level functions directly call binuptime().
> I also shortened the preliminary operations by caching timekeep pointer.
> Its double-initialization is safe.
>
> Latest version is at
> http://people.freebsd.org/~kib/misc/moronix.4.patch

Thanks.  I didn't look at the patch.  To be happy with it, I would require:
- about 1/4 the size of the first version (6K) for at least the pure
   timecounter parts
- fix old kernel bugs:
   - figure out what needs to be done for time-domain locking
   - fix the bug reported by jhb, that times can go backwards due to old
     timehands having a slightly different frequency.
       (I tried to duplicate this in the kernel, but couldn't.  I used
       adjtime(2) with hacks to make it adjust the clock by +-0.5
       seconds/second.  A loop with "adjtime 1000; adjtime -1000" then
       gives huge swings in the frequency.  But clock_gettime() didn't
       show any negative differences.  I think the negative difference
       can't exceed ~100 nsec in magnitude, and since the syscall takes
       longer than that, even with the clock wrong by a factor of 2 due
       to the hacked adjtime, it can't see negative differences.)
   - figure out what TSC-low is fixing and fix it properly.  rdtsc is
     a non-serializing instruction.  Thus it can probably appear to go
     backwards.  TSC-low normally discards a factor of 128 of the precision
     of the TSC.  At 2GHz, this reduces the precision from 0.5 nsec to
     64 nsec.  Most negative differences would become 0.  I wonder if
     TSC-low is "working" just by hiding most negative differences.
     But it can only hide most (about 127/128) of them.  E.g., if the
     out-of-order times are 64 nsec and 63.5 nsec, then after discarding
     the low 7 bits (a factor of 128), the negative difference expands
     from -0.5 nsec to -64 nsec (see the arithmetic sketch after this
     list).

     Note that the large syscall overhead prevents seeing any small
     negative time differences from userland in the same way as above.
     But without that overhead, either in the kernel or in moronix
     userland, small negative time differences might be visible,
     depending on how small they are and on how efficient the functions
     are.

     TSC-low also breaks seeing small positive differences.  This
     breakage shows up if it is not hidden by syscall overhead or
     inefficient functions.  For some uses, truncating small positive
     differences to 0 is just as bad as negative differences -- you
     can't distinguish
     separate events using their timestamps.  Unfortunately, timecounters
     with low resolution have this problem unavoidably.  A TSC should
     at least be able to distinguish events that are separate at the
     cycle level, though since the x86 TSC is non-serializing it has a
     different type of fuzziness.  This fuzziness shouldn't be fixed
     by adding serialization instructions for it (though one for acq
     may do this accidentally), since that would make it much slower.
     rdtscp should rarely be used since it is serializing so it gives
     similar slowness.  Does it do any more than "cpuid; rdtsc"?
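
The arithmetic sketch promised above, showing how TSC-low truncation
turns a -0.5 nsec out-of-order difference into -64 nsec (the shift of
7 and the 2 GHz clock are the assumptions from my example):

    #include <stdint.h>
    #include <stdio.h>

    #define TSC_SHIFT 7                 /* TSC-low discards a factor of 128 */
    #define TSC_GHZ   2.0               /* assumed 2 GHz TSC */

    /* Truncate a raw TSC value as TSC-low would, then convert to nsec. */
    static double
    tsclow_nsec(uint64_t tsc)
    {
            return ((double)(tsc >> TSC_SHIFT << TSC_SHIFT) / TSC_GHZ);
    }

    int
    main(void)
    {
            uint64_t t1 = 128;          /* 64.0 nsec at 2 GHz */
            uint64_t t2 = 127;          /* 63.5 nsec, read "later" but smaller */

            printf("raw difference:     %.1f nsec\n",
                ((double)t2 - (double)t1) / TSC_GHZ);
            printf("TSC-low difference: %.1f nsec\n",
                tsclow_nsec(t2) - tsclow_nsec(t1));
            return (0);
    }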

> I will probably move all shared page helpers to a separate file from kern_exec.c,
> but this will happen after moronix is committed.

It's still moronix?  Why would we want that? :-)

Bruce


