Date:      Fri, 8 Jun 2012 14:28:50 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        freebsd-arch@freebsd.org
Subject:   Re: Fast gettimeofday(2) and clock_gettime(2)
Message-ID:  <20120608112850.GE85127@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120608155521.S1201@besplex.bde.org>
References:  <20120606165115.GQ85127@deviant.kiev.zoral.com.ua> <201206061423.53179.jhb@freebsd.org> <20120606205938.GS85127@deviant.kiev.zoral.com.ua> <201206070850.55751.jhb@freebsd.org> <20120607172839.GZ85127@deviant.kiev.zoral.com.ua> <20120608155521.S1201@besplex.bde.org>

On Fri, Jun 08, 2012 at 05:48:12PM +1000, Bruce Evans wrote:
> On Thu, 7 Jun 2012, Konstantin Belousov wrote:
>=20
> >On Thu, Jun 07, 2012 at 08:50:55AM -0400, John Baldwin wrote:
> >>On Wednesday, June 06, 2012 4:59:38 pm Konstantin Belousov wrote:
> >>>On Wed, Jun 06, 2012 at 02:23:53PM -0400, John Baldwin wrote:
> >>>>In general this looks good but I see a few nits / races:
> >>>>
> >>>>1) You don't follow the model of clearing tk_current to 0 while you
> >>>>   are updating the structure that the in-kernel timecounter code
> >>>>   uses.  This also means you have to avoid using a tk_current of 0
> >>>>   and that userland has to keep spinning as long as tk_current is 0.
> >>>>   Without this I believe userland can read a partially updated
> >>>>   structure.
> >>>I changed the code to be much more similar to kern_tc.c. I (re)added
> >>>the generation field, which is set to 0 while the kernel is touching the timehands.
> >>
> >>Thank you.  BTW, I think we should use atomic_load_acq_int() on both
> >>accesses to th_gen (and the in-kernel binuptime should do the same).
> >>I realize this requires using rmb before the while condition in userland
> >>since we can't use atomic_load_acq_int() here.  I think it should also use
> >>atomic_store_rel_int() for both stores to th_gen during the tc_windup()
> >>callback.
>
> The atomic_load_acq_int() (or rmb()) would completely defeat one of
> the main points in the design of binuptime(), which was to be lock-free
> so as to be efficient (the atomic_store_rel_int() is rarely done so
> fixing it doesn't affect efficiency, especially on x86 after kib's
> recent changes removed the serialization from it).  However, I now think
> the acq part of the load is needed even on x86.  x86 allows loads out of
> order, except in the case where the load is from the same address as a
> previous store.  So no explicit memory barrier is needed (on x86) for
> loads of th_generation to be ordered relative to stores to th_generation.
> But read barriers seem to be needed for loads of the variables protected
> by th_generation to be ordered relative to loads of th_generation.  An
> acq barrier for th_generation works somewhat bogusly (on x86) by supplying
> a barrier for the one variable that doesn't need it for ordering.
load_acq is not a lock, it only imposes serialization (ordering).
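
To make this concrete, here is a minimal sketch of the writer side of the
generation protocol, in the spirit of tc_windup(). The structure and field
names are made up for illustration (this is not the shared-page layout from
the patch), and C11 atomics stand in for atomic_store_rel_int():

/*
 * Illustrative only: a hypothetical snapshot as published by the kernel.
 */
#include <stdatomic.h>
#include <stdint.h>

struct th_snap {
	_Atomic uint32_t gen;		/* 0 means "being updated" */
	uint64_t	 offset_count;	/* timecounter reading at last windup */
	uint64_t	 scale;		/* conversion factor, e.g. 2^64 / freq */
};

static void
th_snap_update(struct th_snap *th, uint64_t count, uint64_t scale,
    uint32_t newgen)
{
	if (newgen == 0)
		newgen = 1;		/* 0 is reserved for "in progress" */

	/* Invalidate first, so readers that race with us keep spinning. */
	atomic_store_explicit(&th->gen, 0, memory_order_release);
	/*
	 * On weakly-ordered CPUs a store-store barrier is wanted here so
	 * that gen=0 becomes visible before the data stores; a release
	 * fence is the portable approximation.
	 */
	atomic_thread_fence(memory_order_release);

	th->offset_count = count;
	th->scale = scale;

	/* Publish: this release store plays the role of atomic_store_rel_int(). */
	atomic_store_explicit(&th->gen, newgen, memory_order_release);
}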

>
> The correct fix seems to be to use time-domain locking even more: set the
> timehands pointer to the previous generation instead of the current one.
> Modulo other bugs, this gives >= 1 msec for the previous generation to
> stabilize.  Speculative loads would have to be more than 1 msec in the
> past to cause problems.  But they can't be, since the thread must have
> been preempted for its speculative load to live that long, and the
> preemption would/should have issued a barrier instruction.  Except when
> the speculative load reaches a register before the preemption -- that case
> is handled by the generation count: since the timehands being used must
> be more than 1 generation behind for its th_generation to change, the
> memory barrier instruction for the preemption ensures that the change to
> th_generation is seen, so the new timehands is loaded.
>
> Second thoughts about whether x86 needs the acq barrier: stores to all
> the variables in tc_windup() are ordered by x86 memory semantics.  This
> gives them a good ordering relative to the stores to th_generation, or
> at least can do this.  A similar ordering is then forced for the loads
> in binuptime() etc, since x86 memory semantics ensure that each load
> occurs after the corresponding store to the same address.  Maybe this
> is enough, or can be made to be enough with a more careful ordering of
> the stores.  This is MD and hard to understand.
The ordering of loads with regard to stores to the same address only applies
on the same core. On x86, loads cannot be reordered with other loads,
but this could potentially happen on other arches.
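
Correspondingly, the reader needs the loads of the protected fields ordered
against both loads of the generation. A sketch of the read loop, with C11
atomics standing in for atomic_load_acq_int() and rmb() (same hypothetical
struct th_snap as in the writer sketch above):

#include <stdatomic.h>
#include <stdint.h>

struct th_snap {
	_Atomic uint32_t gen;
	uint64_t	 offset_count;
	uint64_t	 scale;
};

static void
th_snap_read(struct th_snap *th, uint64_t *count, uint64_t *scale)
{
	uint32_t gen;

	do {
		/* Acquire: the data loads below may not be hoisted above this. */
		gen = atomic_load_explicit(&th->gen, memory_order_acquire);
		*count = th->offset_count;
		*scale = th->scale;
		/* rmb() equivalent: data loads complete before gen is re-checked. */
		atomic_thread_fence(memory_order_acquire);
	} while (gen == 0 ||
	    gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}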

>
> >This is done. On the other hand, I removed a store_rel from updating
> >tk_current, since it is after the enabling store to th_gen, and the order
> >there does not matter.
>
> Sigh.  The extremes of some locking pessimizations on an Athlon64 i386
> UP are:
>
>     rdtsc                                takes  6.5 cycles
>     rdtsc; movl mem,%ecx                 takes  6.5 cycles
>     xchgl mem,%ecx                       takes 32 cycles
>     rdtsc; lfence; movl mem,%ecx         takes 34 cycles
>     rdtsc; xchgl mem,%ecx                takes 38 cycles
>     xchgl mem,%ecx; rdtsc                takes 40 cycles
>     xchgl mem,%eax; rdtsc                takes 40 cycles
>     rdtsc; xchgl mem,%eax                takes 44 cycles
>     rdtsc; mfence; movl mem,%ecx         takes 52 cycles
>
> So the software overheads are 5-8 times larger than the hardware overheads
> for a TSC timecounter, even when we only lock a single load.  Later CPUs
> have much slower rdtsc, taking 40+ cycles, so the software overheads are
> relatively smaller, especially since they are mostly in parallel with
> the slow rdtsc.  On core2 i386 SMP:
I suspect that what you measured for the fence overhead is actually the time
to retire the whole queue of read (and/or write) requests accumulated so far
in the pipeline, and not the overhead of a synchronous rdtsc read.
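
For what it is worth, a rough userland micro-benchmark of the kind you
describe (x86 inline asm, GCC/Clang; just a sketch, and as noted above the
fenced variant also pays for draining whatever loads/stores are in flight):

/* cc -O2 -o tscbench tscbench.c */
#include <stdint.h>
#include <stdio.h>

static inline uint64_t
rdtsc(void)
{
	uint32_t lo, hi;

	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return (((uint64_t)hi << 32) | lo);
}

static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t lo, hi;

	__asm __volatile("rdtsc; lfence" : "=a" (lo), "=d" (hi) : : "memory");
	return (((uint64_t)hi << 32) | lo);
}

#define	ITER	10000000UL

int
main(void)
{
	uint64_t start, end;
	unsigned long i;

	start = rdtsc();
	for (i = 0; i < ITER; i++)
		(void)rdtsc();
	end = rdtsc();
	printf("rdtsc:         %.1f cycles/call\n", (double)(end - start) / ITER);

	start = rdtsc();
	for (i = 0; i < ITER; i++)
		(void)rdtsc_lfence();
	end = rdtsc();
	printf("rdtsc; lfence: %.1f cycles/call\n", (double)(end - start) / ITER);
	return (0);
}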

>
>     rdtsc                                takes 65 cycles (yes, 10x slower)
>     rdtsc; movl mem,%ecx                 takes 65 cycles
>     xchgl mem,%ecx                       takes 25 cycles
>     rdtsc; lfence; movl mem,%ecx         takes 73 cycles
>     rdtsc; xchgl mem,%ecx                takes 74 cycles
>     xchgl mem,%ecx; rdtsc                takes 74 cycles
>     xchgl mem,%eax; rdtsc                takes 74 cycles
>     rdtsc; xchgl mem,%eax                takes 74 cycles
>     rdtsc; mfence; movl mem,%ecx         takes 69 cycles (yes, beats lfence)
>
> Note that the get*() APIs have identical locking issues, so if you fix
> them by adding memory barriers they will become slower than the current
> non-get*() APIs are without locking, so their existence will be more
> bogus than before (except with very slow timecounter hardware).
>
> >I also did some restructuring of the userspace, removing layers that
> >Bruce did not like. Now top-level functions directly call binuptime().
> >I also shortened the preliminary operations by caching timekeep pointer.
> >Its double-initialization is safe.
> >
> >Latest version is at
> >http://people.freebsd.org/~kib/misc/moronix.4.patch
>
> Thanks.  I didn't look at the patch.  To be happy with it, I would require:
> - about 1/4 the size of the first version (6K) for at least the pure
>   timecounter parts
This might already be the case, since I removed the layering you did not
like from usermode.

> - fix old kernel bugs:
>   - figure out what needs to be done for time-domain locking
>   - fix the bug reported by jhb, that times can go backwards due to old
>     timehands having a slightly different frequency.
>       (I tried to duplicate this in the kernel, but couldn't.  I used
>       adjtime(2) with hacks to make it adjust the clock by +-0.5
>       seconds/second.  A loop with "adjtime 1000; adjtime -1000" then
>       gives huge swings in the frequency.  But clock_gettime() didn't
>       show any negative differences.  I think the negative difference
>       can't be smaller than ~100 nsec, and since the syscall takes
>       longer than that even with the clock wrong by a factor of 2 due to the
>       hacked adjtime, it can't see negative differences.)
>   - figure out what TSC-low is fixing and fix it properly.  rdtsc is
>     a non-serializing instruction.  Thus it can probably appear to go
>     backwards.  TSC-low normally discards a factor of 128 of the precision
>     of the TSC.  At 2GHz, this reduces the precision from 0.5 nsec to
>     64 nsec.  Most negative differences would become 0.  I wonder if
>     TSC-low is "working" just by hiding most negative differences.
>     But it can only hide most (about 127/128) of them.  E.g., if the
>     out-of-order times are 64 nsec and 63.5 nsec, then after truncating
>     to 64-nsec granularity, the negative difference expands from -0.5 nsec to
>     -64 nsec.
The goal of the patch is only to move the code from the kernel into userspace,
trying not to change the algorithms. The potential changes you describe
above should be done both in the kernel and in usermode after that.
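
To illustrate the TSC-low point above with numbers: shifting the TSC right by
7 bits (the usual factor of 128) turns a 2 GHz counter's 0.5 nsec resolution
into 64 nsec, and when two slightly out-of-order readings straddle a 128-cycle
boundary the truncation widens the negative difference instead of hiding it.
A toy calculation, not kernel code:

#include <stdint.h>
#include <stdio.h>

#define	TSC_SHIFT	7	/* "TSC-low" discards the low 7 bits (factor 128) */

int
main(void)
{
	/*
	 * Two raw TSC readings 1 cycle apart but observed out of order,
	 * straddling a 128-cycle boundary (2 GHz => 64 nsec per 128 cycles).
	 */
	uint64_t t1 = 128;	/* taken first, but logically later */
	uint64_t t2 = 127;	/* taken second */

	int64_t raw_diff = (int64_t)(t2 - t1);	/* -1 cycle = -0.5 nsec */
	int64_t low_diff = (int64_t)((t2 >> TSC_SHIFT) - (t1 >> TSC_SHIFT));

	printf("raw difference:     %lld cycles (%.1f nsec at 2 GHz)\n",
	    (long long)raw_diff, raw_diff * 0.5);
	printf("TSC-low difference: %lld units (%.1f nsec at 2 GHz)\n",
	    (long long)low_diff, low_diff * 64.0);
	return (0);
}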

>
>     Note that the large syscall overhead prevents seeing any small
>     negative time differences from userland in the same way as above.
>     But without that overhead, either in the kernel or in moronix
>     userland, small negative time differences might be visible,
>     depending on how small they are and on how efficient the functions
>     are.
So far I have not seen time going backward in a tight gettimeofday() loop.
This is indeed one of my main worries.
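
The check is essentially a tight loop like the following sketch (shown with
clock_gettime(CLOCK_MONOTONIC); the gettimeofday() variant is the same idea):

/* Spin on clock_gettime() and report any non-monotonic step. */
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec prev, cur;

	clock_gettime(CLOCK_MONOTONIC, &prev);
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &cur);
		if (cur.tv_sec < prev.tv_sec ||
		    (cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec)) {
			printf("backstep: %ld.%09ld -> %ld.%09ld\n",
			    (long)prev.tv_sec, prev.tv_nsec,
			    (long)cur.tv_sec, cur.tv_nsec);
		}
		prev = cur;
	}
}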

>
>     TSC-low also breaks seeing small positive differences.  This
>     breakage shows up if it is not hidden by syscall overhead or inefficient
>     functions.  For some uses, truncating small positive differences
>     to 0 is just as bad as negative differences -- you can't distinguish
>     separate events using their timestamps.  Unfortunately, timecounters
>     with low resolution have this problem unavoidably.  A TSC should
>     at least be able to distinguish events that are separate at the
>     cycle level, though since the x86 TSC is non-serializing it has a
>     different type of fuzziness.  This fuzziness shouldn't be fixed
>     by adding serialization instructions for it (though one for acq
>     may do this accidentally), since that would make it much slower.
>     rdtscp should rarely be used since it is serializing so it gives
>     similar slowness.  Does it do any more than "cpuid; rdtsc"?
rdtscp allows us to atomically read the current package TSC counter and obtain
a reference to the current core identifier. If we produce 'skew tables'
to compensate for different TSC initial values and possible drift, then
we could use the TSC counter on a wider range of hardware, by adjusting
the value returned from rdtsc by the skew-table correction addendum. Rdtscp is
atomic in this respect.
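
A sketch of what the skew-table correction could look like; the table and how
it gets filled are entirely hypothetical here, and the kernel would have to
program IA32_TSC_AUX with a usable CPU index. The point is only that rdtscp
hands back that index atomically with the counter value:

#include <stdint.h>

#define	MAXCPU	256

/*
 * Hypothetical per-CPU additive corrections, filled in by some
 * calibration step that is not shown here.
 */
static int64_t tsc_skew[MAXCPU];

static inline uint64_t
rdtscp(uint32_t *cpu)
{
	uint32_t lo, hi, aux;

	/* IA32_TSC_AUX (set up by the kernel) identifies the CPU. */
	__asm __volatile("rdtscp" : "=a" (lo), "=d" (hi), "=c" (aux));
	*cpu = aux;
	return (((uint64_t)hi << 32) | lo);
}

static inline uint64_t
tsc_read_corrected(void)
{
	uint32_t cpu;
	uint64_t tsc;

	/*
	 * rdtscp returns the counter and the CPU id from one instruction,
	 * so the correction is applied for the CPU the counter was actually
	 * read on, even if the thread migrates right afterwards.
	 */
	tsc = rdtscp(&cpu);
	return (tsc + (uint64_t)tsc_skew[cpu % MAXCPU]);
}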

>
> >I will probably move all shared page helpers from kern_exec.c to a separate
> >file, but this will happen after moronix is committed.
>
> It's still moronix?  Why would we want that? :-)
>
> Bruce
