Date:      Thu, 26 Jul 2012 13:49:18 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        Jim Harris <jimharris@freebsd.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, Andriy Gapon <avg@freebsd.org>, svn-src-head@freebsd.org, Jung-uk Kim <jkim@freebsd.org>
Subject:   Re: svn commit: r238755 - head/sys/x86/x86
Message-ID:  <20120726104918.GW2676@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120726170837.Q2536@besplex.bde.org>
References:  <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <500FE6AE.8070706@FreeBSD.org> <20120726001659.M5406@besplex.bde.org> <50102C94.9030706@FreeBSD.org> <20120725180537.GO2676@deviant.kiev.zoral.com.ua> <50103C61.8040904@FreeBSD.org> <20120726170837.Q2536@besplex.bde.org>

On Thu, Jul 26, 2012 at 05:35:23PM +1000, Bruce Evans wrote:
> On Wed, 25 Jul 2012, Jung-uk Kim wrote:
> >>For some unrelated reasons, we do have lfence;rdtsc sequence in
> >>the userland already. Well, it is not exactly such sequence, there
> >>are some instructions between, but the main fact is that two
> >>consecutive invocations of gettimeofday(2) (*) or clock_gettime(2)
> >>are interleaved with lfence on Intels, guaranteeing that backstep
> >>of the counter is impossible.
>
> In fact, there is always a full documented serialization instruction
> for syscalls, except maybe in FreeBSD-1 compat code on i386, at
> least on Athlon64.  i386 syscalls use int 0x80 (except in FreeBSD-1
> compat code, where they use lcalls), and the iret necessary to return
> from this is serializing on at least Athlon64.  amd64 syscalls use
> sysenter/sysret.  sysret isn't serializing (like far returns), at least
> on Athlon64, but at least in FreeBSD, the syscall implementation uses
> at least 2 swapgs's (one on entry and one just before the sysret), and
> swapgs is serializing, at least on Athlon64.
Yes, but SWAPGS is not documented as serializing on Intel CPUs. I reviewed
the whole syscall sequence for e.g. gettimeofday(2), and there is no
serialization point on the fast path. An AST, for example, would add
locking and thus serialization, as would a return by IRET, but the fast
path on amd64 has neither.

>
> >>* - it is not a syscall anymore.
> >>
> >>As I said, using recommended mfence;rdtsc sequence for AMDs would
> >>require some work, but let's handle the kernel and userspace issues
> >>separately.
>
> Benchmarks for various methods on AthlonXP: I started with a program
> that loops making a few million clock_gettime() calls:
>
>     unchanged program: 1.15 seconds
>     add lfence:        1.16 seconds
>     add mfence:        1.15 seconds (yes, faster than lfence)
>     add atomic_cmpset: 1.20 seconds
>     add cpuid:         1.25 seconds
>
> >>And, I really failed to find what the patch from the thread you
> >>referenced tried to fix.
> >
> >The patch was supposed to reduce a barrier, i.e., vsyscall
> >optimization.  Please note I brought it up at the time, not because it
> >fixed any problem but because we completely lack necessary serialization.
> >
> >>Was it really committed into Linux ?
> >
> >Yes, it was committed in a simpler form:
> >
> >http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=057e6a8c660e95c3f4e7162e00e2fee1fc90c50d
> >
> >This function was moved around from time to time and now it sits here:
> >
> >http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=arch/x86/vdso/vclock_gettime.c
> >
> >It still carries one barrier before rdtsc.  Please see the comments.
>
> For safety, you probably need to use the slowest (cpuid) method.  Linux
> seems to be just using fences that are observed to work.
No, the vendor documentation explicitly names the recommended barriers:
LFENCE for Intel and MFENCE for AMD. My patch just follows what the
documentation suggests.
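
A minimal sketch of such a fenced read, assuming GCC/Clang inline
assembly. fenced_rdtsc() and its use_mfence parameter are hypothetical
names for illustration, not the code from the patch, and the non-x86
fallback exists only so the sketch compiles everywhere:

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() in the fallback */
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Read the TSC with a barrier in front of it, so that the read cannot be
 * executed ahead of older instructions.  use_mfence != 0 selects MFENCE
 * (the AMD recommendation), otherwise LFENCE (the Intel recommendation).
 */
static inline uint64_t
fenced_rdtsc(int use_mfence)
{
#if defined(__x86_64__) || defined(__i386__)
	uint32_t lo, hi;

	if (use_mfence)
		__asm__ __volatile__("mfence; rdtsc"
		    : "=a" (lo), "=d" (hi) : : "memory");
	else
		__asm__ __volatile__("lfence; rdtsc"
		    : "=a" (lo), "=d" (hi) : : "memory");
	return (((uint64_t)hi << 32) | lo);
#else
	/* Assumption: no TSC here, so use the monotonic clock instead. */
	struct timespec ts;
	int rc;

	(void)use_mfence;
	rc = clock_gettime(CLOCK_MONOTONIC, &ts);
	assert(rc == 0);
	return ((uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec);
#endif
}
```

A real implementation would pick the barrier once, from the CPU vendor,
rather than test a flag on every call.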

>
> Original Athlon64 manuals say this about rdtsc: "... not serializing...
> even when bound by serializing instructions, the system environment at
> the time the instruction is executed can cause additional cycles
> [before it reaches EDX:EAX]".
Both Intel and AMD current manuals state that RDTSC is not serializing.
RDTSCP is documented by AMD as "forces all older instructions to retire
before reading the time-stamp counter." Intel says essentially the same.

[Replying to other mail in-place, the thread goes wild]

On Thu, Jul 26, 2012 at 04:25:01PM +1000, Bruce Evans wrote:
> On Wed, 25 Jul 2012, Konstantin Belousov wrote:
>
> >On Wed, Jul 25, 2012 at 11:00:41AM -0700, Jim Harris wrote:
> >>I wonder if instead of timecounter going backward, that TSC test
> >>fails because CPU speculatively performs rdtsc instruction in relation
> >>to waiter checks in smp_rendezvous_action.  Or maybe we are saying
> >>the same thing.
> >
> >Ok, the definition of the 'timecounter goes back', as I understand it:
> >
> >you have two events A and B in two threads, provably ordered, say, A is
> >a lock release and B is the same lock acquisition. Assume that you take
> >rdtsc values tA and tB under the scope of the lock right before A and
> >right after B. Then it should be impossible to have tA > tB.
>
> For the threaded case, there has to be something for the accesses to be
> provably ordered.  It is hard to see how the something can be strong
> enough unless it serializes all thread state in A and B.  The rdtsc
> state is not part of the thread state as known to APIs, but it is hard
> to see how threads can serialize themselves without also serializing
> the TSC.
The TSC read is not synchronized, and I found the Linux test for the
thing I described above. An adapted version is available at
http://people.freebsd.org/~kib/misc/time-warp-test.c.
It must be compiled in 32-bit mode only.

The code does a full lock/unlock around RDTSC. Please note that there is
a CPUID instruction commented out in __rdtscll().

On my Nehalem workstation, I get an enormous number of wraps reported for
RDTSC without CPUID. Adding CPUID back fixes the issue. So at least on
Nehalem (and probably Westmere, I will test later today) RDTSC can even
pass LOCKed instructions.

Curiously enough, SandyBridge is sane and reports zero wraps; it seems
Intel fixed the bug.
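
The shape of such a check can be sketched as follows. This is not the
code from time-warp-test.c: it uses a pthread mutex instead of the
test's own lock, all names are hypothetical, and the two software
counters only exercise the harness. To test real hardware, pass a
function that reads the TSC.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define	MAX_THREADS	16

typedef uint64_t (*read_counter_fn)(void);

struct warp_test {
	pthread_mutex_t	lock;
	read_counter_fn	read_counter;
	unsigned long	iterations;	/* reads per thread */
	uint64_t	last;		/* newest value seen under the lock */
	uint64_t	warps;		/* reads that were older than 'last' */
};

/* Reference counter that can never step back (called under the lock). */
static uint64_t
steady_counter(void)
{
	static uint64_t n;

	return (++n);
}

/* Deliberately broken counter: every 4th read is 2 ticks in the past. */
static uint64_t
sawtooth_counter(void)
{
	static uint64_t n;

	n++;
	return (n % 4 == 0 ? n - 2 : n);
}

static void *
warp_worker(void *arg)
{
	struct warp_test *wt = arg;

	for (unsigned long i = 0; i < wt->iterations; i++) {
		pthread_mutex_lock(&wt->lock);
		uint64_t now = wt->read_counter();
		if (now < wt->last)
			wt->warps++;	/* tA > tB: time went backwards */
		else
			wt->last = now;
		pthread_mutex_unlock(&wt->lock);
	}
	return (NULL);
}

/* Run nthreads workers and return how many backward steps they saw. */
static uint64_t
count_warps(read_counter_fn fn, unsigned long iterations, int nthreads)
{
	struct warp_test wt = { .read_counter = fn, .iterations = iterations };
	pthread_t tid[MAX_THREADS];
	int i, rc;

	assert(nthreads > 0 && nthreads <= MAX_THREADS);
	rc = pthread_mutex_init(&wt.lock, NULL);
	assert(rc == 0);
	for (i = 0; i < nthreads; i++) {
		rc = pthread_create(&tid[i], NULL, warp_worker, &wt);
		assert(rc == 0);
	}
	for (i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	pthread_mutex_destroy(&wt.lock);
	return (wt.warps);
}
```

With two threads and 1000 iterations each, steady_counter gives 0 wraps
and sawtooth_counter gives 500, one for every fourth read.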

>
> For most uses, the scope of the serialization and locking also needs
> to extend across multiple timer reads.  Otherwise you can have situations
> like:
>
> 	read the time
> 		interrupt or context switch
> 			read later time in other intr handler/thread
> 			save late time
> 		back to previous context
> 	save earlier time
>
> It is unclear how to even prevent such situations.  You (at least, I)
> don't want heavyweight locking/synchronization to prevent the context
> switches.  And the kernel rarely if ever does such synchronization.
> binuptime() has none internally.  It just spins if necessary until the
> read becomes stable.  Most callers of binuptime() just call it.
>
> >I do not think that we can ever observe tA > tB if both threads are
> >executing on the same CPU.
>=20
> I thought that that was the problem, with a single thread and no context
> switches seeing the TSC go backwards.  Even then, it would take
> non-useful behaviour (except for calibration and benchmarks) like
> spinning executing rdtsc to see it going backwards.  Normally there
> are many instructions between rdtsc's and the non-serialization isn't
> as deep as that.  Using syscalls, you just can't read the timecounter
> without about 1000 cycles between reads.  When there is a context switch,
> there is usually accidental serialization from locking.
>
> I care about timestamps being ordered more than most people, and tried
> to kill the get*time() APIs because they are weakly ordered relative
> to the non-get variants (they return times in the past, and there is
> no way to round down to get consistent times).  I tried to fix them
> by adding locking and updating them to the latest time whenever a
> non-get variant gives a later time (by being used).  This was too slow,
> and breaks the design criteria that timecounter calls should not use
> any explicit locking.  However, if you want slowness, then you can get
> it similarly by fixing the monotonicity of rdtsc in software.  I think
> I just figured out how to do this with the same slowness as serialization,
> if a locked instruction serializes; maybe less otherwise:
>
> spin:
> 	ptsc = prev_tsc;	/* memory -> local (intentionally !atomic) */
> 	tsc = rdtsc();		/* only 32 bits for timecounters */
> 	if (tsc <= ptsc) {	/* I forgot about wrap at first -- see below */
> 		/*
> 		 * It went backwards, or stopped.  Could handle more
> 		 * completely, starting with panic() to see if this
> 		 * happens at all.
> 		 */
> 		return (ptsc);	/* stopped is better than backwards */
> 	}
> 	/* Usual case; update (32 bits). */
> 	if (atomic_cmpset_int(&prev_tsc, ptsc, tsc))
> 		return (tsc);
> 	goto spin;
I do not understand this. The algorithm is clear, but what you propose is
very heavyweight compared with just adding LFENCE or MFENCE before rdtsc.
First, the cache line for prev_tsc becomes heavily contended. Second, CAS
is expensive. LFENCE is fully local to the core it executes on.
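
For reference, a sketch of the quoted scheme in C11 atomics, which makes
both costs visible: every caller loads and then tries to CAS the same
shared word. monotonic_clamp() is a hypothetical name. Unlike the quoted
pseudocode, the sample is passed in rather than re-read with rdtsc on
retry, and the signed difference is an assumption about how the 32-bit
wrap could be handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* The scheme is only attractive if the 32-bit CAS is lock-free. */
static_assert(ATOMIC_INT_LOCK_FREE == 2, "need lock-free 32-bit atomics");

/* Newest sample handed out so far; shared by all CPUs (contended). */
static _Atomic uint32_t prev_tsc;

/*
 * Clamp a raw 32-bit counter sample so that callers never see a value
 * older than one already returned to somebody else.
 */
static uint32_t
monotonic_clamp(uint32_t raw)
{
	uint32_t prev = atomic_load_explicit(&prev_tsc, memory_order_relaxed);

	for (;;) {
		/* Signed difference tolerates wrap of the 32-bit counter. */
		if ((int32_t)(raw - prev) <= 0)
			return (prev);	/* stopped is better than backwards */
		/* Usual case: publish the newer sample. */
		if (atomic_compare_exchange_weak(&prev_tsc, &prev, raw))
			return (raw);
		/* CAS failed: prev now holds the current value; retry. */
	}
}
```

A sample more than 2^31 ticks ahead of prev_tsc is indistinguishable
from a backward step with this comparison, so callers must sample often
enough to avoid that.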
