Date:      Wed, 25 Jul 2012 16:37:54 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Andriy Gapon <avg@freebsd.org>
Cc:        svn-src-head@freebsd.org, Jim Harris <jimharris@freebsd.org>, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r238755 - head/sys/x86/x86
Message-ID:  <20120725133754.GK2676@deviant.kiev.zoral.com.ua>
In-Reply-To: <500FE6AE.8070706@FreeBSD.org>
References:  <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <500FE6AE.8070706@FreeBSD.org>

On Wed, Jul 25, 2012 at 03:29:34PM +0300, Andriy Gapon wrote:
> on 25/07/2012 13:21 Konstantin Belousov said the following:
> > On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
> >> on 25/07/2012 01:10 Jim Harris said the following:
> >>> Author: jimharris
> >>> Date: Tue Jul 24 22:10:11 2012
> >>> New Revision: 238755
> >>> URL: http://svn.freebsd.org/changeset/base/238755
> >>>
> >>> Log:
> >>>   Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
> >>>
> >>>   Intel Architecture Manual specifies that rdtsc instruction is not
> >>>   serialized, so without this change, TSC synchronization test would
> >>>   periodically fail, resulting in use of HPET timecounter instead of
> >>>   TSC-low.  This caused severe performance degradation (40-50%) when
> >>>   running high IO/s workloads due to HPET MMIO reads and GEOM stat
> >>>   collection.
> >>>
> >>>   Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC
> >>>   synchronization fail approximately 20% of the time.
> >>
> >> Should the synchronization test rather be fixed, if it's the culprit?
> > Synchronization test for what?
> 
> The synchronization test mentioned above.
> So, oops, very sorry - I missed the fact that the change was precisely in
> the test.  I confused it with another place where the TSC is used.  Thank
> you for pointing this out.
> 
> >> Or is this change universally good for the real uses of TSC?
> > 
> > What I understood from the Intel SDM, and also from additional experiments
> > which Jim kindly made despite me being annoying as usual, is that the 'read
> > memory barrier' AKA LFENCE is used there for its secondary implementation
> > effects, not for the load/load barrier as you might assume.
> > 
> > According to the SDM, LFENCE fully drains the execution pipeline (but,
> > unlike MFENCE, does not drain the write buffers).  The result is that
> > RDTSC is not started before the previous instructions have finished.
> 
> Yes, I am fully aware of this.
> 
> > For the TSC test, this means that after the change RDTSC executions are
> > no longer reordered among themselves on a single core.  As I understand
> > it, the CPU notes no dependency between two reads of the TSC by RDTSC,
> > which allows the later read to return a lower counter value.  Intel fixed
> > this by introducing the RDTSCP instruction, which is defined to be a
> > serialization point; using it (instead of the LFENCE; RDTSC sequence)
> > also fixes the test, as confirmed by Jim.
> 
> Yes.  I think that previously Intel recommended preceding rdtsc with cpuid
> for all the same reasons.  Not sure if there is any performance difference
> compared to lfence.
> Unfortunately, rdtscp is not available on all CPUs, so using it would
> require extra work.
> 
> > In fact, I now think that we should also apply the following patch.
> > Otherwise, consecutive calls to e.g. binuptime(9) could return decreasing
> > time stamps.  Note that libc's __vdso_gettc.c already has an LFENCE near
> > the TSC reads; it was not put there for this reason, but apparently it is
> > needed for this reason too.
> > 
> > diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> > index 085c339..229b351 100644
> > --- a/sys/x86/x86/tsc.c
> > +++ b/sys/x86/x86/tsc.c
> > @@ -594,6 +594,7 @@ static u_int
> >  tsc_get_timecount(struct timecounter *tc __unused)
> >  {
> > 
> > +	rmb();
> >  	return (rdtsc32());
> >  }
> > 
> 
> This makes sense to me.  We probably want correctness over performance here.
> [BTW, I originally thought that the change was here; brain malfunction]
> 
> > @@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
> >  {
> >  	uint32_t rv;
> > 
> > +	rmb();
> >  	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> > -	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> > +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> >  	return (rv);
> >  }
> > 
> 
> It would be correct here too, but I am not sure if it would make any
> difference, given that some lower bits are discarded anyway.  Probably
> depends on the exact CPU.
> 
> 
> And, oh hmm, I read the AMD Software Optimization Guide for AMD Family 10h
> Processors and they suggest using cpuid (with a note that it may be
> intercepted in virtualized environments) or _mfence_ in the discussed
> role (Appendix F of the document). Googling for 'rdtsc mfence lfence'
> yields some interesting results.
Yes, MFENCE for AMD.

Since I was infected with these Google results anyway, I looked at the
Linux code.  Apparently, they use MFENCE on AMD and LFENCE on Intel.
They also seem to use LFENCE on VIA.  Intel documentation claims that
MFENCE does not serialize instruction execution, contrary to the LFENCE
behaviour we rely on here.
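
To make the two sequences concrete, the fenced reads being discussed look
roughly like this (illustration only, not the patch itself; in the kernel
the fences are spelled rmb()/mb()):

#include <stdint.h>

/*
 * Illustration: fenced 64-bit TSC reads.  On Intel and VIA an LFENCE in
 * front of RDTSC is sufficient; AMD documents MFENCE for the same purpose.
 */
static inline uint64_t
rdtsc_lfence(void)
{
	uint32_t lo, hi;

	__asm __volatile("lfence; rdtsc" : "=a" (lo), "=d" (hi) : : "memory");
	return (((uint64_t)hi << 32) | lo);
}

static inline uint64_t
rdtsc_mfence(void)
{
	uint32_t lo, hi;

	__asm __volatile("mfence; rdtsc" : "=a" (lo), "=d" (hi) : : "memory");
	return (((uint64_t)hi << 32) | lo);
}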

So we definitely want to add some barrier right before rdtsc, and we do
want LFENCE for Intel.  The patch below ends with the following code:

Dump of assembler code for function tsc_get_timecount_lfence:
   0xffffffff805563a0 <+0>:     push   %rbp
   0xffffffff805563a1 <+1>:     mov    %rsp,%rbp
   0xffffffff805563a4 <+4>:     lfence
   0xffffffff805563a7 <+7>:     rdtsc
   0xffffffff805563a9 <+9>:     leaveq
   0xffffffff805563aa <+10>:    retq
Dump of assembler code for function tsc_get_timecount_low_lfence:
   0xffffffff805556a0 <+0>:     push   %rbp
   0xffffffff805556a1 <+1>:     mov    %rsp,%rbp
   0xffffffff805556a4 <+4>:     lfence
   0xffffffff805556a7 <+7>:     mov    0x30(%rdi),%rcx
   0xffffffff805556ab <+11>:    rdtsc
   0xffffffff805556ad <+13>:    shrd   %cl,%edx,%eax
   0xffffffff805556b0 <+16>:    leaveq
   0xffffffff805556b1 <+17>:    retq
which I would qualify as a perfect outcome.
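
For reference, the lfence in that disassembly comes straight from the rmb()
macro used in the patch: on amd64 the fence macros in <machine/atomic.h> are
defined roughly as

#define	mb()	__asm __volatile("mfence;" : : : "memory")
#define	wmb()	__asm __volatile("sfence;" : : : "memory")
#define	rmb()	__asm __volatile("lfence;" : : : "memory")

so rmb() in front of rdtsc gives the LFENCE sequence, and mb() gives MFENCE.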

The patch below still leaves libc suitable only for Intel and VIA, not for
AMD, but I will take care of that later.
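
Just to sketch the direction for the libc side (the names below are made up,
this is not the actual __vdso_gettc.c code): the userland read would need the
same vendor-dependent fence, selected once at startup, e.g.:

#include <sys/types.h>

static int amd_cpu;	/* hypothetical flag, set once from the CPUID vendor string */

static inline u_int
fenced_rdtsc32(void)
{
	u_int rv;

	if (amd_cpu)
		__asm __volatile("mfence" : : : "memory");
	else		/* Intel, VIA */
		__asm __volatile("lfence" : : : "memory");
	__asm __volatile("rdtsc" : "=a" (rv) : : "edx");
	return (rv);
}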

diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
index c253a96..06cfe10 100644
--- a/sys/x86/x86/tsc.c
+++ b/sys/x86/x86/tsc.c
@@ -83,6 +83,10 @@ static void tsc_freq_changing(void *arg, const struct cf_level *level,
     int *status);
 static unsigned tsc_get_timecount(struct timecounter *tc);
 static unsigned tsc_get_timecount_low(struct timecounter *tc);
+static unsigned tsc_get_timecount_lfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_mfence(struct timecounter *tc);
+static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc);
 static void tsc_levels_changed(void *arg, int unit);
 
 static struct timecounter tsc_timecounter = {
@@ -262,6 +266,10 @@ probe_tsc_freq(void)
 		    (vm_guest == VM_GUEST_NO &&
 		    CPUID_TO_FAMILY(cpu_id) >= 0x10))
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_mfence;
+		}
 		break;
 	case CPU_VENDOR_INTEL:
 		if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 ||
@@ -271,6 +279,10 @@ probe_tsc_freq(void)
 		    (CPUID_TO_FAMILY(cpu_id) == 0xf &&
 		    CPUID_TO_MODEL(cpu_id) >= 0x3))))
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_lfence;
+		}
 		break;
 	case CPU_VENDOR_CENTAUR:
 		if (vm_guest == VM_GUEST_NO &&
@@ -278,6 +290,10 @@ probe_tsc_freq(void)
 		    CPUID_TO_MODEL(cpu_id) >= 0xf &&
 		    (rdmsr(0x1203) & 0x100000000ULL) == 0)
 			tsc_is_invariant = 1;
+		if (cpu_feature & CPUID_SSE2) {
+			tsc_timecounter.tc_get_timecount =
+			    tsc_get_timecount_lfence;
+		}
 		break;
 	}
 
@@ -328,7 +344,15 @@ init_TSC(void)
 
 #ifdef SMP
 
-/* rmb is required here because rdtsc is not a serializing instruction. */
+/*
+ * RDTSC is not a serializing instruction, so we need to drain the
+ * instruction stream before executing it.  This could be fixed by using
+ * RDTSCP, except that the instruction is not available everywhere.
+ *
+ * Use both MFENCE (for AMD CPUs) and LFENCE (for others, i.e. Intel and
+ * VIA), and assume that the SMP test is only performed on CPUs that
+ * have SSE2 anyway.
+ */
 #define	TSC_READ(x)			\
 static void				\
 tsc_read_##x(void *arg)			\
@@ -337,6 +361,7 @@ tsc_read_##x(void *arg)			\
 	u_int cpu = PCPU_GET(cpuid);	\
 					\
 	rmb();				\
+	mb();				\
 	tsc[cpu * 3 + x] = rdtsc32();	\
 }
 TSC_READ(0)
@@ -487,7 +512,16 @@ init:
 	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
 		;
 	if (shift > 0) {
-		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
+		if (cpu_feature & CPUID_SSE2) {
+			if (cpu_vendor_id == CPU_VENDOR_AMD) {
+				tsc_timecounter.tc_get_timecount =
+				    tsc_get_timecount_low_mfence;
+			} else {
+				tsc_timecounter.tc_get_timecount =
+				    tsc_get_timecount_low_lfence;
+			}
+		} else
+			tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
 		tsc_timecounter.tc_name = "TSC-low";
 		if (bootverbose)
 			printf("TSC timecounter discards lower %d bit(s)\n",
@@ -592,23 +626,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
 SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
     0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
 
-static u_int
+static inline u_int
 tsc_get_timecount(struct timecounter *tc __unused)
 {
 
 	return (rdtsc32());
 }
 
-static u_int
+static inline u_int
 tsc_get_timecount_low(struct timecounter *tc)
 {
 	uint32_t rv;
 
 	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
-	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
+	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
 	return (rv);
 }
 
+static inline u_int
+tsc_get_timecount_lfence(struct timecounter *tc __unused)
+{
+
+	rmb();
+	return (rdtsc32());
+}
+
+static inline u_int
+tsc_get_timecount_low_lfence(struct timecounter *tc)
+{
+
+	rmb();
+	return (tsc_get_timecount_low(tc));
+}
+
+static inline u_int
+tsc_get_timecount_mfence(struct timecounter *tc __unused)
+{
+
+	mb();
+	return (rdtsc32());
+}
+
+static inline u_int
+tsc_get_timecount_low_mfence(struct timecounter *tc)
+{
+
+	mb();
+	return (tsc_get_timecount_low(tc));
+}
+
 uint32_t
 cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
 {
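
As a quick sanity check after booting a kernel with this patch (not part of
the patch itself), one can verify that TSC-low is selected again instead of
HPET, e.g. with a trivial sysctl(3) consumer:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	char tc[32];
	size_t len = sizeof(tc);

	/* kern.timecounter.hardware names the currently active timecounter. */
	if (sysctlbyname("kern.timecounter.hardware", tc, &len, NULL, 0) == 0)
		printf("active timecounter: %s\n", tc);
	return (0);
}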
