Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 25 Jul 2012 20:32:12 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        svn-src-head@freebsd.org, Jim Harris <jimharris@freebsd.org>, svn-src-all@freebsd.org, src-committers@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject:   Re: svn commit: r238755 - head/sys/x86/x86
Message-ID:  <20120725173212.GN2676@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120725233033.N5406@besplex.bde.org>
References:  <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <20120725233033.N5406@besplex.bde.org>

next in thread | previous in thread | raw e-mail | index | archive | help

--ruMf52VDn1nA85Y/
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Jul 26, 2012 at 12:15:54AM +1000, Bruce Evans wrote:
> On Wed, 25 Jul 2012, Konstantin Belousov wrote:
>=20
> >On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
> >>on 25/07/2012 01:10 Jim Harris said the following:
> >>>Author: jimharris
> >>>Date: Tue Jul 24 22:10:11 2012
> >>>New Revision: 238755
> >>>URL: http://svn.freebsd.org/changeset/base/238755
> >>>
> >>>Log:
> >>>  Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
> >>>
> >>>  Intel Architecture Manual specifies that rdtsc instruction is not=20
> >>>  serialized,
> >>>  so without this change, TSC synchronization test would periodically=
=20
> >>>  fail,
> >>>  resulting in use of HPET timecounter instead of TSC-low.  This caused
> >>>  severe performance degradation (40-50%) when running high IO/s=20
> >>>  workloads due to
> >>>  HPET MMIO reads and GEOM stat collection.
> >>>
> >>>  Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC=20
> >>>  synchronization
> >>>  fail approximately 20% of the time.
> >>
> >>Should rather the synchronization test be fixed if it's the culprit?
> >Synchronization test for what ?
> >
> >>Or is this change universally good for the real uses of TSC?
>=20
> It's too slow for real uses.  But synchronization code, and some uses
> that requires serialization may need it for, er, synchronization and
> serialization.
>=20
> It's hard to think of many uses that need serialization.  I often use
> it for timing instructions.  For timng a large number of instructions,
> serialization doesn't matter since errors of a few tens in a few billion
> done matter.  For timing a small number of instructions, I don't want
> serialization, since the serialization invalidates the timing.
>=20
> Most uses in FreeBSD are for timecounters.  Timecounters deliver the
> current time.  This is unrelated to whatever instructions haven't
> completed when the TSC is read.  Except possibly when the time needs
> to be synchronized across CPUs, and when the uncompleted instruction
> is a TSC read.
>=20
> >For tsc test, this means that after the change RDTSC executions are not
> >reordered on the single core among themself. As I understand, CPU has
> >no dependency noted between two reads of tsc by RDTSC, which allows
> >later read to give lower value of counter.
>=20
> Gak.  Even when they are in the same instruction sequence?  Even though
> the TSC reads fixed registers and some other instructions in the sequence
> between the TSC use these registers?  The CPU would have to do significant
> register renaming to break this.
As I could only speculate, I believe that any modern CPU executes RDTSC
as at least two separate steps, one is read from internal counter, and
second is the registers update. It seems that the first kind of action
is not serialized. I have no other explanation for the Jim findings.

I also asked Jim to test whether the cause the TSC sync test failure
is the lack of synchronization between gathering data and tasting it,
but ut appeared that the reason is genuine timecounter value going
backward.

Sp the bug seems real, and I cannot imagine we will live with the known
defect in timecounters which can step back.
>=20
> >This is fixed by Intel by
> >introduction of RDTSCP instruction, which is defined to be serialization
> >point, and use of which (instead of LFENCE; RDTSC sequence) also fixes
> >test, as confirmed by Jim.
>=20
> This is not a fix if it is full serialization.  It just gives slowness
> using a single instruction instead of a couple.
>=20
> >In fact, I now think that we should also apply the following patch.
> >Otherwise, consequtive calls to e.g. binuptime(9) could return decreased
> >time stamps. Note that libc __vdso_gettc.c already has LFENCE nearby the
> >tsc reads, which was done not for this reason, but apparently needed for
> >the reason too.
> >
> >diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> >index 085c339..229b351 100644
> >--- a/sys/x86/x86/tsc.c
> >+++ b/sys/x86/x86/tsc.c
> >@@ -594,6 +594,7 @@ static u_int
> >tsc_get_timecount(struct timecounter *tc __unused)
> >{
> >
> >+	rmb();
> >	return (rdtsc32());
> >}
>=20
> Please don't pessimize this further.  The time for rdtsc went from 6.5
> cycles on AthlonXP to 65 cycles on core2 (mainly for for
> P-state-invariance hardware synchronization I think).  Pretty soon it
> will be as slow as an HPET and heading towards an i8254.  Adding rmb()
> only makes it 12 cycles slower on core2, but 16 cycles (almost 3 times)
> slower on AthlonXP.
AthlonXP does not look as interesting target for optimizations. Fom what I
can find this is PIII-era CPU.

>=20
> >@@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
> >{
> >	uint32_t rv;
> >
> >+	rmb();
> >	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> >-	: "=3Da" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> >+	    : "=3Da" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> >	return (rv);
> >}
>=20
> The previous TSC-low/shrd pessimization adds only 2 cycles on AthlonXP
> and core2.  I think it only "works" by backing the TSC's resolution
> so low that it usually can't see its own, or at least other TSC's lack of
> serialness.  The shift count is usually 7 or 8, so the resolution is
> reduced from 1 cycle to 128 or 256.  Out of order times that fall in
> the same block of 128 or 256 cycles would appear to be the same, but
> out of order times like 129 and 127 would apear to be even more out
> of order after a shift of 7 turns them into 128 and 0.
>=20
> Bruce

--ruMf52VDn1nA85Y/
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAlAQLZwACgkQC3+MBN1Mb4i3zgCg8c5jgAPQcRXWCRODzG+BM4N5
ilcAoOxpsYuxNbviGLhASp7vhe6C0Xw5
=g12l
-----END PGP SIGNATURE-----

--ruMf52VDn1nA85Y/--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120725173212.GN2676>