Skip site navigation (1)Skip section navigation (2)
Date:      Wed, 25 Jul 2012 08:29:57 -0700
From:      Jim Harris <jim.harris@gmail.com>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org, Andriy Gapon <avg@freebsd.org>
Subject:   Re: svn commit: r238755 - head/sys/x86/x86
Message-ID:  <CAJP=Hc_9ehs9OVemqYxWqQnasBF1VH8N4V3xLfJHuDeESSU08w@mail.gmail.com>
In-Reply-To: <20120725133754.GK2676@deviant.kiev.zoral.com.ua>
References:  <201207242210.q6OMACqV079603@svn.freebsd.org> <500F9E22.4080608@FreeBSD.org> <20120725102130.GH2676@deviant.kiev.zoral.com.ua> <500FE6AE.8070706@FreeBSD.org> <20120725133754.GK2676@deviant.kiev.zoral.com.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
On Wed, Jul 25, 2012 at 6:37 AM, Konstantin Belousov
<kostikbel@gmail.com> wrote:
> On Wed, Jul 25, 2012 at 03:29:34PM +0300, Andriy Gapon wrote:
>> on 25/07/2012 13:21 Konstantin Belousov said the following:
>> > On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
>> >> on 25/07/2012 01:10 Jim Harris said the following:
>> >>> Author: jimharris
>> >>> Date: Tue Jul 24 22:10:11 2012
>> >>> New Revision: 238755
>> >>> URL: http://svn.freebsd.org/changeset/base/238755
>> >>>
>> >>> Log:
>> >>>   Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
>> >>>
>> >>>   Intel Architecture Manual specifies that rdtsc instruction is not serialized,
>> >>>   so without this change, TSC synchronization test would periodically fail,
>> >>>   resulting in use of HPET timecounter instead of TSC-low.  This caused
>> >>>   severe performance degradation (40-50%) when running high IO/s workloads due to
>> >>>   HPET MMIO reads and GEOM stat collection.
>> >>>
>> >>>   Tests on Xeon E5-2600 (Sandy Bridge) 8C systems were seeing TSC synchronization
>> >>>   fail approximately 20% of the time.
>> >>
>> >> Should rather the synchronization test be fixed if it's the culprit?
>> > Synchronization test for what ?
>>
>> The synchronization test mentioned above.
>> So, oops, very sorry - I missed the fact that the change was precisely in the
>> test.  I confused it for another place where tsc is used.  Thank you for pointing
>> this out.
>>
>> >> Or is this change universally good for the real uses of TSC?
>> >
>> > What I understood from the Intel SDM, and also from additional experiments
>> > which Jim kindly made despite me being annoying as usual, is that 'read
>> > memory barrier' AKA LFENCE there is used for its secondary implementation
>> > effects, not for load/load barrier as you might assume.
>> >
>> > According to SDM, LFENCE fully drains execution pipeline (but comparing
>> > with MFENCE, does not drain write buffers). The result is that RDTSC is
>> > not started before previous instructions are finished.
>>
>> Yes, I am fully aware of this.
>>
>> > For tsc test, this means that after the change RDTSC executions are not
>> > reordered on the single core among themself. As I understand, CPU has
>> > no dependency noted between two reads of tsc by RDTSC, which allows
>> > later read to give lower value of counter. This is fixed by Intel by
>> > introduction of RDTSCP instruction, which is defined to be serialization
>> > point, and use of which (instead of LFENCE; RDTSC sequence) also fixes
>> > test, as confirmed by Jim.
>>
>> Yes.  I think that previously Intel recommended to precede rdtsc with cpuid for
>> all the same reasons.  Not sure if there is any difference performance-wise
>> comparing to lfence.
>> Unfortunately, rdtscp is not available on all CPUs, so using it would require
>> extra work.
>>
>> > In fact, I now think that we should also apply the following patch.
>> > Otherwise, consequtive calls to e.g. binuptime(9) could return decreased
>> > time stamps. Note that libc __vdso_gettc.c already has LFENCE nearby the
>> > tsc reads, which was done not for this reason, but apparently needed for
>> > the reason too.
>> >
>> > diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
>> > index 085c339..229b351 100644
>> > --- a/sys/x86/x86/tsc.c
>> > +++ b/sys/x86/x86/tsc.c
>> > @@ -594,6 +594,7 @@ static u_int
>> >  tsc_get_timecount(struct timecounter *tc __unused)
>> >  {
>> >
>> > +   rmb();
>> >     return (rdtsc32());
>> >  }
>> >
>>
>> This makes sense to me.  We probably want correctness over performance here.
>> [BTW, I originally thought that the change was here; brain malfunction]
>>
>> > @@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
>> >  {
>> >     uint32_t rv;
>> >
>> > +   rmb();
>> >     __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
>> > -   : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> > +       : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> >     return (rv);
>> >  }
>> >
>>
>> It would correct here too, but not sure if it would make any difference given that
>> some lower bits are discarded anyway.  Probably depends on exact CPU.
>>
>>
>> And, oh hmm, I read AMD Software Optimization Guide for AMD Family 10h
>> Processors and they suggest using cpuid (with a note that it may be
>> intercepted in virtualized environments) or _mfence_ in the discussed
>> role (Appendix F of the document). Googling for 'rdtsc mfence lfence'
>> yields some interesting results.
> Yes, MFENCE for AMD.
>
> Since I was infected with these Google results anyway, I looked at the
> Linux code. Apparently, they use MFENCE on amd, and LFENCE on Intel.
> They also use LFENCE on VIA, it seems. Intel documentation claims that
> MFENCE does not serialize instruction execution, which is contrary to
> used LFENCE behaviour.
>
> So we definitely want to add some barrier right before rdtsc. And we do
> want LFENCE for Intels. Patch below ends with the following code:
>
> Dump of assembler code for function tsc_get_timecount_lfence:
>    0xffffffff805563a0 <+0>:     push   %rbp
>    0xffffffff805563a1 <+1>:     mov    %rsp,%rbp
>    0xffffffff805563a4 <+4>:     lfence
>    0xffffffff805563a7 <+7>:     rdtsc
>    0xffffffff805563a9 <+9>:     leaveq
>    0xffffffff805563aa <+10>:    retq
> Dump of assembler code for function tsc_get_timecount_low_lfence:
>    0xffffffff805556a0 <+0>:     push   %rbp
>    0xffffffff805556a1 <+1>:     mov    %rsp,%rbp
>    0xffffffff805556a4 <+4>:     lfence
>    0xffffffff805556a7 <+7>:     mov    0x30(%rdi),%rcx
>    0xffffffff805556ab <+11>:    rdtsc
>    0xffffffff805556ad <+13>:    shrd   %cl,%edx,%eax
>    0xffffffff805556b0 <+16>:    leaveq
>    0xffffffff805556b1 <+17>:    retq
> which I would qualify as perfect outcome.
>
> Patch below still leaves libc only suitable for Intels and VIA, not for AMD,
> but I will take care of it later.
>
> diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> index c253a96..06cfe10 100644
> --- a/sys/x86/x86/tsc.c
> +++ b/sys/x86/x86/tsc.c
> @@ -83,6 +83,10 @@ static void tsc_freq_changing(void *arg, const struct cf_level *level,
>      int *status);
>  static unsigned tsc_get_timecount(struct timecounter *tc);
>  static unsigned tsc_get_timecount_low(struct timecounter *tc);
> +static unsigned tsc_get_timecount_lfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_mfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc);
>  static void tsc_levels_changed(void *arg, int unit);
>
>  static struct timecounter tsc_timecounter = {
> @@ -262,6 +266,10 @@ probe_tsc_freq(void)
>                     (vm_guest == VM_GUEST_NO &&
>                     CPUID_TO_FAMILY(cpu_id) >= 0x10))
>                         tsc_is_invariant = 1;
> +               if (cpu_feature & CPUID_SSE2) {
> +                       tsc_timecounter.tc_get_timecount =
> +                           tsc_get_timecount_mfence;
> +               }
>                 break;
>         case CPU_VENDOR_INTEL:
>                 if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 ||
> @@ -271,6 +279,10 @@ probe_tsc_freq(void)
>                     (CPUID_TO_FAMILY(cpu_id) == 0xf &&
>                     CPUID_TO_MODEL(cpu_id) >= 0x3))))
>                         tsc_is_invariant = 1;
> +               if (cpu_feature & CPUID_SSE2) {
> +                       tsc_timecounter.tc_get_timecount =
> +                           tsc_get_timecount_lfence;
> +               }
>                 break;
>         case CPU_VENDOR_CENTAUR:
>                 if (vm_guest == VM_GUEST_NO &&
> @@ -278,6 +290,10 @@ probe_tsc_freq(void)
>                     CPUID_TO_MODEL(cpu_id) >= 0xf &&
>                     (rdmsr(0x1203) & 0x100000000ULL) == 0)
>                         tsc_is_invariant = 1;
> +               if (cpu_feature & CPUID_SSE2) {
> +                       tsc_timecounter.tc_get_timecount =
> +                           tsc_get_timecount_lfence;
> +               }
>                 break;
>         }
>
> @@ -328,7 +344,15 @@ init_TSC(void)
>
>  #ifdef SMP
>
> -/* rmb is required here because rdtsc is not a serializing instruction. */
> +/*
> + * RDTSC is not a serializing instruction, so we need to drain
> + * instruction stream before executing it. It could be fixed by use of
> + * RDTSCP, except the instruction is not available everywhere.
> + *
> + * Insert both MFENCE for AMD CPUs, and LFENCE for others (Intel and
> + * VIA), and assume that SMP test is only performed on CPUs that have
> + * SSE2 anyway.
> + */
>  #define        TSC_READ(x)                     \
>  static void                            \
>  tsc_read_##x(void *arg)                        \
> @@ -337,6 +361,7 @@ tsc_read_##x(void *arg)                     \
>         u_int cpu = PCPU_GET(cpuid);    \
>                                         \
>         rmb();                          \
> +       mb();                           \
>         tsc[cpu * 3 + x] = rdtsc32();   \

I've seen bde@'s comments, so perhaps this patch will not move
forward, but I'm wondering if it would make sense here to just call
the new tsc_get_timecount_mfence() function rather than explicitly
call mb() and then rdtsc32().

>  }
>  TSC_READ(0)
> @@ -487,7 +512,16 @@ init:
>         for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
>                 ;
>         if (shift > 0) {
> -               tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
> +               if (cpu_feature & CPUID_SSE2) {
> +                       if (cpu_vendor_id == CPU_VENDOR_AMD) {
> +                               tsc_timecounter.tc_get_timecount =
> +                                   tsc_get_timecount_low_mfence;
> +                       } else {
> +                               tsc_timecounter.tc_get_timecount =
> +                                   tsc_get_timecount_low_lfence;
> +                       }
> +               } else
> +                       tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
>                 tsc_timecounter.tc_name = "TSC-low";
>                 if (bootverbose)
>                         printf("TSC timecounter discards lower %d bit(s)\n",
> @@ -592,23 +626,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
>  SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
>      0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
>
> -static u_int
> +static inline u_int
>  tsc_get_timecount(struct timecounter *tc __unused)
>  {
>
>         return (rdtsc32());
>  }
>
> -static u_int
> +static inline u_int
>  tsc_get_timecount_low(struct timecounter *tc)
>  {
>         uint32_t rv;
>
>         __asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> -       : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> +           : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>         return (rv);
>  }
>
> +static inline u_int
> +tsc_get_timecount_lfence(struct timecounter *tc __unused)
> +{
> +
> +       rmb();
> +       return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_lfence(struct timecounter *tc)
> +{
> +
> +       rmb();
> +       return (tsc_get_timecount_low(tc));
> +}
> +
> +static inline u_int
> +tsc_get_timecount_mfence(struct timecounter *tc __unused)
> +{
> +
> +       mb();
> +       return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_mfence(struct timecounter *tc)
> +{
> +
> +       mb();
> +       return (tsc_get_timecount_low(tc));
> +}
> +
>  uint32_t
>  cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
>  {



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJP=Hc_9ehs9OVemqYxWqQnasBF1VH8N4V3xLfJHuDeESSU08w>