From owner-svn-src-all@FreeBSD.ORG Wed Jul 25 15:29:59 2012
From: Jim Harris <jim.harris@gmail.com>
To: Konstantin Belousov
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
	src-committers@freebsd.org, Andriy Gapon
Date: Wed, 25 Jul 2012 08:29:57 -0700
In-Reply-To: <20120725133754.GK2676@deviant.kiev.zoral.com.ua>
References: <201207242210.q6OMACqV079603@svn.freebsd.org>
	<500F9E22.4080608@FreeBSD.org>
	<20120725102130.GH2676@deviant.kiev.zoral.com.ua>
	<500FE6AE.8070706@FreeBSD.org>
	<20120725133754.GK2676@deviant.kiev.zoral.com.ua>
Subject: Re: svn commit: r238755 - head/sys/x86/x86
On Wed, Jul 25, 2012 at 6:37 AM, Konstantin Belousov wrote:
> On Wed, Jul 25, 2012 at 03:29:34PM +0300, Andriy Gapon wrote:
>> on 25/07/2012 13:21 Konstantin Belousov said the following:
>> > On Wed, Jul 25, 2012 at 10:20:02AM +0300, Andriy Gapon wrote:
>> >> on 25/07/2012 01:10 Jim Harris said the following:
>> >>> Author: jimharris
>> >>> Date: Tue Jul 24 22:10:11 2012
>> >>> New Revision: 238755
>> >>> URL: http://svn.freebsd.org/changeset/base/238755
>> >>>
>> >>> Log:
>> >>>   Add rmb() to tsc_read_##x to enforce serialization of rdtsc captures.
>> >>>
>> >>>   The Intel Architecture Manual specifies that the rdtsc instruction is
>> >>>   not serialized, so without this change the TSC synchronization test
>> >>>   would periodically fail, resulting in use of the HPET timecounter
>> >>>   instead of TSC-low.  This caused severe performance degradation
>> >>>   (40-50%) when running high-IO/s workloads, due to HPET MMIO reads and
>> >>>   GEOM stat collection.
>> >>>
>> >>>   Tests on Xeon E5-2600 (Sandy Bridge) 8C systems saw the TSC
>> >>>   synchronization test fail approximately 20% of the time.
>> >>
>> >> Shouldn't the synchronization test itself be fixed, if it's the culprit?
>> > Synchronization test for what?
>>
>> The synchronization test mentioned above.
>> So, oops, very sorry - I missed the fact that the change was precisely in
>> the test.  I confused it with another place where tsc is used.  Thank you
>> for pointing this out.
>>
>> >> Or is this change universally good for the real uses of TSC?
>> >
>> > What I understood from the Intel SDM, and also from additional
>> > experiments which Jim kindly made despite me being annoying as usual, is
>> > that the 'read memory barrier', AKA LFENCE, is used there for its
>> > secondary implementation effects, not for the load/load barrier you
>> > might assume.
>> >
>> > According to the SDM, LFENCE fully drains the execution pipeline (but,
>> > compared with MFENCE, does not drain write buffers).  The result is that
>> > RDTSC does not start before previous instructions have finished.
>>
>> Yes, I am fully aware of this.
>>
>> > For the tsc test, this means that after the change RDTSC executions are
>> > not reordered on a single core among themselves.  As I understand it,
>> > the CPU notes no dependency between two reads of the TSC by RDTSC, which
>> > allows the later read to return a lower counter value.  This was fixed
>> > by Intel with the introduction of the RDTSCP instruction, which is
>> > defined to be a serialization point; using it (instead of the LFENCE;
>> > RDTSC sequence) also fixes the test, as confirmed by Jim.
>>
>> Yes.  I think that previously Intel recommended preceding rdtsc with cpuid
>> for all the same reasons.  Not sure if there is any difference
>> performance-wise compared to lfence.
>> Unfortunately, rdtscp is not available on all CPUs, so using it would
>> require extra work.
>>
>> > In fact, I now think that we should also apply the following patch.
>> > Otherwise, consecutive calls to e.g. binuptime(9) could return
>> > decreasing time stamps.  Note that libc __vdso_gettc.c already has
>> > LFENCE near the tsc reads, which was done not for this reason, but is
>> > apparently needed for this reason too.
>> >
>> > diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
>> > index 085c339..229b351 100644
>> > --- a/sys/x86/x86/tsc.c
>> > +++ b/sys/x86/x86/tsc.c
>> > @@ -594,6 +594,7 @@ static u_int
>> >  tsc_get_timecount(struct timecounter *tc __unused)
>> >  {
>> >
>> > +	rmb();
>> >  	return (rdtsc32());
>> >  }
>> >
>>
>> This makes sense to me.  We probably want correctness over performance
>> here.  [BTW, I originally thought that the change was here; brain
>> malfunction]
>>
>> > @@ -602,8 +603,9 @@ tsc_get_timecount_low(struct timecounter *tc)
>> >  {
>> >  	uint32_t rv;
>> >
>> > +	rmb();
>> >  	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
>> > -	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> > +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>> >  	return (rv);
>> >  }
>> >
>>
>> It would be correct here too, but I'm not sure it would make any
>> difference, given that some lower bits are discarded anyway.  Probably
>> depends on the exact CPU.
>>
>> And, oh hmm, I read the AMD Software Optimization Guide for AMD Family
>> 10h Processors, and it suggests using cpuid (with a note that it may be
>> intercepted in virtualized environments) or _mfence_ in the discussed
>> role (Appendix F of the document).  Googling for 'rdtsc mfence lfence'
>> yields some interesting results.
> Yes, MFENCE for AMD.
>
> Since I was infected with these Google results anyway, I looked at the
> Linux code.  Apparently, they use MFENCE on AMD, and LFENCE on Intel.
> They also use LFENCE on VIA, it seems.  Intel documentation claims that
> MFENCE does not serialize instruction execution, which is contrary to the
> LFENCE behaviour relied on here.
>
> So we definitely want to add some barrier right before rdtsc.  And we do
> want LFENCE for Intels.
Patch below ends with the following code:
>
> Dump of assembler code for function tsc_get_timecount_lfence:
>    0xffffffff805563a0 <+0>:	push   %rbp
>    0xffffffff805563a1 <+1>:	mov    %rsp,%rbp
>    0xffffffff805563a4 <+4>:	lfence
>    0xffffffff805563a7 <+7>:	rdtsc
>    0xffffffff805563a9 <+9>:	leaveq
>    0xffffffff805563aa <+10>:	retq
> Dump of assembler code for function tsc_get_timecount_low_lfence:
>    0xffffffff805556a0 <+0>:	push   %rbp
>    0xffffffff805556a1 <+1>:	mov    %rsp,%rbp
>    0xffffffff805556a4 <+4>:	lfence
>    0xffffffff805556a7 <+7>:	mov    0x30(%rdi),%rcx
>    0xffffffff805556ab <+11>:	rdtsc
>    0xffffffff805556ad <+13>:	shrd   %cl,%edx,%eax
>    0xffffffff805556b0 <+16>:	leaveq
>    0xffffffff805556b1 <+17>:	retq
> which I would call a perfect outcome.
>
> The patch below still leaves libc suitable only for Intel and VIA, not for
> AMD, but I will take care of that later.
>
> diff --git a/sys/x86/x86/tsc.c b/sys/x86/x86/tsc.c
> index c253a96..06cfe10 100644
> --- a/sys/x86/x86/tsc.c
> +++ b/sys/x86/x86/tsc.c
> @@ -83,6 +83,10 @@ static void tsc_freq_changing(void *arg, const struct cf_level *level,
>  		    int *status);
>  static unsigned tsc_get_timecount(struct timecounter *tc);
>  static unsigned tsc_get_timecount_low(struct timecounter *tc);
> +static unsigned tsc_get_timecount_lfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_low_lfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_mfence(struct timecounter *tc);
> +static unsigned tsc_get_timecount_low_mfence(struct timecounter *tc);
>  static void tsc_levels_changed(void *arg, int unit);
>
>  static struct timecounter tsc_timecounter = {
> @@ -262,6 +266,10 @@ probe_tsc_freq(void)
>  		    (vm_guest == VM_GUEST_NO &&
>  		    CPUID_TO_FAMILY(cpu_id) >= 0x10))
>  			tsc_is_invariant = 1;
> +		if (cpu_feature & CPUID_SSE2) {
> +			tsc_timecounter.tc_get_timecount =
> +			    tsc_get_timecount_mfence;
> +		}
>  		break;
>  	case CPU_VENDOR_INTEL:
>  		if ((amd_pminfo & AMDPM_TSC_INVARIANT) != 0 ||
>  		    (CPUID_TO_FAMILY(cpu_id) == 0xf &&
>  		    CPUID_TO_MODEL(cpu_id) >= 0x3))))
>  			tsc_is_invariant = 1;
> +		if (cpu_feature & CPUID_SSE2) {
> +			tsc_timecounter.tc_get_timecount =
> +			    tsc_get_timecount_lfence;
> +		}
>  		break;
>  	case CPU_VENDOR_CENTAUR:
>  		if (vm_guest == VM_GUEST_NO &&
> @@ -278,6 +290,10 @@ probe_tsc_freq(void)
>  		    CPUID_TO_MODEL(cpu_id) >= 0xf &&
>  		    (rdmsr(0x1203) & 0x100000000ULL) == 0)
>  			tsc_is_invariant = 1;
> +		if (cpu_feature & CPUID_SSE2) {
> +			tsc_timecounter.tc_get_timecount =
> +			    tsc_get_timecount_lfence;
> +		}
>  		break;
>  	}
>
> @@ -328,7 +344,15 @@ init_TSC(void)
>
>  #ifdef SMP
>
> -/* rmb is required here because rdtsc is not a serializing instruction. */
> +/*
> + * RDTSC is not a serializing instruction, so we need to drain the
> + * instruction stream before executing it.  That could be fixed by using
> + * RDTSCP, except the instruction is not available everywhere.
> + *
> + * Use MFENCE for AMD CPUs, and LFENCE for others (Intel and VIA), and
> + * assume that the SMP test is only performed on CPUs that have SSE2
> + * anyway.
> + */
> #define	TSC_READ(x)			\
> static void				\
> tsc_read_##x(void *arg)			\
> @@ -337,6 +361,7 @@ tsc_read_##x(void *arg)	\
>  	u_int cpu = PCPU_GET(cpuid);	\
>  					\
>  	rmb();				\
> +	mb();				\
>  	tsc[cpu * 3 + x] = rdtsc32();	\

I've seen bde@'s comments, so perhaps this patch will not move forward, but
I'm wondering if it would make sense here to just call the new
tsc_get_timecount_mfence() function rather than explicitly calling mb() and
then rdtsc32().
>  }
>  TSC_READ(0)
> @@ -487,7 +512,16 @@ init:
>  	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
>  		;
>  	if (shift > 0) {
> -		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
> +		if (cpu_feature & CPUID_SSE2) {
> +			if (cpu_vendor_id == CPU_VENDOR_AMD) {
> +				tsc_timecounter.tc_get_timecount =
> +				    tsc_get_timecount_low_mfence;
> +			} else {
> +				tsc_timecounter.tc_get_timecount =
> +				    tsc_get_timecount_low_lfence;
> +			}
> +		} else
> +			tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
>  		tsc_timecounter.tc_name = "TSC-low";
>  		if (bootverbose)
>  			printf("TSC timecounter discards lower %d bit(s)\n",
> @@ -592,23 +626,55 @@ sysctl_machdep_tsc_freq(SYSCTL_HANDLER_ARGS)
>  SYSCTL_PROC(_machdep, OID_AUTO, tsc_freq, CTLTYPE_U64 | CTLFLAG_RW,
>      0, 0, sysctl_machdep_tsc_freq, "QU", "Time Stamp Counter frequency");
>
> -static u_int
> +static inline u_int
>  tsc_get_timecount(struct timecounter *tc __unused)
>  {
>
>  	return (rdtsc32());
>  }
>
> -static u_int
> +static inline u_int
>  tsc_get_timecount_low(struct timecounter *tc)
>  {
>  	uint32_t rv;
>
>  	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> -	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
> +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");
>  	return (rv);
>  }
>
> +static inline u_int
> +tsc_get_timecount_lfence(struct timecounter *tc __unused)
> +{
> +
> +	rmb();
> +	return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_lfence(struct timecounter *tc)
> +{
> +
> +	rmb();
> +	return (tsc_get_timecount_low(tc));
> +}
> +
> +static inline u_int
> +tsc_get_timecount_mfence(struct timecounter *tc __unused)
> +{
> +
> +	mb();
> +	return (rdtsc32());
> +}
> +
> +static inline u_int
> +tsc_get_timecount_low_mfence(struct timecounter *tc)
> +{
> +
> +	mb();
> +	return (tsc_get_timecount_low(tc));
> +}
> +
> uint32_t
> cpu_fill_vdso_timehands(struct vdso_timehands *vdso_th)
> {