Date: Sat, 18 Jun 2011 21:08:06 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jung-uk Kim <jkim@FreeBSD.org>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org
Subject: Re: svn commit: r223211 - head/sys/x86/x86
Message-ID: <20110618195655.M889@besplex.bde.org>
In-Reply-To: <201106172141.p5HLf6Rx009154@svn.freebsd.org>
References: <201106172141.p5HLf6Rx009154@svn.freebsd.org>
On Fri, 17 Jun 2011, Jung-uk Kim wrote:

> Log:
>   Teach the compiler how to shift TSC value efficiently.  As noted in r220631,
>   some times compiler inserts redundant instructions to preserve unused upper
>   32 bits even when it is casted to a 32-bit value.  Unfortunately, it seems
>   the problem becomes more serious when it is shifted, especially on amd64.

Er, I tried to point out how to optimize this code before (but didn't
reply to your reply), and it's not by using more asm.

> Modified: head/sys/x86/x86/tsc.c
> ==============================================================================
> --- head/sys/x86/x86/tsc.c	Fri Jun 17 21:31:13 2011	(r223210)
> +++ head/sys/x86/x86/tsc.c	Fri Jun 17 21:41:06 2011	(r223211)
> @@ -461,7 +461,7 @@ init_TSC_tc(void)
>  	tsc_timecounter.tc_quality = 1000;
>
>  init:
> -	for (shift = 0; shift < 32 && (tsc_freq >> shift) > max_freq; shift++)
> +	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
>  		;
>  	if (shift > 0) {
>  		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;

shift == 32 (or even shift == 31) is unreachable.  A shift of 31 will
shift 2 GHz down to 1 Hz, or support physically impossible frequencies
like 2**33 GHz.  OTOH, shifts of up to 63 are supported by the slow gcc
code.

> @@ -579,6 +579,9 @@ tsc_get_timecount(struct timecounter *tc
>  static u_int
>  tsc_get_timecount_low(struct timecounter *tc)
>  {
> +	uint32_t rv;
>
> -	return (rdtsc() >> (int)(intptr_t)tc->tc_priv);
> +	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> +	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");

Lexical style bug (indentation of the second line of the asm).

> +	return (rv);
>  }

Just return the shift of the low 32 bits (and change tc_counter_mask to
match), like I said.  This loses only the accidental ability for the
timecounter to work for more than a few seconds when interrupts are
stopped by something like ddb, since any shift count that loses too many
of the low 32 bits will not work for other reasons.

For example, suppose that the TSC frequency is 8G-1Hz, which is
unavailable except possibly in research labs.  This must be shifted by 1
to fit in 32 bits.  If we use only the low 32 bits, then we end up with
only 31 significant bits and tsc_get_timecount_low() wraps after ~2
seconds instead of after the best possible for this shift of ~4 seconds.
If we shift by 7 more, as we do in the SMP case, then if we start with
32 bits we end up with 24 bits, but the wrap still takes 2 seconds; if
we start with 64 bits we end up with 32 bits and the wrap takes
4*2**7 = 512 seconds.  But wrap times longer than a few times 1/HZ are
not needed.  2 seconds is already at least 100 or 1000 times longer than
needed, depending on HZ.

The case where the unscaled frequency is 4G-1Hz and !SMP gives a shift
count of 0 and a wrap time of ~4 seconds.  Whatever is done to make that
case work (say, not allowing a fully tickless kernel with HZ = 0) works
almost as well up to an unscaled frequency of 8 GHz, which is still far
off.

No one will notice these micro-optimizations, but I agree that the
redundant instructions are ugly.  I get the following on i386 for the
original version with an old source tree:

% #APP
% 	rdtsc
% #NO_APP
% 	movl	8(%ebp), %ecx
% 	movl	28(%ecx), %ecx
% 	shrdl	%edx, %eax
% 	shrl	%cl, %edx
% 	testb	$32, %cl
% 	je	.L3
% 	movl	%edx, %eax
% 	xorl	%edx, %edx
% .L3:

The last 4 instructions are not redundant, but are needed to support
shifts of up to 63 (maybe 64).
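Concretely, a minimal sketch of that alternative (keeping the existing
tc_priv plumbing from tsc.c; an illustration only, not the committed
code):

	static u_int
	tsc_get_timecount_low(struct timecounter *tc)
	{

		/*
		 * Truncate to the low 32 bits before shifting, so nothing
		 * in the upper half of the TSC has to be preserved or
		 * shifted at all.
		 */
		return ((u_int)rdtsc() >> (int)(intptr_t)tc->tc_priv);
	}

with the matching mask set in init_TSC_tc() when shift > 0:

	/* Only (32 - shift) significant bits survive the shift. */
	tsc_timecounter.tc_counter_mask = 0xffffffffu >> shift;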
I tried masking the shift count with 0x1f so that the shift count is
known to be < 32, but this just gave an extra instruction for the
masking.  It's even worse with rdtsc() converted to u_int first like I
want:

% 	movl	%ebx, (%esp)
% 	movl	%esi, 4(%esp)
% #APP
% 	rdtsc
% #NO_APP
% 	movl	%eax, %ebx
% 	movl	8(%ebp), %eax
% 	movl	4(%esp), %esi
% 	movl	28(%eax), %ecx
% 	movl	%ebx, %eax
% 	movl	(%esp), %ebx
% 	# some frame pointer epilogue reordered here
% 	shrl	%cl, %eax

The second case may be what you already fixed on amd64 (only?) -- use
rdtsc32() instead of (u_int)rdtsc().

I've always thought that the dynamic shift is overengineered, and now I
like it even less.  The following is efficient and works well enough in
all currently physically possible cases:

% /*
%  * Don't really need a separate one for `low', but now it costs less
%  * (1 shift instruction at runtime and some space).  Must change
%  * tc_counter_mask to match.
%  */
% u_int
% tsc_get_timecount_low(struct timecounter *tc)
% {
% #ifdef SMP
% 	/*
% 	 * Works up to 1024 GHz, assuming that nontemporalness scales with
% 	 * freq.  I think 8 is too many.  But now do extra for SMP indep.
% 	 * of freq.
% 	 */
% 	return (((u_int)rdtsc()) >> 8);	/* gens rdtsc; shrl $8,%eax */
% #else
% 	/* Works up to 8 GHz. */
% 	return (((u_int)rdtsc()) >> 1);	/* gens rdtsc; shrl %eax */
% #endif
% }

Bruce
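The tc_counter_mask change that the comment above calls for is not
spelled out; with those fixed shifts (8 for SMP, 1 otherwise) it would
presumably be set in init_TSC_tc() along these lines (a sketch under the
same assumptions, not tested code):

	tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
#ifdef SMP
	/* A shift of 8 leaves 24 significant bits. */
	tsc_timecounter.tc_counter_mask = 0xffffffffu >> 8;
#else
	/* A shift of 1 leaves 31 significant bits. */
	tsc_timecounter.tc_counter_mask = 0xffffffffu >> 1;
#endif

Either way, the per-read cost is the single shrl noted in the comments
above.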