Date:      Sat, 18 Jun 2011 21:08:06 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Jung-uk Kim <jkim@FreeBSD.org>
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org
Subject:   Re: svn commit: r223211 - head/sys/x86/x86
Message-ID:  <20110618195655.M889@besplex.bde.org>
In-Reply-To: <201106172141.p5HLf6Rx009154@svn.freebsd.org>
References:  <201106172141.p5HLf6Rx009154@svn.freebsd.org>

On Fri, 17 Jun 2011, Jung-uk Kim wrote:

> Log:
>  Teach the compiler how to shift the TSC value efficiently.  As noted in
>  r220631, sometimes the compiler inserts redundant instructions to preserve
>  the unused upper 32 bits even when it is cast to a 32-bit value.
>  Unfortunately, it seems the problem becomes more serious when it is
>  shifted, especially on amd64.

Er, I tried to point out how to optimize this code before (but didn't
reply to your reply), and it's not by using more asm.

> Modified: head/sys/x86/x86/tsc.c
> ==============================================================================
> --- head/sys/x86/x86/tsc.c	Fri Jun 17 21:31:13 2011	(r223210)
> +++ head/sys/x86/x86/tsc.c	Fri Jun 17 21:41:06 2011	(r223211)
> @@ -461,7 +461,7 @@ init_TSC_tc(void)
> 		tsc_timecounter.tc_quality = 1000;
>
> init:
> -	for (shift = 0; shift < 32 && (tsc_freq >> shift) > max_freq; shift++)
> +	for (shift = 0; shift < 31 && (tsc_freq >> shift) > max_freq; shift++)
> 		;
> 	if (shift > 0) {
> 		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;

shift == 32 (or even shift == 31) is unreachable.  A shift of 31 will shift
2GHz down to 1 Hz, or support physically impossible frequencies like
2**33 GHz.  OTOH, shifts of up to 63 are supported by the slow gcc code.
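
To put rough numbers on why 31 is unreachable (assuming max_freq in this
code is on the order of 2**32):

	shift == 31 requires (tsc_freq >> 30) > max_freq,
	i.e. tsc_freq > 2**62 Hz ~= 4.6 EHz.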

> @@ -579,6 +579,9 @@ tsc_get_timecount(struct timecounter *tc
> static u_int
> tsc_get_timecount_low(struct timecounter *tc)
> {
> +	uint32_t rv;
>
> -	return (rdtsc() >> (int)(intptr_t)tc->tc_priv);
> +	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
> +	: "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");

Lexical style bug (indentation of second line of the asm).
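
Presumably it should look something like this instead (style(9)
continuation indent, whitespace change only):

% 	__asm __volatile("rdtsc; shrd %%cl, %%edx, %0"
% 	    : "=a" (rv) : "c" ((int)(intptr_t)tc->tc_priv) : "edx");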

> +	return (rv);
> }

Just return the shift of the low 32 bits (and change tc_counter_mask to
match) like I said.  This loses only the accidental ability for the
timecounter to work for more than a few seconds when interrupts are
stopped by something like ddb, since any shift count that loses too
many of the low 32 bits will not work for other reasons.  For example,
suppose that the TSC frequency is 8G-1Hz, which is unavailable except
possibly in research labs.  This must be shifted by 1 to fit in 32
bits.  If we use only the low 32 bits, then we end up with only 31
significant bits and tsc_get_timecount_low() wraps after ~2 seconds
instead of after the best possible for this shift of ~4 seconds.  If
we shift by 7 more, as we do in the SMP case, then if we start with
32 bits then we end up with 24 bits, but the wrap still takes 2
seconds; if we start with 64 bits then we end up with 32 bits and the
wrap takes 4*2**7 = 512 seconds.  But wrap times longer than 1/HZ times
a few are not needed.  2 seconds is already at least 100 or 1000 times
longer than needed, depending on HZ.  The case where the unscaled
frequency is 4G-1Hz and !SMP gives a shift count of 0 and a wrap time
of ~4 seconds.  Whatever is done to make that case work (say, not allowing
a fully tickless kernel with HZ = 0) works almost as well up to an
unscaled frequency of 8GHz, which is still far off.
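
For concreteness, "change tc_counter_mask to match" just means advertising
only the 32 - shift bits that survive, something like (untested sketch,
not the committed code):

% 	if (shift > 0) {
% 		tsc_timecounter.tc_get_timecount = tsc_get_timecount_low;
% 		tsc_timecounter.tc_counter_mask = ~0u >> shift;
% 	}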

No one will notice these micro-optimizations, but I agree that the
redundant instructions are ugly.  I get the following on i386 for the
original version with an old source tree:

% #APP
% 	rdtsc
% #NO_APP
% 	movl	8(%ebp), %ecx
% 	movl	28(%ecx), %ecx
% 	shrdl	%edx, %eax
% 	shrl	%cl, %edx
% 	testb	$32, %cl
% 	je	.L3
% 	movl	%edx, %eax
% 	xorl	%edx, %edx
% .L3:

The last 4 instructions are not redundant, but are needed to support
shifts of up to 63 (maybe 64).

I tried masking the shift count with 0x1f so that the shift count is
known to be < 32, but this just gave an extra instruction for the masking.
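
(The masked variant was presumably along these lines:

% 	return (rdtsc() >> ((int)(intptr_t)tc->tc_priv & 0x1f));

with the extra instruction being an andl for the & 0x1f.)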

It's even worse with rdtsc() converted to u_int first like I want:

% 	movl	%ebx, (%esp)
% 	movl	%esi, 4(%esp)
% #APP
% 	rdtsc
% #NO_APP
% 	movl	%eax, %ebx
% 	movl	8(%ebp), %eax
% 	movl	4(%esp), %esi
% 	movl	28(%eax), %ecx
% 	movl	%ebx, %eax
% 	movl	(%esp), %ebx
% 	# some frame pointer epilogue reordered here
% 	shrl	%cl, %eax

The second case may be what you already fixed on amd64 (only?) -- use
rdtsc32() instead of (u_int)rdtsc().
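
For reference, rdtsc32() asks the compiler for only %eax and declares %edx
clobbered, so no 64-bit intermediate is ever formed; it is roughly:

% static __inline uint32_t
% rdtsc32(void)
% {
% 	uint32_t rv;
%
% 	__asm __volatile("rdtsc" : "=a" (rv) : : "edx");
% 	return (rv);
% }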

I've always thought that the dynamic shift is overengineered, and now I
like it even less.  The following is efficient and works well enough in all
currently physically possible cases:

% /* 
%  * Don't really need a separate one for `low', but now it costs less
%  * (1 shift instruction at runtime and some space).  Must change
%  * tc_counter_mask to match.
%  */
% u_int
% tsc_get_timecount_low(struct timecounter *tc)
% {
% #ifdef SMP
% 	/*
% 	 * Works up to 1024 GHz, assuming that nontemporalness scales with
% 	 * freq.  I think 8 is too many.  But now do extra for SMP indep.
% 	 * of freq.
% 	 */
% 	return (((u_int)rdtsc()) >> 8);	/* gens rdtsc; shrl $8,%eax */
% #else
% 	/* Works up to 8 GHz. */
% 	return (((u_int)rdtsc()) >> 1);	/* gens rdtsc; shrl %eax */
% #endif
% }

Bruce


