Date:      Sun, 5 Apr 2015 16:55:23 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Eitan Adler <eadler@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r281103 - head/sys/amd64/amd64
Message-ID:  <20150405163305.A2515@besplex.bde.org>
In-Reply-To: <201504050518.t355IFVJ001786@svn.freebsd.org>
References:  <201504050518.t355IFVJ001786@svn.freebsd.org>

On Sun, 5 Apr 2015, Eitan Adler wrote:

> Log:
>  adrian asked me to revert and get more testing
>
> Modified:
>  head/sys/amd64/amd64/support.S
>
> Modified: head/sys/amd64/amd64/support.S
> ==============================================================================
> --- head/sys/amd64/amd64/support.S	Sun Apr  5 05:14:20 2015	(r281102)
> +++ head/sys/amd64/amd64/support.S	Sun Apr  5 05:18:14 2015	(r281103)
> @@ -73,11 +73,7 @@ ENTRY(pagezero)
> 	movnti	%rax,8(%rdi,%rdx)
> 	movnti	%rax,16(%rdi,%rdx)
> 	movnti	%rax,24(%rdi,%rdx)
> -	movnti	%rax,32(%rdi,%rdx)
> -	movnti	%rax,40(%rdi,%rdx)
> -	movnti	%rax,48(%rdi,%rdx)
> -	movnti	%rax,56(%rdi,%rdx)
> -	addq	$64,%rdx
> +	addq	$32,%rdx
> 	jne	1b
> 	sfence
> 	POP_FRAME_POINTER

My tests show that such changes make no difference unless you throttle
the CPU to make the memory fast in comparison.

Counting cycles shows that no unrolling is useful on a 4GHz CPU with
memory slower than 16GB/sec.  But there may be complications for
write buffering.  8 bytes at a time is too small, so although movnti
bypasses the caches it must go through write buffers to combine writes.
Then writing in groups of the same size as the write buffer may be
best.
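
The cycle count above can be checked with a little arithmetic.  What
follows is only a sketch: it assumes "GB" means 10^9 bytes and that each
store is one 8-byte movnti, and the function name is made up for
illustration.

```c
/*
 * Cycles available per store if the zeroing loop is only to keep up
 * with memory.  hz and bytes_per_sec are the 4GHz and 16GB/sec figures
 * from the text; store_bytes is 8 for a 64-bit movnti.  The name and
 * the interface are illustrative assumptions.
 */
static double
cycles_per_store(double hz, double bytes_per_sec, double store_bytes)
{
	/* How many stores per second the memory can absorb. */
	double stores_per_sec = bytes_per_sec / store_bytes;

	/* CPU cycles that pass between two such stores. */
	return (hz / stores_per_sec);
}
```

With the figures above, cycles_per_store(4e9, 16e9, 8) is 2.  On this
reading the memory takes at most one 8-byte store every 2 cycles, so
reducing the addq/jne overhead per store by unrolling cannot help once
memory is the bottleneck.  Throttling the CPU to 1GHz gives 0.5, which
is the case where the loop overhead starts to matter.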

Tests on ref11-amd64 now show no significant difference between 4-way,
1-way, 8-way and 2-way unrolling.  4-way seems to be insignificantly
slowest and 2-way and 8-way insignificantly equally fastest.
"Insignificantly" means less than 2% on a micro-benchmark but there
is some variance which I wasn't careful to determine.
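
For anyone who wants to repeat this kind of comparison in userland, here
is a minimal sketch of 1-way and 4-way variants in C.  It is not the
kernel code and not the test program behind the numbers above.  It
assumes an amd64 compiler that provides the SSE2 intrinsic
_mm_stream_si64 (which generates movnti) and a 4096-byte page; the
function and macro names are made up.  The compiler may unroll or
reorder the loops itself, so check the generated assembly before
trusting any timings.

```c
#include <emmintrin.h>	/* _mm_stream_si64, _mm_sfence (amd64, SSE2) */
#include <stddef.h>

#define	PAGE_BYTES	4096	/* assumed base page size */

/* 1-way: one non-temporal 8-byte store per loop iteration. */
static void
pagezero_nt_1way(void *page)
{
	long long *p = page;
	size_t i;

	for (i = 0; i < PAGE_BYTES / 8; i++)
		_mm_stream_si64(&p[i], 0);
	_mm_sfence();		/* order the non-temporal stores */
}

/* 4-way: four stores per iteration, like the loop after the revert. */
static void
pagezero_nt_4way(void *page)
{
	long long *p = page;
	size_t i;

	for (i = 0; i < PAGE_BYTES / 8; i += 4) {
		_mm_stream_si64(&p[i + 0], 0);
		_mm_stream_si64(&p[i + 1], 0);
		_mm_stream_si64(&p[i + 2], 0);
		_mm_stream_si64(&p[i + 3], 0);
	}
	_mm_sfence();		/* order the non-temporal stores */
}
```

Timing each variant over many pages, with the CPU frequency fixed, would
show whether the unrolling factor is visible above the variance.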

Old tests show that if you improve the speed of pagecopy and pagezero
by a lot more than 1% like I do for i386-with-no-SSE2 using movntps,
then you get insignificant speedups for makeworld.  (-current uses
movnti for both pagecopy and pagezero on amd64, but on i386 it only
uses movnti for sse2_pagezero.)

movnti is used to bypass the cache.  It is not clear that this is best.
Bypassing the cache for other things seemed to give only large complexity
in exchange for a small loss.

I got best results for makeworld from an old version of FreeBSD that
did page zeroing in idle context.  That is unsupported/broken/done
differently now.  With page zeroing in idle context, bypassing the
cache is clearly right.  The speed of pagezero doesn't matter much
iff it is executed in idle context, so it should run slower if that is
necessary to bypass the cache.  It is probably wrong to bypass the cache
for zeroing on demand.  Then at least the bytes that caused the page to
be demanded are sure to be used soon.

Bruce


