Date: Sun, 5 Apr 2015 16:55:23 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Eitan Adler <eadler@freebsd.org>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject: Re: svn commit: r281103 - head/sys/amd64/amd64
Message-ID: <20150405163305.A2515@besplex.bde.org>
In-Reply-To: <201504050518.t355IFVJ001786@svn.freebsd.org>
References: <201504050518.t355IFVJ001786@svn.freebsd.org>

On Sun, 5 Apr 2015, Eitan Adler wrote:

> Log:
>   adrian asked me to revert and get more testing
>
> Modified:
>   head/sys/amd64/amd64/support.S
>
> Modified: head/sys/amd64/amd64/support.S
> ==============================================================================
> --- head/sys/amd64/amd64/support.S	Sun Apr 5 05:14:20 2015	(r281102)
> +++ head/sys/amd64/amd64/support.S	Sun Apr 5 05:18:14 2015	(r281103)
> @@ -73,11 +73,7 @@ ENTRY(pagezero)
>  	movnti	%rax,8(%rdi,%rdx)
>  	movnti	%rax,16(%rdi,%rdx)
>  	movnti	%rax,24(%rdi,%rdx)
> -	movnti	%rax,32(%rdi,%rdx)
> -	movnti	%rax,40(%rdi,%rdx)
> -	movnti	%rax,48(%rdi,%rdx)
> -	movnti	%rax,56(%rdi,%rdx)
> -	addq	$64,%rdx
> +	addq	$32,%rdx
>  	jne	1b
>  	sfence
>  	POP_FRAME_POINTER

My tests show that such changes make no difference unless you throttle
the CPU to make the memory fast in comparison.  Counting cycles shows
that no amount of unrolling is useful on a 4GHz CPU with memory slower
than 16GB/sec.  But there may be complications for write buffering.
8 bytes at a time is too small, so although movnti bypasses the caches
it must go through write buffers to combine writes.  Then writing in
groups of the same size as the write buffer may be best.

Tests on ref11-amd64 now show no significant difference between 1-way,
2-way, 4-way and 8-way unrolling.  4-way seems to be insignificantly
slowest, and 2-way and 8-way insignificantly equally fastest.
"Insignificantly" means less than 2% on a micro-benchmark, but there is
some variance which I wasn't careful to determine.

Old tests show that if you improve the speed of pagecopy and pagezero by
a lot more than 1%, like I do for i386-with-no-SSE2 using movntps, then
you get insignificant speedups for makeworld.  (-current uses movnti for
both pagecopy and pagezero on amd64, but on i386 it only uses movnti for
sse2_pagezero.)

movnti is used to bypass the cache.  It is not clear that this is best.
Bypassing the cache for other things seemed to give just large
complexity for a small loss.
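
The cycle-counting claim above can be checked with a little arithmetic.
The sketch below is not taken from the tests described in this message.
It assumes 8-byte movnti stores, the 4GHz and 16GB/sec figures quoted
above, and a loop whose only overhead is the addq and jne shown in the
diff.  The function names are made up for illustration.

```c
#include <assert.h>

/*
 * Cycles the CPU can spend on each store before the loop, rather than
 * memory, becomes the bottleneck.
 */
double
cycle_budget_per_store(double cpu_hz, double mem_bytes_per_sec,
    double bytes_per_store)
{
	assert(mem_bytes_per_sec > 0.0);
	return (cpu_hz * bytes_per_store / mem_bytes_per_sec);
}

/*
 * Instructions executed per store for a given unroll factor: the store
 * itself plus the addq and jne amortized over `unroll` stores.
 */
double
insns_per_store(int unroll)
{
	assert(unroll > 0);
	return (1.0 + 2.0 / unroll);
}
```

With these assumptions, cycle_budget_per_store(4e9, 16e9, 8.0) is 2.0
and insns_per_store(1) is 3.0, so even the non-unrolled loop keeps up
with memory if the core sustains 1.5 instructions per cycle.  That the
core can do so is itself an assumption, and write-buffer behaviour is
not modelled at all.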

I got best results for makeworld from an old version of FreeBSD that did
page zeroing in idle context.  That is unsupported/broken/done
differently now.  With page zeroing in idle context, bypassing the cache
is clearly right, and the speed of pagezero doesn't matter much iff it
is executed in idle context, and it should run slower if necessary to
bypass the cache.

It is probably wrong to bypass the cache for zeroing on demand.  Then at
least the bytes that caused the page to be demanded are sure to be used
soon.

Bruce
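
The distinction drawn above between zeroing in idle context and zeroing
on demand can be sketched in C.  This is an illustration only, not the
FreeBSD implementation: enum zero_context, page_zero(), page_zero_nt()
and PAGE_BYTES are hypothetical names, the 4096-byte page size is
assumed, and the policy per context follows the opinion in this message
rather than any measurement.  On x86-64 the _mm_stream_si64() intrinsic
compiles to movnti; on other targets the sketch falls back to memset().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>	/* _mm_stream_si64() and _mm_sfence() */
#define HAVE_MOVNTI 1
#else
#define HAVE_MOVNTI 0
#endif

#define PAGE_BYTES 4096	/* assumed page size */

/* Hypothetical tag for the context the zeroing runs in. */
enum zero_context { ZERO_IDLE, ZERO_ON_DEMAND };

/* Zero one page with cache-bypassing stores, 4-way unrolled. */
void
page_zero_nt(void *page)
{
	assert(((uintptr_t)page & 7) == 0);	/* 8-byte aligned */
#if HAVE_MOVNTI
	long long *p = page;
	size_t n = PAGE_BYTES / sizeof(*p);

	for (size_t i = 0; i < n; i += 4) {
		_mm_stream_si64(&p[i + 0], 0);
		_mm_stream_si64(&p[i + 1], 0);
		_mm_stream_si64(&p[i + 2], 0);
		_mm_stream_si64(&p[i + 3], 0);
	}
	_mm_sfence();	/* order the non-temporal stores, as in the diff */
#else
	memset(page, 0, PAGE_BYTES);	/* no movnti on this target */
#endif
}

/*
 * Pick a zeroing method by context: bypass the cache when idle, go
 * through the cache on demand since the page is about to be touched.
 */
void
page_zero(void *page, enum zero_context ctx)
{
	if (ctx == ZERO_IDLE)
		page_zero_nt(page);
	else
		memset(page, 0, PAGE_BYTES);
}
```

Both paths leave the page fully zeroed; they differ only in whether the
zeroed lines end up in the cache.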