Date:      Thu, 28 May 2015 01:21:31 +1000 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Kurt Lidl <lidl@pix.net>
Cc:        Bruce Evans <brde@optusnet.com.au>, Eitan Adler <eadler@freebsd.org>,  Adrian Chadd <adrian@freebsd.org>,  "src-committers@freebsd.org" <src-committers@freebsd.org>,  "svn-src-all@freebsd.org" <svn-src-all@freebsd.org>,  "svn-src-head@freebsd.org" <svn-src-head@freebsd.org>
Subject:   Re: svn commit: r281103 - head/sys/amd64/amd64
Message-ID:  <20150528002223.O3265@besplex.bde.org>
In-Reply-To: <5565CC49.1020800@pix.net>
References:  <201504050518.t355IFVJ001786@svn.freebsd.org> <20150405163305.A2515@besplex.bde.org> <CAF6rxgkZA=GbyQFhQC63c9z+y_ki+yjt6fZW+P9cHve5L=pYoA@mail.gmail.com> <20150406152653.K1066@besplex.bde.org> <5565CC49.1020800@pix.net>

On Wed, 27 May 2015, Kurt Lidl wrote:

> On 4/6/15 1:42 AM, Bruce Evans wrote:
>> On Mon, 6 Apr 2015, Eitan Adler wrote:
>> 
>>> + a few people interested in the diff
>>> 
>>> On 5 April 2015 at 02:55, Bruce Evans <brde@optusnet.com.au> wrote:
>>>> On Sun, 5 Apr 2015, Eitan Adler wrote:
>>> 
>>> I did not confirm the performance impact, but the submitter and others
>>> indicated they saw a difference.
>>> 
>>> Do you have specific data that shows that there was an improvement?
>> 
>> Only micro-benchmark output that indicates little difference.  This
>> is probably very MD (depending on write combining hardware), so you
>> might only see a difference on some systems.
>> 
>> I also have micro-benchmark output for network packets/second that
>> shows 10% differences for the change of adding 1 byte of padding
>> in code that is never executed.  This seems to be due to different
>> cache misses.  To eliminate differences from this (except ones
>> caused by actually running different code), create a reference
>> version by padding the functions or data to be changed so that
>> the change doesn't affect the address of anything except the
>> internals of the changed parts.
>> 
>> I might try a makeworld run to see if changing the non-temporal
>> accesses in pagecopy and pagezero to cached accesses makes a difference.
>
> I ran a few (total of 12) buildworld runs after this discussion.
> I finally got around to posting the results to the original bug.
>
> The data is here:
>
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199151#c3

I can't read that, but I ran many related benchmarks on a new system.

Haswell CPUs have very fast "rep movsb" for large copies within the L1
cache.  These run at 32 bytes/cycle.  Nothing except copying through
AVX registers can get anywhere near this.  The next best is copying
through SSE registers at 16 bytes/cycle.  However, the L1 cache is
not very large, and "rep movsb" has a large setup overhead -- about
23 cycles.  A single 4K page is barely large enough for the setup
overhead not to dominate: it takes 23 cycles to set up, then only
128 more cycles (4096 / 32) at 32 bytes/cycle to do the work.
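
For concreteness, a minimal sketch of such a copy as GCC-style inline
asm (only an illustration, not the kernel's actual copying code, which
is in assembler):

#include <stddef.h>

/*
 * Minimal "rep movsb" copy: %rdi = dst, %rsi = src, %rcx = count.
 * For one L1-resident 4K page this costs about 23 (setup) +
 * 4096 / 32 = 151 cycles on Haswell.
 */
static void
copy_rep_movsb(void *dst, const void *src, size_t len)
{
        __asm __volatile("rep movsb"
            : "+D" (dst), "+S" (src), "+c" (len)
            :
            : "memory");
}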

Page zeroing and copying is rarely within the L1 cache.  Within the
L2 cache, the speed of "rep movsb" drops to only about 8 bytes/cycle.
Copying through SSE registers can easily keep up with this, perhaps
even in a non-unrolled loop.  Copying through 64-bit integer registers
as for page zeroing and copying on amd64 can not so easily keep up
with this.  I think a non-unrolled loop runs at about 2 cycles/iteration.
That only does 4 bytes/cycle with 64-bit registers, and only 2
bytes/cycle with 32-bit registers.  amd64 uses 4-way unrolling.
Apparently, the load/store instructions run at at most 1 pair per
cycle, giving a maximum of 8 bytes per cycle; any loop overhead then
reduces the throughput to less than 8 bytes/cycle, so more unrolling helps
a little.
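
In C, the 4-way unrolled copy through 64-bit registers looks roughly
like the sketch below (the real amd64 pagecopy is in assembler and
differs in details):

#include <stdint.h>

/*
 * Copy one 4K page through 64-bit integer registers, 4-way unrolled.
 * At 1 load/store pair per cycle this is bounded by 8 bytes/cycle;
 * the unrolling amortizes the loop overhead over 4 pairs.
 */
static void
copy_unrolled64(uint64_t *dst, const uint64_t *src)
{
        int i;

        for (i = 0; i < 4096 / 8; i += 4) {
                dst[i + 0] = src[i + 0];
                dst[i + 1] = src[i + 1];
                dst[i + 2] = src[i + 2];
                dst[i + 3] = src[i + 3];
        }
}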

Page zeroing and copying might rarely be within even the L2 or L3
cache.  Then the speed on Haswell drops to that of main memory, which
is about 1.25 bytes/cycle on my system.  Almost any method can keep
up with this in theory, but in practice nontemporal stores through SSE
registers (movntps is best for portability) are fastest (not counting
their advantage of not thrashing the caches), and "rep movsb" is almost
as good, and 128-bit accesses through SSE registers are almost as good
as "rep movsb".  However, in tests of makeworld on i386 systems with
non-bloated workds it was better by 1-2% to not use nontemporal stores
at all.  i386 only uses them for pagezero, and only uses 32-bit movnti
for them.  Replacing this with simple memcpy (which uses "rep stosl",
which runs at the same speed as "rep movsb") gave the 1-2% improvement.
I also tried using 32-bit movnti for pagecopy -- this gave a 1-2%
unimprovement.  Perhaps the 32-bit accesses are too small, but "rep
movsb" is so fast that it is hard to beat.  (This was with current
kernels and an old userland.  My version uses movntps for i386 pagecopy
and pagezero, and this gives improvements in the 1-2% range on older
CPUs.)
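
For reference, zeroing a page through movntps looks like the following
with SSE intrinsics (a sketch only; the kernel versions are in
assembler, and the page must be 16-byte aligned):

#include <xmmintrin.h>

/*
 * Zero one 4K page with nontemporal 16-byte stores (movntps).  The
 * sfence keeps the weakly-ordered nontemporal stores from becoming
 * visible out of order to later code.
 */
void
pagezero_nt(void *page)
{
        float *p = page;
        __m128 zero = _mm_setzero_ps();
        int i;

        for (i = 0; i < 4096 / 16; i++)
                _mm_stream_ps(p + 4 * i, zero);
        _mm_sfence();
}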

For nontemporal stores to be a pessimization, the page zeroing and
copying must often give more hits later when nontemporal stores are
not used.  This is possible with cache sizes of several MB and my
non-bloated world where the compiler's size is 5MB instead of 50MB as
in -current.  The cache size on my CPU is 8MB.  This is shared with 4
real cores and 4 HTT cores so it is only 1MB per CPU and only about
512K per runnable thread with -j16.  Even 5MB is more than enough bloat
to thrash 512K.  However, there is apparently enough locality for
caching to help even for zeroed pages.  If the zeroing happens on
demand, then the page is likely to be accessed soon, so caching
it helps.  If the zeroing happens in advance, then under load perhaps
zeroed pages get used soon enough that any caching of them helps.
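
A userland sketch of the kind of micro-benchmark involved (hypothetical
and minimal: none of the serialization, warmup or averaging that real
measurements need; pagezero_nt() is the movntps sketch above).  Note
that this only times the zeroing itself; the later cache-hit benefit of
cached zeroing only shows up in a workload like makeworld:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtsc() */

void pagezero_nt(void *page);   /* the movntps sketch above */

#define NPAGES  2048            /* 8MB buffer, matching the shared L3 */

int
main(void)
{
        char *buf;
        uint64_t t0, t1;
        int i;

        buf = aligned_alloc(4096, (size_t)NPAGES * 4096);
        if (buf == NULL)
                return (1);

        /* Cached zeroing via plain memset. */
        t0 = __rdtsc();
        for (i = 0; i < NPAGES; i++)
                memset(buf + (size_t)i * 4096, 0, 4096);
        t1 = __rdtsc();
        printf("cached:      %ju cycles/page\n",
            (uintmax_t)(t1 - t0) / NPAGES);

        /* Nontemporal zeroing: bypasses the caches. */
        t0 = __rdtsc();
        for (i = 0; i < NPAGES; i++)
                pagezero_nt(buf + (size_t)i * 4096);
        t1 = __rdtsc();
        printf("nontemporal: %ju cycles/page\n",
            (uintmax_t)(t1 - t0) / NPAGES);
        return (0);
}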

Bruce


