From: Bruce Evans <brde@optusnet.com.au>
Date: Mon, 24 Jun 2013 21:13:11 +1000 (EST)
To: Gleb Smirnoff
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Konstantin Belousov, Bruce Evans
Subject: Re: svn commit: r252032 - head/sys/amd64/include

On Mon, 24 Jun 2013, Gleb Smirnoff wrote:

> did you run your benchmarks in userland or in kernel? How many
> parallel threads were updating the same counter?
>
> Can you please share your benchmarks?

Only userland, with 1 thread.  I don't have any more benchmarks than the
test program in my previous mail.

I don't see how threads have anything to do with the efficiency of counter
incrementation, unless slow locking is used.  With threads for the same
process on the same CPU, the accesses are not really different from
accesses by a single thread.  With threads for different processes on the
same CPU, switching the address space will thrash the cache for user
threads, but the pcpu area in kernel memory shouldn't be switched.  It
seems difficult to simulate pcpu in user address space.  With threads for
the same or different processes on different CPUs, there is no contention
for pcpu counters.

Please remind me of your old tests that did show some efficiency
differences.  IIRC, direct increment was unexpectedly slower.  Was that on
a 64-bit system?  I guess it wasn't, since the committed version just uses
a direct increment on amd64.
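For reference, the direct increment needs no loop at all on amd64, since a
single addq through the pcpu segment register hits only this CPU's slot of
the counter and cannot be torn.  From memory (the committed header may
differ in detail), it is something like:

	static inline void
	counter_u64_add(counter_u64_t c, int64_t inc)
	{

		/*
		 * %gs points at this CPU's pcpu area, so adding the counter's
		 * offset within the pcpu zone addresses this CPU's 64-bit slot.
		 * One read-modify-write instruction: no lock, no cmpxchg, and
		 * no mixed-width accesses to cause store-to-load penalties.
		 */
		__asm __volatile("addq\t%1,%%gs:(%0)"
		    :
		    : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
		    : "memory", "cc");
	}

That is the baseline that any i386 version has to be compared against.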
On i386 CPUs that have cmpxchg8b, using cmpxchg8b might be more efficient
because it is a 64-bit access.  I don't see how that can be, since 4
32-bit accesses are needed to set up the cmpxchg8b.  In fact, 1 of these
accesses can be extremely slow, since it has a store-to-load penalty on
some arches (I have considerable experience with store-to-load penalties
in FP code, and large uncommitted asms in libm to avoid them).  Here is
the access that is likely to have the penalty:

% static inline void
% counter_64_inc_8b(uint64_t *p, int64_t inc)
% {
%
% 	__asm __volatile(
% 	"movl %%fs:(%%esi),%%eax\n\t"

The previous store was a cmpxchg8b.  Presumably that was 64 bits.  No
problem for this load, since it is at the same address as the store and
its size mismatch doesn't have the penalty on any CPU that I know of.

% 	"movl %%fs:4(%%esi),%%edx\n"

Store-to-load mismatch penalty on at least AthlonXP and Athlon64.  The
load is from the middle of a 64-bit store, and at least these CPUs don't
have hardware to forward it from the write buffer.  Costs 10-20 cycles.
Phenom is documented to have extra hardware to make this case as fast as
the previous case.  I haven't tested Phenom.  According to FP benchmarks,
store-to-load penalties are large on core2 and corei7 too.

% 	"1:\n\t"
% 	"movl %%eax,%%ebx\n\t"
% 	"movl %%edx,%%ecx\n\t"
% 	"addl (%%edi),%%ebx\n\t"
% 	"adcl 4(%%edi),%%ecx\n\t"

These extra memory accesses are unfortunately necessary because there
aren't enough registers and the asm is a bit too simple (e.g., to add 1,
more complicated asm could just add $1 with carry here, but the current
asm has to store 1 to a 64-bit temporary memory variable so that it can
be loaded here).  These are all 32-bit accesses, so they don't have
penalties.  There is just a lot of memory traffic for them.

% 	"cmpxchg8b %%fs:(%%esi)\n\t"

This presumably does a 64-bit load followed by a 64-bit store (when it
succeeds).  The load matches the previous store, so there is no penalty.

% 	"jnz 1b"
% 	:
% 	: "S" ((char *)p - (char *)&__pcpu[0]), "D" (&inc)
% 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
% }

The penalty may be unimportant in normal use, because loads are normally
separated from stores by long enough to give the write buffers a chance
to flush to the cache.  But loop benchmarks will always see it, unless the
loop does enough things between the store and the load to give that large
a separation.

Note that the penalty affects loads, so its latency is normally not
hidden.  I forgot about this when I ran tests on Athlon64.  Athlon64 was
only about 6 cycles slower than core2, for about 20 cycles per iteration
altogether.  Not much more, but 20 is about the penalty time, so maybe
the loop ends up testing just the penalty time, with all the other
latencies in parallel with the penalty.

For a quick test of this, I replaced the load that has the penalty by a
load of immediate 0.  This reduced the time to 14.5 cycles, so the penalty
is at least 5.5 cycles.  (Note that in the benchmark the counter only goes
up to about 2 billion, so the high 32 bits always have value 0, so loading
immediate 0 gives the same result.)

On core2 (ref10-i386) and corei7 (freefall), the same change has no effect
on the time.  This shows that the penalty doesn't apply on core2 or
corei7, and that the FP penalties that I see there have a different
source.  ...  Testing shows that they are for loads of 64-bit values that
are mismatched because the value was built up using 2 32-bit stores.
Test program:

% #include <stdint.h>
%
% uint64_t foo;
%
% int
% main(void)
% {
% 	unsigned i;
%
% 	for (i = 0; i < 2666813872; i++)	/* sysctl -n machdep.tsc_freq */
% 		asm volatile(
% #ifdef NO_PENALTY
% 		"movq %%rax,foo; movq foo,%%rax"
% 		: : : "rax");
% #else
% 		"movl %%eax,foo; movl %%eax,foo+4; movq foo,%%rax"
% 		: : : "rax");
% #endif
% }

This shows a penalty of 10 cycles on freefall (5+ cycles without the
penalty and 15+ with it).

To test on i386, SSE must be used:

% #include <stdint.h>
%
% double foo;
%
% int
% main(void)
% {
% 	unsigned i;
%
% 	for (i = 0; i < 1861955704; i++)	/* sysctl -n machdep.tsc_freq */
% 		asm volatile(
% #ifdef NO_PENALTY
% 		"movsd %%xmm0,foo; movsd foo,%%xmm0"
% 		: : : "xmm0");
% #else
% 		"movl $0,foo; movl $0,foo+4; movsd foo,%%xmm0"
% 		: : : "xmm0");
% #endif
% }

The penalty is relatively even larger on freefall, since SSE is faster
there for some reason.  Now the no-penalty case takes 4.5+ cycles and the
penalty case takes 14.7+ cycles.  On ref10-i386, the penalty case takes
13 cycles and the non-penalty case 5.  On Athlon64 (i386), the penalty
case takes 20 cycles and the non-penalty case 9.  Athlon64 apparently
handles SSE poorly here: it takes only 5 cycles for 2 matched 32-bit
loads and stores.

Normal code avoids these penalties by not mixing loads and stores of
different widths.  FP code that does things in bits runs into them in
32-bit mode: normal memory accesses in FP code are for doubles and long
doubles, with access widths of 8 and 8+2 bytes respectively, but to get
at the bits in 32-bit mode dumb source code and compilers do 32-bit
accesses.  The fix in FP code is to use SSE packing, unpacking and
shuffling operations to keep the access widths the same.  These take a
while, but not as long as the penalty, and their latency can be hidden
in pipelines better than the penalty's.

In the counter increment code, many fixes are possible:
- simplify the code, like I have been trying to do (see the sketch in
  the PS below)
- avoid using cmpxchg8b in the usual case where only the low word
  changes; use cmpxchg on the low word then
- use cmpxchg8b for the initial load.  This didn't work at all.  It was
  2 cycles slower on Athlon64, where it might help, and about 8 cycles
  slower on core2, where it is not needed.  (I used a sloppy version:
  movl $-1 to %edx so that it doesn't match; then replace the initial
  load by cmpxchg8b.)  cmpxchg8b takes about 9 cycles on Athlon64 and
  core2, and about 4 on corei7.  On Athlon64, using it for the initial
  load avoids the store-to-load penalty but doesn't quite break even,
  since it takes so long.  On core2 and corei7, using it just adds its
  slowness.

You might not believe my timings.  Check them in recent vendor docs and
on Agner Fog's web site.  The old (2002) Athlon (paper) manual that I
have handy gives the following latencies: 6 for cmpxchg and 39 for
cmpxchg8b.  If cmpxchg8b was really that much slower on old CPUs, it
should be avoided more.

Bruce
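PS: to give the first item in the list above a concrete shape, here is one
possible simplification (only a sketch with made-up names, not necessarily
the exact version discussed earlier in the thread).  It keeps both 32-bit
halves on one CPU with a critical section; the reader side, which may
still see the two halves update non-atomically, has to be handled
separately:

	static inline void
	counter_u64_add_simple(uint64_t *p, uint64_t inc)
	{

		critical_enter();	/* don't migrate between the 2 halves */
		__asm __volatile(
		    "addl %1,%%fs:(%0)\n\t"	/* low word */
		    "adcl %2,%%fs:4(%0)"	/* high word plus carry */
		    :
		    : "r" ((char *)p - (char *)&__pcpu[0]),
		      "ri" ((uint32_t)inc),
		      "ri" ((uint32_t)(inc >> 32))
		    : "memory", "cc");
		critical_exit();
	}

All the accesses are 32 bits wide, so there is no store-to-load mismatch,
and there is no cmpxchg8b loop and no 64-bit temporary in memory for the
increment.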