Date: Mon, 24 Jun 2013 21:13:11 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org,
    Konstantin Belousov <kib@FreeBSD.org>, Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r252032 - head/sys/amd64/include
Message-ID: <20130624182434.C2235@besplex.bde.org>
In-Reply-To: <20130624081215.GE1214@FreeBSD.org>
References: <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org>
    <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org>
    <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org>
    <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org>
    <20130624081215.GE1214@FreeBSD.org>
On Mon, 24 Jun 2013, Gleb Smirnoff wrote:

> did you run your benchmarks in userland or in kernel? How many
> parallel threads were updating the same counter?
>
> Can you please share your benchmarks?

Only userland, with 1 thread.

I don't have any more benchmarks than the test program in the previous
mail.

I don't see how threads have anything to do with the efficiency of
counter incrementation, unless slow locking is used.  With threads for
the same process on the same CPU, the accesses are not really different
from accesses by a single thread.  With threads for different processes
on the same CPU, switching the address space will thrash the cache for
user threads, but the pcpu area in kernel memory shouldn't be switched.
It seems difficult to simulate pcpu in user address space.  With threads
for the same or different processes on different CPUs, there is no
contention for pcpu counters.

Please remind me of your old tests that did show some efficiency
differences.  IIRC, direct increment was unexpectedly slower.  Was that
on a 64-bit system?  I guess it wasn't, since the committed version just
uses a direct increment on amd64.

On i386, using cmpxchg8b might be more efficient because it is a 64-bit
access.  I don't see how that can be, since 4 32-bit accesses are needed
to set up the cmpxchg8b.  In fact, 1 of these accesses can be extremely
slow since it has a store-to-load penalty on some arches (I have
considerable experience with store-to-load penalties in FP code and
large uncommitted asms in libm to avoid them).

Here is the access which is likely to have the penalty:

% static inline void
% counter_64_inc_8b(uint64_t *p, int64_t inc)
% {
%
% 	__asm __volatile(
% 	"movl %%fs:(%%esi),%%eax\n\t"

The previous store was a cmpxchg8b.  Presumably that was 64 bits.  No
problem for this load, since it is at the same address as the store and
its size mismatch doesn't have the penalty on any CPU that I know of.

% 	"movl %%fs:4(%%esi),%%edx\n"

Store-to-load mismatch penalty on at least AthlonXP and Athlon64.  The
load is from the middle of a 64-bit store, and at least these CPUs don't
have hardware to forward it from the write buffer.  Costs 10-20 cycles.
Phenom is documented to have extra hardware to make this case as fast
as the previous case.  I haven't tested Phenom.  According to FP
benchmarks, store-to-load penalties are large on core2 and corei7 too.

% 	"1:\n\t"
% 	"movl %%eax,%%ebx\n\t"
% 	"movl %%edx,%%ecx\n\t"
% 	"addl (%%edi),%%ebx\n\t"
% 	"adcl 4(%%edi),%%ecx\n\t"

These extra memory accesses are unfortunately necessary because there
aren't enough registers and the asm is a bit too simple (e.g., to add 1,
more complicated asm could just add $1 with carry here, but the current
asm has to store 1 to a 64-bit temporary memory variable so that it can
be loaded here).  These are all 32-bit accesses, so they don't have
penalties.  There is just a lot of memory traffic for them.

% 	"cmpxchg8b %%fs:(%%esi)\n\t"

This presumably does a 64-bit load followed by a 64-bit store (when it
succeeds).  The load matches the previous store, so there is no penalty.

% 	"jnz 1b"
% 	:
% 	: "S" ((char *)p - (char *)&__pcpu[0]), "D" (&inc)
% 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
% }

The penalty may be unimportant in normal use, because loads are normally
separated from stores by long enough to give the write buffers a chance
to flush to the cache.  But loop benchmarks will always see it unless
the loop does enough things between the store and the load to give the
large separation.
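To make that concrete, the kind of userland loop benchmark being
discussed can be approximated along the following lines.  This is only a
sketch, not the actual benchmark from the earlier mail: a plain pointer
stands in for the %fs pcpu addressing (pcpu is hard to simulate in user
address space, as noted above), the function names and iteration count
are made up, and cycles are counted with rdtsc in a 32-bit build.

% #include <stdint.h>
% #include <stdio.h>
%
% static uint64_t counter __attribute__((aligned(8)));
%
% static inline uint64_t
% rdtsc(void)
% {
% 	uint32_t lo, hi;
%
% 	__asm __volatile("rdtsc" : "=a" (lo), "=d" (hi));
% 	return ((uint64_t)hi << 32 | lo);
% }
%
% static inline void
% inc_8b(uint64_t *p, uint64_t inc)
% {
% 	__asm __volatile(
% 	"movl (%%esi),%%eax\n\t"
% 	"movl 4(%%esi),%%edx\n"		/* the load with the penalty */
% 	"1:\n\t"
% 	"movl %%eax,%%ebx\n\t"
% 	"movl %%edx,%%ecx\n\t"
% 	"addl (%%edi),%%ebx\n\t"
% 	"adcl 4(%%edi),%%ecx\n\t"
% 	"cmpxchg8b (%%esi)\n\t"
% 	"jnz 1b"
% 	:
% 	: "S" (p), "D" (&inc)
% 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
% }
%
% int
% main(void)
% {
% 	uint64_t start, end;
% 	unsigned i, n = 100000000;	/* arbitrary iteration count */
%
% 	start = rdtsc();
% 	for (i = 0; i < n; i++)
% 		inc_8b(&counter, 1);
% 	end = rdtsc();
% 	printf("%.1f cycles/iteration\n", (double)(end - start) / n);
% 	return (0);
% }

Each iteration's initial loads come right after the previous iteration's
cmpxchg8b store, which is exactly the back-to-back store/load pattern
that exposes the penalty.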
Note that the penalty affects loads, so its latency is normally not
hidden.  I forgot about this when I ran tests on Athlon64.  Athlon64 was
only about 6 cycles slower than core2, for about 20 cycles per iteration
altogether.  Not much more, but 20 is about the penalty time, so maybe
the loop ends up testing just the penalty time, with all the other
latencies in parallel with the penalty.

For a quick test of this, I replaced the load that has the penalty by a
load of immediate 0.  This reduced the time to 14.5 cycles.  So the
penalty is at least 5.5 cycles.  (Note that in the benchmark, the
counter only goes up to about 2 billion, so the high 32 bits always have
value 0, so loading immediate 0 gives the same result.)

On core2 (ref10-i386) and corei7 (freefall), the same change has no
effect on the time.  This shows that the penalty doesn't apply on core2
or corei7, and the FP penalties that I see there have a different
source.  ...  Testing shows that they are for loads of 64-bit values
that are mismatched since the value was built up using 2 32-bit stores.
Test program:

% #include <stdint.h>
%
% uint64_t foo;
%
% int
% main(void)
% {
% 	unsigned i;
%
% 	for (i = 0; i < 2666813872; i++)	/* sysctl -n machdep.tsc_freq */
% 		asm volatile(
% #ifdef NO_PENALTY
% 		    "movq %%rax,foo; movq foo,%%rax"
% 		    : : : "rax");
% #else
% 		    "movl %%eax,foo; movl %%eax,foo+4; movq foo,%%rax"
% 		    : : : "rax");
% #endif
% }

This shows a penalty of 10 cycles on freefall (5+ cycles without the
penalty and 15+ with it).

To test on i386, SSE must be used:

% #include <stdint.h>
%
% double foo;
%
% int
% main(void)
% {
% 	unsigned i;
%
% 	for (i = 0; i < 1861955704; i++)	/* sysctl -n machdep.tsc_freq */
% 		asm volatile(
% #ifdef NO_PENALTY
% 		    "movsd %%xmm0,foo; movsd foo,%%xmm0"
% 		    : : : "xmm0");
% #else
% 		    "movl $0,foo; movl $0,foo+4; movsd foo,%%xmm0"
% 		    : : : "xmm0");
% #endif
% }

The penalty is relatively even larger on freefall, since SSE is faster
for some reason.  Now the no-penalty case takes 4.5+ cycles and the
penalty case takes 14.7+ cycles.  On ref10-i386, the penalty case takes
13 cycles and the non-penalty case 5.  On Athlon64 (i386), the penalty
case takes 20 cycles and the non-penalty case 9.  Athlon64 apparently
handles SSE poorly here.  It takes only 5 cycles for 2 matched 32-bit
loads and stores.

Normal code avoids these penalties by not mixing loads and stores of
different widths.  FP code that does things in bits runs into them in
32-bit mode because normal memory accesses in FP code are for doubles
and long doubles and have access widths 8 and 8+2, respectively, but to
access bits in 32-bit mode dumb source code and compilers do 32-bit
accesses.  The fix in FP code is to use SSE packing, unpacking and
shuffling operations to keep the access widths the same.  These take a
while, but not as long as the penalty, and their latency can be hidden
in pipelines better than the penalty.

In the counter increment code, many fixes are possible:
- simplify the code, like I have been trying to do
- avoid using cmpxchg8b in the usual case where only the low word
  changes, and use cmpxchg on the low word then (see the sketch after
  this list)
- use cmpxchg8b for the initial load.  This didn't work at all.  It was
  2 cycles slower on Athlon64 where it might help, and about 8 cycles
  slower on core2 where it is not needed.  (I used a sloppy version:
  movl $-1 to %edx so that it doesn't match; then replace the initial
  load by cmpxchg8b.)  cmpxchg8b takes about 9 cycles on Athlon64 and
  core2, and about 4 on corei7.
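For the second fix, the point is that an increment almost never carries
out of the low 32 bits, so the usual case only needs a 32-bit cmpxchg.
A rough sketch of that fast path (userland form only: an ordinary
pointer, a made-up function name, and the compiler's __atomic builtin
standing in for the cmpxchg8b loop in the rare carry case; a kernel
version would keep the %fs pcpu addressing and the unlocked cmpxchg8b
loop quoted earlier):

% #include <stdint.h>
%
% static void
% counter_inc_low(uint64_t *p, uint32_t inc)
% {
% 	uint32_t *lo = (uint32_t *)p;	/* i386 is little-endian */
% 	uint32_t old, new;
% 	uint64_t cur, want;
% 	char ok;
%
% 	old = *lo;
% 	while (old + inc >= old) {	/* usual case: no carry out of the low word */
% 		new = old + inc;
% 		/*
% 		 * 32-bit cmpxchg on the low word only.  In the per-CPU
% 		 * kernel setting no lock prefix is needed, since only
% 		 * interrupts on the same CPU can race with the update and
% 		 * a single instruction is atomic with respect to them.
% 		 */
% 		__asm __volatile("cmpxchgl %3,%1\n\tsete %0"
% 		    : "=q" (ok), "+m" (*lo), "+a" (old)
% 		    : "r" (new)
% 		    : "cc", "memory");
% 		if (ok)
% 			return;
% 		/* cmpxchg left the current low word in old; retry. */
% 	}
%
% 	/*
% 	 * Rare case: the add carries into the high word, so the whole
% 	 * 64 bits must be updated with a full compare-and-swap.
% 	 */
% 	cur = *p;
% 	do {
% 		want = cur + inc;
% 	} while (!__atomic_compare_exchange_n(p, &cur, want, 0,
% 	    __ATOMIC_RELAXED, __ATOMIC_RELAXED));
% }

The win is that the usual path only ever loads and stores the low 32
bits, so there is no mismatched load to take the penalty; only the rare
carry path still pays for a full 64-bit compare-and-swap.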
On Athlon64, using cmpxchg8b for the initial load avoids the
store-to-load penalty but doesn't quite break even, since cmpxchg8b
takes so long.  On core2 and corei7, using it just adds its slowness.

You might not believe my timings.  Check them in recent vendor docs and
on Agner Fog's web site.  The old (2002) Athlon (paper) manual that I
have handy gives the following latencies: 6 cycles for cmpxchg and 39
for cmpxchg8b.  If cmpxchg8b was really that much slower on old CPUs, it
should be avoided more.

Bruce