Date: Sat, 22 Jun 2013 12:35:33 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org,
    Konstantin Belousov <kib@FreeBSD.org>, Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r252032 - head/sys/amd64/include
Message-ID: <20130622110352.J2033@besplex.bde.org>
In-Reply-To: <20130621135427.GA1214@FreeBSD.org>
References: <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org>
    <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org>
    <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org>
    <20130621135427.GA1214@FreeBSD.org>
On Fri, 21 Jun 2013, Gleb Smirnoff wrote:

> On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
> B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
> B>
> B> I said to use a daemon to convert small (16 or 32 bit) counters into
> B> larger (32 or 64 bit) ones.  It is almost as efficient to call the
> B> accumulation code directly:

Please trim quotes more.

> [... lots of technical details]
> May be you are right and an implementation with a "daemon" would be
> more efficient for i386. But daemon isn't needed at all on 64-bit
> platforms. So I doubt it is worth coding it.

Of course it would be more efficient on i386.  Probably it would be more
efficient on amd64 too.  The main counter array would be half the size,
so it takes less cache, fewer write buffers, and smaller instructions.

> Frankly speaking, we already do not consider the i386 as a platform to
> build high performance solutions. The need for counter(9) in TCP/IP
> stack was demonstrated on modern amd64 hardware. Thus, our motivation
> was to make:

I need to complain about lost performance in i386 more than :-).  i386
is faster for anything that doesn't need large memory.

> - a lossless and faster implementation for amd64
> - a lossless implementation for !amd64
> - if we have an option to avoid critical section on some arch, good
>   let's avoid it.

I already gave an implementation to avoid the critical section.  But
using the critical section seems to be the second best method.  The best
method is using 32-bit counters and a daemon.

Here are considerably expanded tests, with noninline tests dropped.
Summary of times on Athlon64:

simple increment:                                  4-7 cycles (1)
simple increment preceded by feature test:         5-8 cycles (1)
simple 32-bit increment:                           4-7 cycles (2)
correct 32-bit increment (addl to mem):            5.5-7 cycles (3)
inlined critical section:                          8.5 cycles (4)
better inlined critical section:                   7 cycles (5)
correct unsigned 32-bit inc of 64-bit counter:     4-7 cycles (6)
"improve" previous to allow immediate operand:     5+ cycles
correct signed 32-bit inc of 64-bit counter:       8.5-9 cycles (7)
correct 64-bit inc of 64-bit counter:              8-9 cycles (8)
-current method (cmpxchg8b):                       18 cycles

(1) Unusable due to races.

(2) It's no faster to do a simple increment of a 32-bit counter than a
64-bit one!  This indicates delicate pipelining/memory access behaviour.
The Athlon64 timing for addl to memory is 4 cycles latency.  The tests
show this being achieved.  But since it is only latency, and an Athlon64
can do something like 1 64-bit store and 1 64-bit load per cycle (or 2
64-bit loads but not 2 64-bit stores per cycle?), there is plenty of
time to do the adcl to memory and also the loop overhead in
not-quite-the-same 4 cycles, so that the loop throughput is 1 per 4
cycles.  In practical/non-loop use, you would care even less about the
store latency.  Compilers often mess this up to get 7 cycles instead of
4.  -O2 generally gives the worst results, by unrolling the loop.  Since
the rolled loop can execute in 4 cycles, unrolling it can only pessimize
it.

(3) The single 32-bit addl, combined with a daemon, is what should be
used.  But compilers generated worse code for this than the similar but
more branchy code in (6).

(4) The critical section method is quite fast when inlined.

(5) The critical section method is even faster when optimized.  This is
what should be used if you don't want the complications of the daemon.

(6) Call the daemon directly ("daemon" code is now just the current code
in an extern function.  The extern function is a dummy that is never
called in the tests; it exists just to simulate the branch prediction
penalties).  Some restrictions on the increment.

(7) Same as (6), with fewer restrictions on the increment.  This uses
cmpxchg just like the -current method uses cmpxchg8b, provided a 32-bit
increment is sufficient.  Races are detected as the low bits changing.
We don't care if the high bits change while we are changing the low
bits.

(8) Same as (7), with no restrictions on the increment.  The full
version somehow turned out to have a better fastest case than (7).  This
is without even an optimization for constant increments that would
eliminate the runtime check in most cases.  The extra code for the
runtime check makes things faster :-).
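To make the 32-bit-counter-plus-daemon scheme in (3) and (6) concrete,
here is a minimal single-threaded C sketch.  It is illustrative only:
the names are made up, per-CPU placement and the periodic daemon pass
are omitted, and it models only the rare carry path that the
"jnc ... call dummy_accum" sequence in the tests below stands in for.

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical names.  One slot stands in for one CPU's pcpu area; the
 * wide total is touched only on the rare carry out of the 32-bit
 * counter (a real daemon would also drain the small counter
 * periodically).
 */
static uint32_t pcpu_small;		/* small per-CPU counter */
static uint64_t total_high;		/* high part, updated rarely */

static void
carry_accum(void)
{
	/* The 32-bit counter just wrapped: credit the lost 2^32. */
	total_high += (uint64_t)1 << 32;
}

static inline void
small_counter_add(uint32_t inc)
{
	uint32_t old;

	old = pcpu_small;
	pcpu_small = old + inc;
	if (pcpu_small < old)		/* carry: rare slow path */
		carry_accum();
}

static uint64_t
counter_fetch(void)
{
	/* What the daemon or the read side does: fold small into wide. */
	return (total_high + pcpu_small);
}

int
main(void)
{
	unsigned i;

	for (i = 0; i < 100000; i++)
		small_counter_add(1U << 20);
	printf("%ju\n", (uintmax_t)counter_fetch());
	return (0);
}

In the kernel version the increment would be the single addl with a jnc
over the call, and the daemon would drain each CPU's 32-bit counter
often enough that the carry path is almost never taken.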
% #include <stdint.h>
% #include <stdio.h>
%
% static inline void
% counter_64_inc_8b(volatile uint64_t *p, int64_t inc)
% {
%
% 	__asm __volatile(
% 	"movl	%%ds:(%%esi),%%eax\n\t"
% 	"movl	%%ds:4(%%esi),%%edx\n"
% "1:\n\t"
% 	"movl	%%eax,%%ebx\n\t"
% 	"movl	%%edx,%%ecx\n\t"
% 	"addl	(%%edi),%%ebx\n\t"
% 	"adcl	4(%%edi),%%ecx\n\t"
% 	"cmpxchg8b %%ds:(%%esi)\n\t"
% 	"jnz	1b"
% 	:
% 	: "S" (p), "D" (&inc)
% 	: "memory", "cc", "eax", "edx", "ebx", "ecx");
% }

I also don't like the style of this asm.  A better style is used below.

The memory clobber pessimization in the above is not used below.  The
effects of memory clobbers are visible in the tests as the loop counter
and 'inc' value not staying in registers.  cmpxchg8b uses so many
registers that it is hard to keep things in registers anyway, so the
pessimization is smaller in the above.  However, since the add* and/or
cmpxchg* instructions have such a large latency, there is plenty of time
to reload all the loop variables without increasing the loop latency.
In practical use, spilling all the local variables may have a higher
cost, since the latency of the counter add is unimportant.  Anyway, the
asms should be optimized for throughput, not for latency.
Unfortunately, loop tests emphasize latency.

The only general ways that I know of to optimize throughput for
non-loops are:
- use highest-throughput instructions, of course
- minimize the number of micro-instructions
- minimize dependencies.

The dependencies for addl to memory are problematic here.  Optimal code
would schedule the load for this addl early in a free load slot, then do
the add and the store at leisure (their latency doesn't matter).  The
addl to memory instruction does precisely the wrong thing here.  The CPU
can only sometimes reschedule it well, by seeing it early enough to
start the load early.  The addq used in the amd64 version has the same
problem.  If possible, separate non-volatile asms for a load and the
rest should be used.  The compiler hopefully schedules these well.  Then
amd64 would need cmpxchg* too, to handle the rare case where the counter
changes between the separate asms.

Reducing latency from 4 cycles to 1-3, or not letting it increase to 20
on i386, may be of no practical importance, but I like to understand the
details and either use the perfect algorithm or a simple algorithm.
These tests indicate that the amd64 algorithm is simple but very good,
while the i386 algorithm is complicated to give negative benefits.
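As an illustration of the "separate non-volatile asm for the load" idea
(again only a sketch, not part of the test program: made-up names,
reduced to a 32-bit counter, per-CPU placement ignored, i386/amd64
only): the load sits in its own plain asm so the compiler is free to
hoist it into a spare load slot, and a cmpxchgl commit catches the rare
case where the value changed in between.

#include <stdint.h>
#include <stdio.h>

static inline void
split_counter_add(volatile uint32_t *c, uint32_t inc)
{
	uint32_t old, new, prev;

	/* Non-volatile load asm: the compiler may schedule it early. */
	__asm("movl %1,%0" : "=r" (old) : "m" (*c));
	for (;;) {
		new = old + inc;
		/*
		 * Commit with cmpxchgl: if *c still equals old, store
		 * new; otherwise the current value comes back in %eax.
		 */
		__asm __volatile("cmpxchgl %2,%1"
		    : "=a" (prev), "+m" (*c)
		    : "r" (new), "0" (old)
		    : "cc");
		if (prev == old)	/* commit succeeded */
			break;
		old = prev;		/* *c changed; retry */
	}
}

static volatile uint32_t demo_counter;

int
main(void)
{
	split_counter_add(&demo_counter, 5);
	printf("%u\n", (unsigned)demo_counter);
	return (0);
}

Whether this actually beats the single addl/addq depends on the compiler
finding an early load slot; the read-modify-write instruction gives it
no chance to.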
% uint32_t cpu_feature = 1;
%
% /*
%  * These shouldn't be volatile.  Volatile just avoids breaking the simulation
%  * by preventing optimizations that aren't possible for pcpu variables.
%  */
% typedef volatile uint64_t *counter_u64_t;
% typedef volatile uint32_t *counter_u32_t;
%
% static inline void
% counter_u64_add(counter_u64_t c, int64_t inc)
% {
%
% #if 1
% 	/* 18 cycles on A64. */
% 	if ((cpu_feature & 1) == 1) {
% 		counter_64_inc_8b(c, inc);
% 	}
% #elif 0
% 	/* 5-8 cycles on A64 (gcc-4.2.1 much slower than gcc-3.3.3). */
% 	if ((cpu_feature & 1) == 1) {
% 		*c += inc;
% 	}
% #else
% 	/* 4-7 cycles on A64. */
% 	*c += inc;
% #endif
% }
%
% static inline void
% counter_u32_add(counter_u32_t c, int32_t inc)
% {
%
% #if 1
% 	/* 5.5-7 cycles on A64. */
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% #else
% 	/* 4-7 cycles on A64. */
% 	*c += inc;
% #endif
% }
%
% struct thread
% {
% 	uint32_t td_critnest;
% } td0;
%
% volatile struct thread *td = &td0;
%
% void
% dummy_accum(void)
% {
% }
%
% static inline void
% alt_counter_u64_add(counter_u64_t c, int64_t inc)
% {
% #if 1
% 	/* 8.5 cycles on A64. */
% 	td->td_critnest++;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest--;
% #elif 1
% 	/* 7 cycles on A64. */
% 	uint32_t ocritnest;
%
% 	ocritnest = td->td_critnest;
% 	td->td_critnest = ocritnest + 1;
% 	__asm __volatile("addl %1,%%ds:%0" : "=m,m" (*c) : "?i,r" (inc));
% 	td->td_critnest = ocritnest;
% #elif 0
% 	/* Support only 32-bit non-negative increments. */
% 	/* 4-7 cycles on A64 (varies with gcc version, -O* and -static). */
% 	__asm __volatile("		\n\
% 	addl	%1,%%ds:%0		\n\
% 	jnc	8f			\n\
% 	call	dummy_accum		\n\
% 8: ;					\n\
% 	"
% 	    : "=m" (*c) : "r" ((uint32_t)inc));
% #elif 0
% 	/* As in previous, but allow immediate operand. */
% 	/* 4-7 cycles on A64 (varies with gcc version, -O* and -static). */
% 	/*
% 	 * Using the immediate operand tends to be slower!
% 	 *
% 	 * But note that we don't use the memory clobber pessimization.
% 	 * This allows the inc operand (and the loop counter) to stay
% 	 * in registers across the call here, so the register operand
% 	 * is naturally faster.  gcc should prefer the register operand
% 	 * anyway in this case, but is not that smart.  It even prefers
% 	 * the register operand when the immediate operand is disparaged.
% 	 * Sometimes the optimization (just loop alignment?) accidentally
% 	 * compensates for making the worst choice.
% 	 */
% 	/* 5 cycles on A64. */
% 	__asm __volatile("		\n\
% 	addl	%1,%%ds:%0		\n\
% 	jnc	8f			\n\
% 	call	dummy_accum		\n\
% 8: ;					\n\
% 	"
% 	    : "=m,m" (*c) : "?i,r" ((uint32_t)inc));
% #elif 0
% 	/* As in previous, but go through a register.  Neg. incr. now work. */
% 	/* 8.5-9 cycles on A64. */
% 	uint32_t exp, src;
%
% 	__asm __volatile("		\n\
% 	movl	%%ds:%4,%1		\n\
% 1:					\n\
% 	movl	%1,%2			\n\
% 	addl	%3,%2			\n\
% 	jnc	7f			\n\
% 	call	dummy_accum		\n\
% 	jmp	8f			\n\
% 7:					\n\
% 	cmpxchgl %2,%%ds:%0		\n\
% 	jne	1b			\n\
% 8: ;					\n\
% 	"
% 	    : "=m" (*c),		/* 0 */
% 	      "=a" (exp),		/* 1 */
% 	      "=r" (src)		/* 2 */
% 	    : "ir" ((uint32_t)inc),	/* 3 */ /* don't try to disparage "i" */
% 	      "m" (*c)			/* 4 */
% 	    );
% #else
% 	/* Full 64-bit version. */
% 	/* 8-9 cycles on A64. */
% 	uint32_t exp, src;
%
% 	__asm __volatile("		\n\
% 	/* Runtime testing of %4 is wasteful if inc is constant. */	\n\
% 	/* Especially with the high register pressure. */		\n\
% 	testl	%4,%4			\n\
% 	je	1f			\n\
% 	call	dummy_accum		\n\
% 	jmp	8f			\n\
% 1:					\n\
% 	movl	%%ds:%5,%1		\n\
% 1:					\n\
% 	movl	%1,%2			\n\
% 	addl	%3,%2			\n\
% 	jnc	7f			\n\
% 	call	dummy_accum		\n\
% 	jmp	8f			\n\
% 7:					\n\
% 	cmpxchgl %2,%%ds:%0		\n\
% 	jne	1b			\n\
% 8: ;					\n\
% 	"
% 	    : "=m" (*c),		/* 0 */
% 	      "=a" (exp),		/* 1 */
% 	      "=r" (src)		/* 2 */
% 	    : "r" ((uint32_t)inc),	/* 3 */
% 	      "r" ((uint32_t)(inc >> 32)),	/* 4 */
% 	      "m" (*c)			/* 5 */
% 	    );
% #endif
% }
%
% uint64_t mycounter[1];
% uint32_t my32counter[1];
%
% int
% main(void)
% {
% 	unsigned i;
%
% 	for (i = 0; i < 2010168339; i++)	/* sysctl -n machdep.tsc_freq */
% #if 0
% 		counter_u64_add(mycounter, 1);
% #elif 1
% 		alt_counter_u64_add(mycounter, 1);
% #else
% 		counter_u32_add(my32counter, 1);
% #endif
% 	printf("%ju\n", (uintmax_t)mycounter[0]);
% 	return (mycounter[0] == 2010168339 ? 0 : 1);
% }

Bruce