Date: Fri, 21 Jun 2013 21:02:36 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Gleb Smirnoff <glebius@FreeBSD.org>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Konstantin Belousov <kib@FreeBSD.org>, Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r252032 - head/sys/amd64/include
Message-ID: <20130621184140.G848@besplex.bde.org>
In-Reply-To: <20130621064901.GS1214@FreeBSD.org>
References: <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org>
On Fri, 21 Jun 2013, Gleb Smirnoff wrote:

> On Fri, Jun 21, 2013 at 09:04:34AM +1000, Bruce Evans wrote:
> B> >> The i386 version of the counter asm doesn't support the immediate
> B> >> constraint for technical reasons.  64 bit counters are too large and
> B> >> slow to use on i386, especially when they are implemented as they are
> B> >> without races.
> B> >
> B> > Actual testing showed that it is only about twice as slow as a direct
> B> > increment.  With the enclosed test program (a userland version hacked
> B> > on a bit to avoid pcpu), on ref10-i386 the times are:
> ...
> Yes, for a single threaded userland program using "+=" is faster than
> all the magic that counter(9) does.
>
> But when multiple threads need to access one counter "+=" fails both
> with precision and with performance.

No.  But I forgot about the performance aspect.

> Using "+=" upon a per-CPU counter is racy, since += is compiled into
> "load", "increment", "store" sequence and if we are not in a critical
> section, then this is racy.  We might be removed from CPU between load
> and store.

Not if it is a 32-bit increment on 32-bit systems, as it should be.  I
said to use a daemon to convert small (16 or 32 bit) counters into
larger (32 or 64 bit) ones.  It is almost as efficient to call the
accumulation code directly:

	addl	%1,%fs:(%0)
	jns	1f
	call	counter_accum
1:

This calls the accumulation code when the counter goes above 0x80000000.
The accumulation code then uses cmpxchg8b or a critical section to
atomically transfer everything in the small counter to the large
counter.  Note that the small counter remains valid at all times unless
it is incremented too fast.  Code that reports the counters must also
use cmpxchg8b or a critical section to read them atomically before
summing them.

The transfer window between when the small counter reaches 0x80000000
and when it overflows is very large provided large increments are never
made (2 increments of 0x80000000 would break it immediately, but an
average increment of 1 takes about a minute to reach 0x80000000 even if
called from a loop doing nothing except the increment).  Only 32-bit
increments that are not too large are supported.  Negative increments
are handled by a heuristic in the accumulator.  E.g., an increment of -1
just works unless it causes the small counter's value to go from 0 to
0xffffffff.  0xffffffff is detected by the sign test.  It must be
interpreted as -1 to be added to the large counter, not 0xffffffff.  The
full heuristic would be something like:

	if (small_counter >= 0xf0000000) {
		/* Subtract from large counter. */
		/* Actually do this atomically, of course. */
		large_counter += (int32_t)small_counter;
		small_counter = 0;
		/*
		 * Add here sanity checks that the large counter never
		 * goes negative.  This can only be afforded because
		 * this code is rarely called (and is not inline).
		 */
	} else if (small_counter < 0x90000000) {
		/* Add to large counter. */
		large_counter += small_counter;
		small_counter = 0;
	} else {
		panic("transfer window too small");
	}

> Entering critical section means modifying curthread, which is again
> a %gs based load & store.  Exiting critical section has the same cost.
> Thus, we assume that doing a direct %gs based update on the counter
> is cheaper than critical_enter(); counter += foo; critical_exit();

Critical sections actually only modify td->td_critnest, so they do a
single %fs-based load and 2 %ds-based modifications (it's %fs for i386
and %gs for amd64).  So it's far from clear that the direct update is
faster than the generic i386 code.
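To make the small/large counter scheme above concrete, here is a minimal
userland C sketch of it.  The names struct split_counter and counter_incr
are invented for illustration; the real version would increment a per-CPU
%fs-relative small counter and would do the transfer (and the reads)
atomically with cmpxchg8b or a critical section:

	#include <stdint.h>

	struct split_counter {
		uint32_t	small;	/* fast part, incremented directly */
		uint64_t	large;	/* wide part, updated rarely */
	};

	/* Rarely called; applies the sign heuristic described above. */
	static void
	counter_accum(struct split_counter *c)
	{
		uint32_t s = c->small;

		if (s >= 0xf0000000) {
			/* Wrapped past 0 by a small negative increment. */
			c->large += (int32_t)s;
		} else if (s < 0x90000000) {
			c->large += s;
		} else {
			/* panic("transfer window too small") in a kernel. */
		}
		c->small = 0;
	}

	/* Common path: one 32-bit add plus a sign test (the "jns" above). */
	static inline void
	counter_incr(struct split_counter *c, uint32_t inc)
	{
		c->small += inc;
		if ((int32_t)c->small < 0)
			counter_accum(c);
	}

A reader would sum the large parts plus the small parts over all CPUs,
using the same atomic access and the same sign interpretation for the
small part.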
Avoiding the generic i386 code also takes an extra load, test and branch.
The generic code reduces to something like the following instructions
after inlining the critical section:

	movl	%gs:CURTHREAD,%ebx
	incl	TD_CRITNEST(%ebx)
	movl	counter_arg,%edi
	movl	inc_lo,%eax
	movl	inc_hi,%edx
	addl	%eax,%fs:(%edi)
	adcl	%edx,%fs:4(%edi)
1:
	decl	TD_CRITNEST(%ebx)

while the non-generic code for the best case when cmpxchg8b is supported
is something like:

	testl	$CPUID_CX8,cpu_feature
	je	2f
	movl	counter_arg,%esi
	movl	ptr_to_inc,%edi	# might need more insns to init *ptr
	// 8 instructions here from the asm string.
	jmp	3f		# actually rearrange to jump in other case
2:
	// Use critical section.
3:

So using cmpxchg8b takes at least 4 more instructions than using a
critical section.

The incl/decl instructions on td_critnest are rather slow, however.
According to my test, they take 6 cycles each.  (I think they can be
reduced to load; modify; store; ... store back original value, which
should take more like 9 cycles.)  This probably applies to the addl and
adcl that do the work too.  12 cycles for them, and just 2 more cycles
(pipelined) for cmpxchg8b and all the other setup instructions.

Now, how does the performance aspect work?  Is it by cmpxchg8b causing
less contention than direct addl to memory?  If so, then hopefully plain
cmpxchg does the same.

The non-cmpxchg version of the i386 code can probably be improved by
adding more instructions to it, to avoid doing the adcl to memory in
most cases.  Even if you support 64-bit increments, they will rarely be
used, so the branches for this will be predicted well:

	// Next 4 instructions as above:
	movl	counter_arg,%edi
	movl	inc_lo,%eax
	movl	inc_hi,%edx
	addl	%eax,%fs:(%edi)
	jnc	1f		# usually no carry
	adcl	%edx,%fs:4(%edi)
	jmp	3f
1:
	testl	%edx,%edx
	je	3f		# usually no carry and no high bits
	addl	%edx,%fs:4(%edi)

I wonder if cmpxchg8b is smart enough to do this automatically, and that
is the reason it is faster.  It could avoid writing to the high word if
the high word doesn't change.

Modern x86 has write buffers that help it do 64-bit write accesses in
32-bit mode no slower than in 64-bit mode, provided the CPU can issue
store instructions fast enough.  Caches do the same thing for load
instructions on not-so-modern x86.  But often, the CPU can't issue
load/store instructions fast enough.  I suspect that the addl mem; adcl
4+mem sequence is a bad way to do a 64-bit increment, and load 64 bits;
modify; store 64 bits is better.  cmpxchg8b can also issue a 64-bit
store.
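As a rough userland illustration of those two increment styles (a sketch
only: gcc-style inline asm on an ordinary pointer rather than a %fs-based
per-CPU address, with invented function names, not the counter(9) code
itself):

	#include <stdint.h>

	/*
	 * addl/adcl straight to memory, branching over the adcl when
	 * there is no carry and no high word to add (the common case).
	 */
	static inline void
	inc64_skip_adcl(uint64_t *p, uint64_t v)
	{
		uint32_t *q = (uint32_t *)p;	/* q[0] low, q[1] high */

		__asm__ __volatile__(
		    "addl	%2,%0\n\t"
		    "jnc	1f\n\t"		/* usually no carry */
		    "adcl	%3,%1\n\t"
		    "jmp	2f\n"
		    "1:\n\t"
		    "testl	%3,%3\n\t"
		    "je	2f\n\t"			/* and no high bits */
		    "addl	%3,%1\n"
		    "2:"
		    : "+m" (q[0]), "+m" (q[1])
		    : "r" ((uint32_t)v), "r" ((uint32_t)(v >> 32))
		    : "cc");
	}

	/*
	 * Load 64 bits, add in registers, store 64 bits; this is the
	 * variant that beat addl; adcl to memory in the timing below.
	 */
	static inline void
	inc64_via_regs(uint64_t *p, uint64_t v)
	{
		uint64_t t;

		t = *p;		/* two 32-bit loads on i386 */
		t += v;		/* 64-bit add in registers */
		*p = t;		/* two 32-bit stores */
	}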
Testing gave the following results:
- jumping over adcl as above saves almost the full cost of an adcl
- load 64 bits; unconditional 64-bit add from memory to registers; store
  64 bits is about 25% faster than addl; adcl to memory.  Testing now on
  Athlon64.  addl and adcl to memory each take about 7 cycles; 14 for
  both; only 11 for doing the 64-bit add via registers.  Gcc is not
  smart about this.  It generates addl; adcl to memory.  Your asm has to
  do the increment in registers for technical reasons, so it does better
  than gcc almost (?) accidentally.  But clang is smart about this.  It
  generates the larger code that goes through registers even with
  -mi386, where I doubt that it is better (the original i386 had no
  cache and slow I-fetch, so I think it strongly preferred fewer
  instructions).  Your code for the critical section case doesn't use
  asm, so it is slower than necessary with gcc.  When you found that
  simple increments were slower in userland tests, was that with gcc or
  clang?
- jumping over adcl in the second method didn't seem to work.

Bruce