Date: Fri, 21 Jun 2013 17:54:27 +0400
From: Gleb Smirnoff <glebius@FreeBSD.org>
To: Bruce Evans <brde@optusnet.com.au>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Konstantin Belousov <kib@FreeBSD.org>
Subject: Re: svn commit: r252032 - head/sys/amd64/include
Message-ID: <20130621135427.GA1214@FreeBSD.org>
In-Reply-To: <20130621184140.G848@besplex.bde.org>
References: <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org> <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org>
  Bruce,

On Fri, Jun 21, 2013 at 09:02:36PM +1000, Bruce Evans wrote:
B> Not if it is a 32-bit increment on 32-bit systems, as it should be.
B>
B> I said to use a daemon to convert small (16 or 32 bit) counters into
B> larger (32 or 64 bit) ones.  It is almost as efficient to call the
B> accumulation code directly:
B>
B> 	addl	%1,%fs:(%0)
B> 	jns	1f
B> 	call	counter_accum
B> 1:
B>
B> This calls the accumulation code when the counter goes above 0x80000000.
B> The accumulation code then uses cmpxchg8b or a critical section to
B> atomically transfer everything in the small counter to the large
B> counter.  Note that the small counter remains valid at all times unless
B> it is incremented too fast.  Code that reports the counters must also
B> use cmpxchg8b or a critical section to read them atomically before
B> summing them.  The transfer window between when the small counter
B> reaches 0x80000000 and when it overflows is very large provided large
B> increments are never made (2 increments of 0x80000000 would break it
B> immediately, but an average increment of 1 takes about a minute to
B> reach 0x80000000 even if called from a loop doing nothing except the
B> increment).  Only 32-bit increments that are not too large are
B> supported.  Negative increments are handled by a heuristic in the
B> accumulator.  E.g., an increment of -1 just works unless it causes the
B> small counter's value to go from 0 to 0xffffffff.  0xffffffff is
B> detected by the sign test.  It must be interpreted as -1 to be added
B> to the large counter, not 0xffffffff.  The full heuristic would be
B> something like:
B>
B> 	if (small_counter >= 0xf0000000) {
B> 		/* Subtract from large counter. */
B> 		/* Actually do this atomically, of course. */
B> 		large_counter += (int32_t)small_counter;
B> 		small_counter = 0;
B> 		/*
B> 		 * Add here sanity checks that the large counter never goes
B> 		 * negative.  This can only be afforded because this code
B> 		 * is rarely called (and is not inline).
B> 		 */
B> 	} else if (small_counter < 0x90000000) {
B> 		/* Add to large counter. */
B> 		large_counter += small_counter;
B> 		small_counter = 0;
B> 	} else {
B> 		panic("transfer window too small");
B> 	}
B>
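Just to be sure I read the scheme right: in plain C, with the %fs:-based
per-CPU addressing left out and every name below invented for illustration
only, the fast path plus the handoff would look roughly like this sketch:

	#include <sys/param.h>
	#include <sys/systm.h>		/* critical_enter(), critical_exit(), panic() */

	static void
	counter_accum(uint32_t *small, uint64_t *large)
	{
		uint32_t v;

		critical_enter();	/* or a cmpxchg8b loop */
		v = *small;
		if (v >= 0xf0000000u)
			*large += (int32_t)v;	/* wrapped negative increments */
		else if (v < 0x90000000u)
			*large += v;
		else
			panic("transfer window too small");
		*small = 0;
		critical_exit();
	}

	static inline void
	counter_small_add(uint32_t *small, uint64_t *large, uint32_t inc)
	{
		*small += inc;			/* the addl fast path */
		if ((int32_t)*small < 0)	/* the jns sign test */
			counter_accum(small, large);
	}

That keeps the fast path at one add plus one usually-not-taken branch, the
same as your asm.
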
B> > Entering critical section means modifying curthread, which is again
B> > a %gs based load & store. Exiting critical section has the same cost.
B> > Thus, we assume that doing a direct %gs based update on the counter
B> > is cheaper than critical_enter(); counter += foo; critical_exit();
B>
B> Critical sections actually only modify td->td_critnest, so they do
B> a single %fs-based load and 2 %ds-based modifications (it's %fs for
B> i386 and %gs for amd64).  So it's far from clear that the direct
B> update is faster than the generic i386 code.  Avoiding the generic
B> i386 code also takes an extra load, test and branch.  The generic
B> code reduces to something like the following instructions after inlining
B> the critical section:
B>
B> 	movl	%gs:CURTHREAD,%ebx
B> 	incl	TD_CRITNEST(%ebx)
B> 	movl	counter_arg,%edi
B> 	movl	inc_lo,%eax
B> 	movl	inc_hi,%edx
B> 	addl	%eax,%fs:(%edi)
B> 	adcl	%edx,%fs:4(%edi)
B> 1:
B> 	decl	TD_CRITNEST(%ebx)
B>
B> while the non-generic code for the best case when cmpxchg8b is supported is
B> something like:
B>
B> 	testl	$CPUID_CX8,cpu_feature
B> 	je	2f
B> 	movl	counter_arg,%esi
B> 	movl	ptr_to_inc,%edi		# might need more insns to init *ptr
B> 	// 8 instructions here from the asm string.
B> 	jmp	3f			# actually rearrange to jump in other case
B> 2:
B> 	// Use critical section.
B> 3:
B>
B> So using cmpxchg8b takes at least 4 more instructions than using a
B> critical section.  The incl/decl instructions on td_critnest are rather
B> slow, however.  According to my test, they take 6 cycles each.  (I
B> think they can be reduced to load; modify; store; ... store back
B> original value which should take more like 9 cycles.)  This probably
B> applies to the addl and adcl that do the work too.  12 cycles for them,
B> and just 2 more cycles (pipelined) for cmpxchg8b and all the other
B> setup instructions.
B>
B> Now, how does the performance aspect work?  Is it by cmpxchg8b causing
B> less contention than direct addl to memory?  If so, then hopefully
B> plain cmpxchg does the same.
B>
B> The non-cmpxchg version of the i386 code can probably be improved by
B> adding more instructions to it, to avoid doing the adcl to memory
B> in most cases.  Even if you support 64-bit increments, they will rarely
B> be used, so the branches for this will be predicted well:
B>
B> 	// Next 4 instructions as above:
B> 	movl	counter_arg,%edi
B> 	movl	inc_lo,%eax
B> 	movl	inc_hi,%edx
B> 	addl	%eax,%fs:(%edi)
B> 	jnc	1f			# usually no carry
B> 	adcl	%edx,%fs:4(%edi)
B> 	jmp	3f
B> 1:
B> 	testl	%edx,%edx
B> 	je	3f			# usually no carry and no high bits
B> 	addl	%edx,%fs:4(%edi)
B>
B> I wonder if cmpxchg8b is smart enough to do this automatically, and that is
B> the reason it is faster.  It could avoid writing to the high word if the
B> high word doesn't change.  Modern x86 has write buffers that help it
B> do 64-bit write accesses in 32-bit mode no slower than in 64-bit mode,
B> provided the CPU can issue store instructions fast enough.  Caches do the
B> same thing for load instructions on not-so-modern x86.  But often, the CPU
B> can't issue load/store instructions fast enough.  I suspect that the
B> addl mem; adcl 4+mem sequence is a bad way to do a 64-bit increment, and
B> load 64 bits; modify; store 64 bits is better.  cmpxchg8b can also issue
B> a 64-bit store.
B>
B> Testing gave the following results:
B> - jumping over adcl as above saves almost the full cost of an adcl
B> - load 64 bits; unconditional 64-bit add from memory to registers; store
B>   64 bits is about 25% faster than addl; adcl to memory.  Testing now on
B>   Athlon64.  addl and adcl to memory each take about 7 cycles; 14 for
B>   both; only 11 for doing the 64-bit add via registers.  Gcc is not
B>   smart about this.  It generates addl; adcl to memory.  Your asm has
B>   to do the increment in registers for technical reasons, so it does
B>   better than gcc almost (?) accidentally.  But clang is smart about
B>   this.  It generates the larger code that goes through registers even
B>   with -mi386, where I doubt that it is better (original i386 had no
B>   cache and slow I-fetch, so I think it strongly preferred fewer
B>   instructions).  Your code for the critical section case doesn't use
B>   asm, so it is slower than necessary with gcc.  When you found that
B>   simple increments were slower in userland tests, was that with gcc
B>   or clang?
B> - jumping over adcl in the second method didn't seem to work.

Maybe you are right and an implementation with a "daemon" would be more
efficient for i386. But a daemon isn't needed at all on 64-bit platforms,
so I doubt it is worth coding. Frankly speaking, we no longer consider
i386 a platform for building high-performance solutions. The need for
counter(9) in the TCP/IP stack was demonstrated on modern amd64 hardware.
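
For reference, the amd64 fast path boils down to a single %gs:-relative
addq with no critical section at all; stripped of the real per-CPU offset
arithmetic, it is roughly the following sketch (illustrative name and
simplified argument, not the literal counter.h):

	#include <sys/types.h>

	static inline void
	counter_u64_add_sketch(unsigned long off, int64_t inc)
	{
		/*
		 * "off" stands for the counter's offset inside the per-CPU
		 * area; the committed header computes it differently.
		 */
		__asm __volatile("addq %1,%%gs:(%0)"
		    : : "r" (off), "ri" (inc)
		    : "memory", "cc");
	}

The i386 question is only how closely a fallback can approach that.
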
Thus, our motivation was to make:

- a lossless and faster implementation for amd64;
- a lossless implementation for !amd64;
- if we have an option to avoid a critical section on some arch, good,
  let's avoid it.

--
Totus tuus, Glebius.