Date: Sat, 22 Jun 2013 18:58:15 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans <brde@optusnet.com.au>
Cc: svn-src-head@freebsd.org, svn-src-all@freebsd.org,
    Gleb Smirnoff <glebius@freebsd.org>,
    Konstantin Belousov <kib@freebsd.org>,
    src-committers@freebsd.org
Subject: Re: svn commit: r252032 - head/sys/amd64/include
Message-ID: <20130622174921.I3112@besplex.bde.org>
In-Reply-To: <20130622124832.S2347@besplex.bde.org>
References: <201306201430.r5KEU4G5049115@svn.freebsd.org> <20130621065839.J916@besplex.bde.org>
    <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org>
    <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org>
    <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org>
    <20130622124832.S2347@besplex.bde.org>
On Sat, 22 Jun 2013, I wrote:

> On Sat, 22 Jun 2013, I wrote:
>
>> ...
>> Here are considerably expanded tests, with noninline tests dropped.
>> Summary of times on Athlon64:
>>
>> simple increment:                               4-7 cycles   (1)
>> simple increment preceded by feature test:      5-8 cycles   (1)
>> simple 32-bit increment:                        4-7 cycles   (2)
>> correct 32-bit increment (addl to mem):         5.5-7 cycles (3)
>> inlined critical section:                       8.5 cycles   (4)
>> better inlined critical section:                7 cycles     (5)
>> correct unsigned 32-bit inc of 64-bit counter:  4-7 cycles   (6)
>> "improve" previous to allow immediate operand:  5+ cycles
>> correct signed 32-bit inc of 64-bit counter:    8.5-9 cycles (7)
>> correct 64-bit inc of 64-bit counter:           8-9 cycles   (8)
>> -current method (cmpxchg8b):                    18 cycles
>
> corei7 (freefall) has about the same timing as Athlon64, but core2
> (ref10-i386) is 3-4 cycles slower for the tests that use cmpxchg.
>
>> (4) The critical section method is quite fast when inlined.
>> (5) The critical section method is even faster when optimized.  This is
>>     what should be used if you don't want the complications for the
>>     daemon.
>
> Oops, I forgot that critical sections are much slower in -current than
> in my version.  They probably take 20-40 cycles for the best case, and
> can't easily be tested in userland since they disable interrupts in
> hardware.  My versions disable interrupts in software.

Re-oops, I was looking at the wrong tree.  -current is similar to my
version here.

Anyway, almost all of this is moot.  The counter fetch is inherently
non-atomic on i386, so it is pointless to do the counter update
atomically at the per-CPU level.  Losing the race from non-atomic
counter updates is very unlikely, since the race window is about 2 nsec
wide and occurs only on approximately every 4 billionth update.  Losing
the race in the fetch is a little more likely.  The latter might be
handled by a heuristic, and anything that works for the fetch works for
the update too.  It seems to be good enough to detect counters going
backwards and ignore the new value then.  That takes more space for the
record of the previous values than I like.

Non-atomic fetches from subr_counter.c:

% void
% counter_u64_zero(counter_u64_t c)
% {
% 	int i;
%
% 	for (i = 0; i < mp_ncpus; i++)
% 		*(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
% }

Non-atomic stores are more of a problem.  Races in stores can create
permanently invalid counter values.  Perhaps i386 memory ordering is
strong enough to help, but I think it doesn't help much.  It would take
a locked cmpxchg8b in the i386 update, and maybe here too, to avoid the
races, and the lock would defeat the point of pcpu counters.

%
% uint64_t
% counter_u64_fetch(counter_u64_t c)
% {
% 	uint64_t r;
% 	int i;
%
% 	r = 0;
% 	for (i = 0; i < mp_ncpus; i++)
% 		r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i);
%
% 	return (r);
% }

This code depends on 64-bit loads being atomic, and they just aren't on
32-bit arches.  The stores for the updates may go out in any order and
might be stale.  The reads here may be in any order, especially since
this is written in C.  We don't care much about stale values, but we do
want monotonically increasing values.  Negative counter increments
really shouldn't be supported, since they cause complications
everywhere.  Here they break the heuristic that a negative-going fetched
value means a lost race.

So the i386 version can simply be an "addl; adcl" to memory.  Each store
in this is atomic at the per-CPU level.
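For concreteness, here is a minimal userland sketch of that update, with
my own simplifications: a pair of 32-bit halves stands in for one per-CPU
slot (the kernel update would address the current CPU's slot through the
pcpu area instead), the function name counter_add_sketch() is invented,
and only a nonnegative 32-bit increment is handled.  main() forces a
carry so the case discussed next is visible.

#include <stdint.h>
#include <stdio.h>

static inline void
counter_add_sketch(volatile uint32_t *lo, volatile uint32_t *hi, uint32_t inc)
{
	/* "addl; adcl" to memory: each of the two stores is atomic. */
	__asm__ __volatile__(
	    "addl	%2, %0\n\t"	/* add to the low 32 bits */
	    "adcl	$0, %1"		/* fold the carry into the high 32 bits */
	    : "+m" (*lo), "+m" (*hi)
	    : "ri" (inc)
	    : "cc");
}

int
main(void)
{
	/* Two halves standing in for one per-CPU 64-bit counter slot. */
	volatile uint32_t lo = 0xffffffff, hi = 0;

	counter_add_sketch(&lo, &hi, 1);	/* forces a carry */
	printf("counter = 0x%08x%08x\n", hi, lo);
	return (0);
}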
If there is no carry, then the separate stores are equivalent to adding
separate nonnegative values, and the counter value is valid at all times.
If there is a carry, then the separate stores are equivalent to adding a
negative value followed by a larger positive value (e.g., incrementing
0x00000000ffffffff by 1 first stores a low word of 0, which reads as
adding 1 - 2^32; the adcl then adds 2^32).  The counter transiently goes
backwards, but no information is lost and the counter is eventually set
to the correct forward-going value.
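That transient backward step is exactly what a torn 64-bit fetch can
observe on i386, so the backwards-detection heuristic mentioned above
plugs in on the fetch side.  A hedged sketch, again with my own names:
the flat slots[] array and NCPU_SKETCH stand in for the real
sizeof(struct pcpu) indexing and mp_ncpus, and prev[] is the record of
previously seen values whose extra space is complained about above.

#include <stdint.h>

#define	NCPU_SKETCH	4		/* stands in for mp_ncpus */

static uint64_t prev[NCPU_SKETCH];	/* last value accepted per CPU */

uint64_t
counter_fetch_sketch(const volatile uint64_t *slots)
{
	uint64_t r, v;
	int i;

	r = 0;
	for (i = 0; i < NCPU_SKETCH; i++) {
		v = slots[i];
		/*
		 * A torn read can see the counter transiently going
		 * backwards (addl stored, adcl not yet).  Fall back to
		 * the last accepted value; a later fetch catches up once
		 * both halves have been stored.
		 */
		if (v < prev[i])
			v = prev[i];
		else
			prev[i] = v;
		r += v;
	}
	return (r);
}

Bruce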