Date: Wed, 26 Jun 2013 11:42:39 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, Gleb Smirnoff <glebius@FreeBSD.org>, src-committers@FreeBSD.org, Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r252032 - head/sys/amd64/include
Message-ID: <20130626092955.B891@besplex.bde.org>
In-Reply-To: <20130625205826.GM91021@kib.kiev.ua>
References: <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org> <20130622174921.I3112@besplex.bde.org> <20130623073343.GY91021@kib.kiev.ua> <20130623181458.J2256@besplex.bde.org> <20130624170849.GH91021@kib.kiev.ua> <20130625102023.K899@besplex.bde.org> <20130625062039.GJ91021@kib.kiev.ua> <20130625190352.P986@besplex.bde.org> <20130625205826.GM91021@kib.kiev.ua>
On Tue, 25 Jun 2013, Konstantin Belousov wrote:

> On Tue, Jun 25, 2013 at 08:14:41PM +1000, Bruce Evans wrote:
>> On Tue, 25 Jun 2013, Konstantin Belousov wrote:
>>
>>> Updates to the counter cannot be done from the interrupt context.
>>
>> This is fragile, however. It prevents using counters for things like
>> counting interrupts. Most interrupt counting is now done directly
>> and doesn't use PCPU_INC(). i386/i386 has just 1 use of PCPU_INC().
>> It is to count traps in machdep.c. Traps include nonmaskable
>> interrupts. Even hard-disabling of interrupts doesn't prevent these.
>> ...
> Traps are not performance critical in the sense that there is no need to count
> up to 1-10G traps per second.

The trap count increment is not a performance problem, but a correctness
one. Disabling interrupts for a multi-word increment doesn't work for
NMIs, even in the !SMP case or for pcpu accesses with SMP.

> Anyway, as Gleb said, there is no point in
> optimizing the i386 kernel.

I said that there is every point in optimizing the i386 kernel. This
applies even more to other 32-bit arches. Some CPUs are much slower than
modern x86's. They shouldn't be slowed down more by inefficient KPIs.

>>> The fetch problem of sometimes missing the carry bit is present on all
>>> 32bit arches now, I think.
>>
>> Also the zeroing problem. Since there are no atomic ops in sight, zeroing
>> has several races. My transferring code uses atomic ops on only 1 CPU,
>> so it has half of these races. See below.
> Right, zeroing should also use atomic 64bit writes on i386. Updated
> patch is at end.

It's getting even more complicated and further from what I want. It also
doesn't work. Disabling interrupts doesn't work for the SMP case. The
zeroing needs to run in a per-CPU daemon.

>>>> The arm implementation of atomic.h is also relevant and is especially
>>>> interesting. I looked at it after mdf@ pointed out ARM_RAS_START.
>>>> This seems to only be used by arm in userland. It is even larger and
>>>> slower and more complicated than critical sections, so it doesn't seem
>>>> to be useful for counters. In the kernel, arm has several
>>>> implementations depending on machine capabilities. The ifdefs are
>>>> hard to understand. On some machines, it seems to use the equivalent
>>>> of a cmpxchg loop. On others, it just disables interrupts. Disabling
>>>> interrupts is presumably better than a critical section, and userland
>>>> has to use ARM_RAS_START for some arches because neither disabling
>>>> interrupts nor critical sections are available in userland. None
>>>> of this care helps avoid the races in pcpu.h, since pcpu.h intentionally
>>>> avoids using atomic ops. None of this care helps avoid the races in
>>>> counter.h or make counter.h efficient, since counter.h uses its own
>>>> methods. To reach the same quality as atomic.h, counter.h would have
>>>> to copy half of the ifdefs and methods in atomic.h.
>>> BTW, why do you claim that disabling interrupts is a faster and better method
>>> to disable preemption than a critical section?
>>
>> For arm, it is because if the critical section method were better, then
>> arm should use it for atomic ops too.
> Of course not. Atomics must be atomic WRT the interrupts as well.
> Using critical sections only provides atomicity against preemption
> but not against interruption.

Indeed, atomics would be fragile if they couldn't be used in interrupt
handlers. So in a non-optimized implementation it is safe to disable
interrupts in hardware. An optimized solution would accept the fragility
like the counter increment does, and use special atomic ops that do more
for the few variables that are accessed by interrupt handlers.
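To make that concrete, here is a minimal sketch (hypothetical names, i386
!SMP flavour, not from any patch) of the difference between the two kinds
of update:

        /*
         * Hedged sketch only.  The ordinary update is protected against
         * preemption but not interruption; the "special" variant for the
         * few counters touched by interrupt handlers also masks hardware
         * interrupts, which still does nothing for NMIs.
         */
        static inline void
        counter_u64_add_norm(uint64_t *p, int64_t inc)
        {

                critical_enter();
                *p += inc;      /* 2 32-bit ops on i386; torn by interrupts */
                critical_exit();
        }

        static inline void
        counter_u64_add_intrsafe(uint64_t *p, int64_t inc)
        {
                register_t s;

                s = intr_disable();
                *p += inc;
                intr_restore(s);
        }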
> The arm looks quite deficient anyway. From the arm/include/atomic.h,
> it seems that the arch did not have (usable) atomics until v6. And
> ARM does not support SMP in the stock HEAD still.

Disabling interrupts only gives atomicity for non-SMP. Thus the cases
where this method is used indicate an arch that doesn't support SMP.
More interestingly for the current discussion, arm and several other
arches apparently don't have interrupt-safe load-modify-store instructions
even for register-sized memory variables. Thus they need to either disable
interrupts or use a cmpxchg-type instruction even to add to register-sized
variables.

>> % static inline void
>> % counter_u64_add(counter_u64_t c, int64_t inc)
>> % {
>> %
>> % #if 1
>> % 	/* 20 cycles on A64. */
>> % 	/* 31 cycles on A64 with lock prefix. */
>> % 	/* 35 cycles on A64 with mfence prefix. */
>> % 	if ((cpu_feature & 1) == 1) {
>> % 		counter_64_inc_8b(c, inc);
>> % 	}
>>
>> Old tests said that this takes 18 cycles. It now takes 20.
>>
>> Adding a lock or mfence prefix to the cmpxchg8b gives the times stated
>> above. This shows how bad using cmpxchg8b is. We could use atomic ops
>> for everything and only be 50% slower. But we shouldn't use either
>> cmpxchg8b or atomic ops. We should use 32-bit addl, which takes about
>> 4 cycles.
> Use of cmpxchg8b provides the soundness to the algorithm, by using
> atomic 64 bit loads and stores.

So does using a correct algorithm.

> As I stated, I am not going to spend time microoptimizing for i386.
> Current implementation is good enough.

I disagree. I'm not really trying to micro-optimize it, but to avoid
pessimizing it. To replace PCPU_ADD on register-sized variables, you need
to be at least as efficient as well as more correct. Arches with slower
CPUs benefit even more from not being pessimized by MI KPIs.

> ...
>>> I will add the missed MD bits to the current patch for i386, leaving
>>> the known issue for fetches on other 32bit arches for now.
>>
>> Also leaving the known issue of zeroing being broken for all arches.
>> Using a daemon is a good way to fix this too. You have to run the
>> daemon on each CPU and wait for it to complete whenever a counter
>> is zeroed.
> Let me restate. If somebody wants to provide a complete working
> implementation for i386 which is more optimized than cmpxchg8b, I
> have no objections. I will not spend time on this, though, since I
> consider such activity quite useless. And, I should note that e.g. the
> daemon-based implementation must be arch-specific, to not pessimise
> amd64.

No, the daemon runs rarely, so its efficiency doesn't matter.

> [Skipped the test program]
> I retested several cases on a SandyBridge i7 machine. There, the cmpxchg8b
> version without barrier or lock gives 12 cycles, the incomplete
> emulation of the critical section is executed in 7 cycles, and cli/sti
> in 24.

cli/sti seems to be slow on SandyBridge. Not too surprising. I checked
the old Athlon manual for timing. It confirms what I measured:

    cli:     4 cycles
    sti:     4 cycles  (so using cli/sti gives 12)
    pushfl:  4 cycles
    popfl:  15 cycles  (so this is too slow to use; using it gave 19)

freefall (corei7) has much the same timing as you measured for the others.
It takes another cycle (8 altogether) for the complete emulation of the
critical section.

But a simple register-sized add-to-memory with daemon support is better.
It takes 4 cycles on modern and not-so-modern x86. It also gives less
register pressure on 32-bit systems. It also allows expansion or shrinking
to any size of counter (larger than the register or int size) without
changing the MD part of the KPI, with small modifications to the daemon
and fetching code.
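For concreteness, a minimal sketch of such an add (hypothetical name;
assumes the same %fs-based zpcpu addressing that counter_64_inc_8b() uses,
and a daemon that folds the 32-bit pcpu words into the 64-bit totals
before they can wrap):

        static inline void
        counter_u32_add(counter_u64_t c, int32_t inc)
        {

                /*
                 * Single addl to the current CPU's 32-bit counter word,
                 * via the %fs-based pcpu segment.  This is the ~4-cycle
                 * add mentioned above; no lock prefix, no cmpxchg8b.
                 */
                __asm __volatile("addl	%1,%%fs:(%0)"
                    :
                    : "r" ((char *)c - (char *)&__pcpu[0]), "ri" (inc)
                    : "memory", "cc");
        }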
> diff --git a/sys/amd64/include/counter.h b/sys/amd64/include/counter.h
> index b37a4b8..95e0c58 100644
> --- a/sys/amd64/include/counter.h
> +++ b/sys/amd64/include/counter.h
> @@ -36,6 +36,37 @@ extern struct pcpu __pcpu[1];
> + ...
> +static inline void
> +counter_u64_zero_inline(counter_u64_t p)
> +{
> +	int i;
> +
> +	for (i = 0; i < mp_ncpus; i++)
> +		*(uint64_t *)((char *)p + sizeof(struct pcpu) * i) = 0;
> +}
> +#endif
> +
>  #define	counter_u64_add_protected(c, i)	counter_u64_add(c, i)

This is the same as the MI zeroing that it replaces, so it has the same
bugs:

- it stores to pcpu memory from a non-pcpu context (unless it is called
  from a pcpu daemon). Since nothing uses locked instructions, the other
  CPUs might not see the changes here. When the changes are made
  concurrently, there is a race to decide the final value. It may be the
  value stored here, or the old value plus an increment stored in pcpu
  context. This gives an error of up to 2**64-1.

- it assumes that the compiler generates 64-bit accesses for 64-bit
  stores.

Implement and use the -fperverse-stores option to check for such bugs :-).
It does things like storing multiple times with various sizes and orders
and many misaligned sub-objects. counter_u64_read_one_8b() on amd64
depends on -fno-perverse-loads.

> diff --git a/sys/i386/include/counter.h b/sys/i386/include/counter.h
> index 3e93b36..de90d85 100644
> --- a/sys/i386/include/counter.h
> +++ b/sys/i386/include/counter.h
> @@ -67,6 +67,85 @@ counter_64_inc_8b(uint64_t *p, int64_t inc)
>  	    : "memory", "cc", "eax", "edx", "ebx", "ecx");
>  }
>
> +#ifdef IN_SUBR_COUNTER_C
> +static inline uint64_t
> +counter_u64_read_one_8b(uint64_t *p)
> +{
> +	uint32_t res_lo, res_high;
> +
> +	__asm __volatile(
> +	"movl	%%eax,%%ebx\n\t"
> +	"movl	%%edx,%%ecx\n\t"
> +	"cmpxchg8b	(%2)"
> +	: "=a" (res_lo), "=d"(res_high)
> +	: "SD" (p)
> +	: "cc", "ebx", "ecx");
> +	return (res_lo + ((uint64_t)res_high << 32));
> +}

This is sufficiently MD, so it works with -fperverse-loads. We accept the
minor races that it gives. Locking for read is nonsense in general since
values may become stale immediately after they are read. But we especially
don't care about reading stale values for counters. I think x86 memory
ordering guarantees that the stale values don't go backwards by becoming
staler. We care about them having this property.

> +static inline uint64_t
> +counter_u64_fetch_inline(uint64_t *p)
> +{
> +	uint64_t res;
> +	int i;
> +
> +	res = 0;
> +	if ((cpu_feature & CPUID_CX8) == 0) {
> +		/*
> +		 * The machines without cmpxchg8b are not SMP.

Should assert this somewhere (one way is sketched below).

> +		 * Disabling the preemption provides atomicity of the
> +		 * counter reading, since update is done in the
> +		 * critical section as well.
> +		 */
> +		critical_enter();
> +		for (i = 0; i < mp_ncpus; i++) {
> +			res += *(uint64_t *)((char *)p +
> +			    sizeof(struct pcpu) * i);
> +		}
> +		critical_exit();

Seems OK. Works with -fperverse-loads.
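A hedged sketch of such an assertion (wording invented here, not part of
the patch):

        /* A machine without cmpxchg8b must be a UP machine. */
        KASSERT(mp_ncpus == 1 || (cpu_feature & CPUID_CX8) != 0,
            ("counter fetch: SMP machine without cmpxchg8b"));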
> +static inline void
> +counter_u64_zero_one_8b(uint64_t *p)
> +{
> +
> +	__asm __volatile(
> +	"movl	(%0),%%eax\n\t"
> +	"movl	4(%0),%%edx\n"
> +	"xorl	%%ebx,%%ebx\n\t"
> +	"xorl	%%ecx,%%ecx\n\t"
> +"1:\n\t"
> +	"cmpxchg8b	(%0)\n\t"
> +	"jnz	1b"
> +	:
> +	: "SD" (p)
> +	: "memory", "cc", "eax", "edx", "ebx", "ecx");
> +}

Better than for amd64. It works with -fperverse-stores, but has the race
accessing pcpu data from another CPU.

> +
> +static inline void
> +counter_u64_zero_inline(counter_u64_t c)
> +{
> +	int i;
> +
> +	if ((cpu_feature & CPUID_CX8) == 0) {
> +		critical_enter();
> +		for (i = 0; i < mp_ncpus; i++)
> +			*(uint64_t *)((char *)c + sizeof(struct pcpu) * i) = 0;
> +		critical_exit();

Works for the !SMP case, as for fetches.

> +	} else {
> +		for (i = 0; i < mp_ncpus; i++)
> +			counter_u64_zero_one_8b((uint64_t *)((char *)c +
> +			    sizeof(struct pcpu) * i));
> +	}

Inherits the races in counter_u64_zero_one_8b(). This should also try
harder to set all the counters to 0 at the same time. Loop a few times if
necessary. Not very important, since the result of losing this race is
much the same as reading a stale counter.

But there is one case that matters, and is not really a race: suppose the
first counter increment after zeroing, or even the first counter increment
after initialization, is a decrement, say by 1. Suppose that all the other
races are fixed, so that the increment occurs strictly after the zeros.
Then it gives a counter value of 0xffffffffffffffff.

As usual, the best fix is to not use unsigned types for anything. Let the
decrement give the value -1. Anything reporting this value should use a
signed format and be less surprised by it being -1 than by it being
0xffffffffffffffff or 18446744073709551615.

Another fix is to add a "small" bias to counters. Instead of zeroing, set
each pcpu counter to the "small" value 0x10000000000000000 divided by the
number of CPUs. This should prevent each counter overflowing on
decrements. After adding up the counters, convert results that are less
than the combined bias to 0.

> diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c
> ...

The full MD fix will be much larger. Critical sections only work for the
!SMP case. Loads and stores need something like x86 cmpxchg8b and that
might not exist on arches that support SMP. It is easiest to use atomic
ops for the loads and stores. Efficiency is unimportant for these accesses
since they are rare. By using the arch's atomic ops, you get the best that
it can do. All arches that support SMP must have atomic ops for 32-bit
loads and stores, but not necessarily for 64-bit loads and stores. This is
another reason not to use 64-bit counters in the MI part of the KPI.

I already gave an implementation (that doesn't quite work, since it has
the same pcpu race as above) using only standard KPIs, with minor updates
to it:

% 	struct mtx *mp;
% 	uint64_t *ap;
% 	uint64_t r;
% 	volatile u_int *p;
% 	u_int v;
% 	int i;
%
% 	mp = ctomtxp(c);	/* might not need 1 per counter */
% 	ap = ctoap(c);		/* accumulation counter for c */

The accumulation counter is locked by the counter mutex.

% 	mtx_lock(mp);
% 	r = *ap;
% 	for (i = 0; i < mp_ncpus; i++) {
% 		p = (u_int *)((char *)c + sizeof(struct pcpu) * i);
% 		v = *p;		/* don't care if it is stale */

Buggy. Broken with -fperverse-loads. This needs to be a MD access in asm,
like you did above, but since it is only 32 bits, it can be a single load
instruction, something like the sketch below.
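A minimal sketch of that kind of access (hypothetical name; the point is
that a single movl cannot be split or resized by the compiler):

        static inline u_int
        counter_u32_load_one(volatile u_int *p)
        {
                u_int v;

                /* One 32-bit load; stale values are acceptable here. */
                __asm __volatile("movl	%1,%0" : "=r" (v) : "m" (*p));
                return (v);
        }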
(BTW, the i386 implementation of atomic.h is similarly broken with
-fperverse-loads in the !SMP case. It uses asm in the SMP cases, but in
the !SMP case it lets the compiler decide the load instruction[s]. It
also uses a volatile pointer.) Volatile should stop -fperverse-loads from
doing anything, but there are no guarantees. I don't like volatile, since
using it mainly hides the bugs that -fperverse-* would find. Here only
suitable MD accesses are guaranteed to work. I depend on the cmpset having
a memory clobber to tell the compiler to reload *p later.

The easiest fix is to use atomic_load_int(). This is MI and the best
currently available, but unnecessarily slow. Not extremely slow, since
this code is rarely executed. Easier than adding pcpu_atomic_load_int()
to all arches. PCPU_GET() has an inaccessible KPI, and is known to be
buggy.

Except for the above bugs, read accesses to *p are "locked" by it being
pcpu and by non-pcpu readers like here not caring about its exact value.

% 		if (v >= 0x80000000) {
% 			/* Transfer only when above critical level. */
% 			while (atomic_cmpset_rel_int(p, v, 0) == 0)
% 				v = *p;	/* still don't care if it is stale */

Write accesses to *p need more careful locking. This was supposed to be
locked by the atomic op combined with *p being pcpu. But that doesn't
work, since this is a non-pcpu access and the pcpu accesses for increments
don't use atomic ops. To work, the transfer needs to be done in pcpu
context by the daemon. Then it doesn't need atomic ops, but using the same
atomic cmpset as above is convenient and not too slow since it rarely
executes. It saves having to write pcpu_atomic_cmpset_int() for all
arches.

% 			*ap += v;
% 		}
% 		r += v;
% 	}
% 	mtx_unlock(mp);
% 	return (r);

Zeroing is similar to transferring. The daemon will have to be kicked so
that it runs on each CPU and transfers all counters. This sets all pcpu
counters to 0. The settings are transient and the pcpu counters may become
nonzero before completion, but usually they won't become very far from 0.
Wait for the daemon to complete. Clear the top-level counter (now the API
only supports clearing 1) while holding the mutex. Add up all the pcpu
counters for the counter being cleared normally, and repeat until the full
count is close enough to zero.

Bruce