From owner-svn-src-head@FreeBSD.ORG Tue Jun 25 02:45:45 2013 Return-Path: Delivered-To: svn-src-head@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [8.8.178.115]) by hub.freebsd.org (Postfix) with ESMTP id 9D692472; Tue, 25 Jun 2013 02:45:45 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 4A0B41D70; Tue, 25 Jun 2013 02:45:44 +0000 (UTC) Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id C1A291040E8B; Tue, 25 Jun 2013 12:45:37 +1000 (EST) Date: Tue, 25 Jun 2013 12:45:36 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov Subject: Re: svn commit: r252032 - head/sys/amd64/include In-Reply-To: <20130624170849.GH91021@kib.kiev.ua> Message-ID: <20130625102023.K899@besplex.bde.org> References: <20130621081116.E1151@besplex.bde.org> <20130621090207.F1318@besplex.bde.org> <20130621064901.GS1214@FreeBSD.org> <20130621184140.G848@besplex.bde.org> <20130621135427.GA1214@FreeBSD.org> <20130622110352.J2033@besplex.bde.org> <20130622124832.S2347@besplex.bde.org> <20130622174921.I3112@besplex.bde.org> <20130623073343.GY91021@kib.kiev.ua> <20130623181458.J2256@besplex.bde.org> <20130624170849.GH91021@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.0 cv=Q6eKePKa c=1 sm=1 a=0l9hOOMEwYoA:10 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=JzwRw_2MAAAA:8 a=gvJhbHXk4isA:10 a=49DzC_tWGvtQhEJVE94A:9 a=CjuIK1q_8ugA:10 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117 Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, Gleb Smirnoff , src-committers@FreeBSD.org, Bruce Evans X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 25 Jun 2013 02:45:45 -0000 On Mon, 24 Jun 2013, Konstantin Belousov wrote: > On Sun, Jun 23, 2013 at 07:57:57PM +1000, Bruce Evans wrote: >> The case that can't be fixed by rereading the counters is when fetching >> code runs in between the stores. If the stores are on a another CPU >> that is currently executing them, then we can keep checking that the >> counters don't change for the maximum time that any pair of stores >> take to complete. This time is quite long and difficult to determine >> (but it can be bounded by hard-disabling interrupts in the storer, and >> limited by prefetching). If the stores are on the same CPU, then we >> have no good way to detect that another thread is in the middle of >> them, or to switch back to that thread. So we currently prevent this >> case using slow methods. > > We are already explicit about the fetched value potentially not > reflecting any real moment value, but indeed, having the fetch result > off by 2^32 is not acceptable. > > We need atomic 64bit loads to fix this, which is provided by the same > cmpxchg8b instruction on the 8-byte aligned elements. Intel guarantees > that 8-byte aligned loads and stores are atomic. > > The following is the prototype for the x86. The other 64bit > architectures are handled exactly like amd64. For 32bit !x86 arches, > i.e. arm, mips32 and powerpc32, some more investigation is needed. That works of course, but is too complicated, machine-dependent and slow for me. (I just checked all the pcpu.h implementations and found that all of them except amd64 and i386 still use the quick and racy PCPU_PTR(member) += (value) for PCPU_ADD(). This has 1 sure race and 1 compiler/machine-dependent race: - PCPU_PTR() is only usable if preemption (resulting in rescheduling to run on a different CPU) is prevented in the caller. Otherwise, the pointer may become invalid immediately iafter it is initialized. - the compiler may generate a racy load-add-store operation for +=. Using PCPU_PTR() makes this a smaller problem than using a thread- local pointer. It prevents the store going to a different CPU than the load. PCPU_GET() and PCPU_SET() have the first of these problems on all arches except amd64 and i386. Even curthread wouldn't work in the SMP case if it has the default implementation of PCPU_GET(curthread). However, it has a non-default implementation on all arches except arm and mips. So even curthread seems to be broken for arm and mips.) The non-optimized avoids these problems using a critical section. I'm not sure that works. Anyway, counter.h shouldn't have to understand pcpu better than pcpu.h does in order to work. The arm implementation of atomic.h is also relevant and is especially interesting. I looked at it after mdf@ pointed out ARM_RAS_START. This seems to only be used by arm in userland. It is even larger and slower and more complicated than critical sections, so it doesn't seem to be useful for counters. In the kernel, arm has several implementations depending on machine capabilities. The ifdefs are hard to understand. On some machine, it seems to use the equivalent of a cmpxchg loop. On others, it just disables interrupts. Disabling interrupts is presumably better than a critical section, and userland has to use ARM_RAS_START for some arches because neither disabling interrupts not critical sections are available in userland. None of this care helps avoid the races in pcpu.h, since pcpu.h intentionally avoids using atomic ops. None of this care helps avoid the races in counter.h or make counter.h efficient, since counter.h uses its own methods. To reach the same quality as atomic.h, counter.h would have to copy half of the ifdefs and methods in atomic.h. The slowness that I care about is only in the counter update. Anything complicated there will be slow. My current best design: - use ordinary mutexes to protect counter fetches in non-per-CPU contexts. - use native-sized or always 32-bit counters. Counter updates are done by a single addl on i386. Fix pcpu.h on arches other than amd64 and i386 and use the same method as there. - counter fetches add the native-sized pcpu counters to 64-bit non-pcpu counters, when the native-size counters are in danger of overflowing or always, under the mutex. Transferring data uses an ordinary atomic_cmpset. To avoid ifdefs, always use u_int pcpu counters. The 64-bit non-pcpu counters can easily be changed to pcpu counters if the API is extended to support pcpu data. - run a daemon every few minutes to fetch all the counters, so that the native-sized counters are in no danger of overflowing on systems that don't run statistics programs often enough to fetch the counters to actually use. > ... > diff --git a/sys/kern/subr_counter.c b/sys/kern/subr_counter.c > index a98ed40..3222881 100644 > --- a/sys/kern/subr_counter.c > +++ b/sys/kern/subr_counter.c > ... > @@ -49,14 +51,8 @@ counter_u64_zero(counter_u64_t c) > uint64_t > counter_u64_fetch(counter_u64_t c) > { > - uint64_t r; > - int i; > > - r = 0; > - for (i = 0; i < mp_ncpus; i++) > - r += *(uint64_t *)((char *)c + sizeof(struct pcpu) * i); > - > - return (r); > + return (counter_u64_fetch_inline(c)); > } With my design: extern struct mtx counter_locks[]; extern uint64_t counters[]; uint64_t r; volatile u_int *p; u_int v; int cnum; cnum = ctonum(c); mtx_lock(&counter_locks[cnum]); /* might not need 1 per counter */ r = counters[cnum]; for (i = 0; i < mp_ncpus; i++) { p = (u_int *)((char *)c + sizeof(struct pcpu) * i); v = *p; /* don't care if it is stale */ if (v >= 0x80000000) { /* Transfer only when above critical level. */ while (atomic_cmpset_rel_int(p, v, 0) == 0) v = *p; /* still don't care if it is stale */ counters[cnum] += v; } r += v; } mtx_unlock(&counter_locks[cnum]); return (r); Mutexes give some slowness in the fetching code, but fetches eare expected to be relatively rare. I added the complication to usually avoid atomic ops at the last minute. The complication might not be worth it. Mutexes can be avoided in the usual case too if there is an atomic op to load the 64-bit counters, but I don't want to depend on that. i386 already has atomic load_acq and store_rel for 64-bit accesses, but these are not available on all arches and I consider their existence on i386 to be a bug. On i386. they use cmpxch8b if available, else disable interrupts. The store_rel one is never used, and the load_acq one is used twice to handle a tiny problem with TSC frequencies >= 2**32 Hz. You didn't even use them here. It seems to be just a micro-optimization to not use them (in the cmpxch8b case, it avoids the lock prefix for the SMP case, and in the critical section case, it allows putting the critical section around the whole loop). Bruce