From owner-freebsd-arch@FreeBSD.ORG Tue Apr 2 14:56:04 2013
Return-Path:
Delivered-To: arch@FreeBSD.org
Received: from mx1.freebsd.org (mx1.FreeBSD.org [8.8.178.115])
	by hub.freebsd.org (Postfix) with ESMTP id 1A2C4A37;
	Tue, 2 Apr 2013 14:56:04 +0000 (UTC)
	(envelope-from luigi@onelab2.iet.unipi.it)
Received: from onelab2.iet.unipi.it (onelab2.iet.unipi.it [131.114.59.238])
	by mx1.freebsd.org (Postfix) with ESMTP id 95619895;
	Tue, 2 Apr 2013 14:56:03 +0000 (UTC)
Received: by onelab2.iet.unipi.it (Postfix, from userid 275)
	id 4DEC273027; Tue, 2 Apr 2013 16:57:22 +0200 (CEST)
Date: Tue, 2 Apr 2013 16:57:22 +0200
From: Luigi Rizzo
To: Gleb Smirnoff
Subject: Re: [CFR][CFT] counter(9): new API for faster and raceless counters
Message-ID: <20130402145722.GA9161@onelab2.iet.unipi.it>
References: <20130401115128.GZ76816@FreeBSD.org>
	<20130402013758.GA96904@onelab2.iet.unipi.it>
	<20130402141717.GK76816@glebius.int.ru>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20130402141717.GK76816@glebius.int.ru>
User-Agent: Mutt/1.5.20 (2009-06-14)
Cc: arch@FreeBSD.org, kib@FreeBSD.org
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.14
Precedence: list
List-Id: Discussion related to FreeBSD architecture
X-List-Received-Date: Tue, 02 Apr 2013 14:56:04 -0000

On Tue, Apr 02, 2013 at 06:17:17PM +0400, Gleb Smirnoff wrote:
> Luigi,
> 
> On Tue, Apr 02, 2013 at 03:37:58AM +0200, Luigi Rizzo wrote:
> L> API: ...
> L> - there are several places which use multiple counters
> L> (e.g. packet and byte counters, global and per flow/socket),
> L> so i wonder if it may help to provide a "protected" version of
> L> counter_u64_add() that requires the critical_enter/exit
> L> only once. Something like ...
> 
> Here is a patch for review.
> It adds 4 more primitives:
> 
> 	counter_enter();
> 	counter_u64_add_prot(c, x);
> 	counter_u64_subtract_prot(c, x);
> 	counter_exit();

thanks for the patch. I have three more comments:

- is it really necessary to have the "subtract" version? Couldn't
  one just make "x" an int64_t? Or does that give too many
  compile-time warnings, maybe?

- (this can be fixed later) in the i386 version, counter_enter()
  and counter_exit() have an if statement which may become quite
  expensive if mispredicted. Also, the test is repeated 3 times in
  counter_u64_add() (enter/add/exit). Hopefully the compiler
  optimizes the extra tests out, but the previous version seemed
  more readable. Anyway, at some point we should figure out whether
  putting likely/unlikely annotations on the result of
  (cpu_feature & CPUID_CX8) may improve performance where it
  matters.

- do you plan to provide an API to initialize a counter to 0 or to
  a specific value? I suppose this is done implicitly on
  allocation, but there are cases (e.g. ipfw) where the application
  explicitly zeroes counters.

> L> BENCHMARK:
> L> 
> L> > I've got a simple benchmark. A syscall that solely updates a counter
> L> > is implemented. A number of processes is spawned, equal to the number
> L> > of cores; each process binds itself to a dedicated CPU and calls the
> L> > syscall 10^6 times and exits. Parent wait(2)s for them all and I
> L> > measure real time of
> L> 
> L> - I am under the impression that these benchmarks are dominated
> L> by the syscall time, and the new counters would exhibit much
> L> better relative performance (compared to racy or atomic counters)
> L> by doing 10..100 counter ops per syscall. Any chance to retry
> L> the test in this configuration?
> 
> Ok, as you wish.
> 
> Apparently the compiler optimises away things like:
> 
> 	for (int i = 0; i < rounds; i++)
> 		the_counter += v;
> 
> To avoid that optimisation I declared the_counter as volatile. Is
> the benchmark fair after that?
> Anyway, here are results for (rounds == 2000000000):
> 
> x counter_u64_add(), result == 2000000000 * ncpus
> + racy increment, result == 2022232241 (!!!)
> +------------------------------------------------------------------------------+
> |x                                                                    +        |
> |x                                                                    +        |
> |x                                                                   ++        |
> |x                                                                   ++        |
> |x        x                                                         +++       +|
> ||_MA__|                                                           |_MA_|      |
> +------------------------------------------------------------------------------+
>     N           Min           Max        Median           Avg        Stddev
> x  10             5          5.44          5.01         5.053    0.13605963
> +  10          8.16          8.53           8.2         8.235    0.10721215
> Difference at 95.0% confidence
> 	3.182 +/- 0.115089
> 	62.9725% +/- 2.27764%
> 	(Student's t, pooled s = 0.122488)
> 
> So 63% speedup, not speaking of the fact that in such a tight loop 98% of
> parallel updates are lost on the racy counter :)
> 
> A tight loop with atomic_add() is 22 times (twenty two times) slower than
> the new counter. I didn't bother to run ministat :)

yes i think this really does justice to the improvements in the new
code (i am a bit unclear on what actual test you ran / how many
counter_u64_add() calls per syscall you have, but i do expect the racy
counters to be much slower and much less reliable, and the 20x slowdown
with atomics is completely expected.)

cheers
luigi