Date: Mon, 1 Apr 2013 15:51:28 +0400
From: Gleb Smirnoff <glebius@FreeBSD.org>
To: arch@FreeBSD.org
Subject: [CFR][CFT] counter(9): new API for faster and raceless counters
Message-ID: <20130401115128.GZ76816@FreeBSD.org>
Hi!

Together with Konstantin Belousov (kib@) we have developed a new API, initially intended for (but not limited to) collecting statistical data in the kernel.

Right now a statistical counter in the kernel that is not protected by a lock is updated either with the plain '+=' operator or via an atomic(9) operation. The first case is prone to data loss due to simultaneous updates by more than one CPU. The second case does not lose data, but locks the memory bus to establish the atomic operation, effectively blocking other CPUs. Both cases invalidate the cache line associated with the address being updated, provoking a cache miss when another CPU updates the counter.

These cache misses hurt most when multiple threads running on multiple CPUs access mostly their own private data sets, and a counter is the only write-shared data between them. An example of such a situation is packet forwarding, where struct ipstat and the interface byte/packet counters are the data experiencing excessive cache misses. This was found by Alexander Chernikov (melifaro@) and Andrey Elsukov (ae@) at Yandex. They report that removing the stat updates raises packet-forwarding benchmark results significantly.

The decision is to make statistical counters private to CPUs. To achieve that, the following steps were taken:

o uma(9) got support for UMA_ZONE_PCPU zones. A slab in a UMA_ZONE_PCPU zone has mp_ncpus slabs following it in virtual memory, so that each CPU has its own slab. Allocation from such a zone gives the caller the base allocation address; the private address for a CPU can be calculated as base address + curcpu * slab size. The slab size in UMA_ZONE_PCPU zones is sizeof(struct pcpu). This sizing is required for the optimisation described below. Access to the private member is coded in the tiny inline zpcpu_get(). Note that UMA_ZONE_PCPU zones aren't dedicated to counter(9); they can be utilized for other purposes. You may use them for any kind of dynamic per-CPU data you need (see the sketch after the benchmark description below).

o Tiny API for counter(9):

	counter_u64_t	counter_u64_alloc(int wait);
	void		counter_u64_free(counter_u64_t cnt);
	void		counter_u64_add(counter_u64_t cnt, uint64_t inc);
	uint64_t	counter_u64_fetch(counter_u64_t cnt);

o The MI implementation of counter_u64_add() is:

	critical_enter();
	*(uint64_t *)zpcpu_get(c) += inc;
	critical_exit();

o Since the distance between members has this special sizing, on 64-bit architectures that allow base+offset addressing, namely amd64, the update can be performed in a single lockless instruction:

	__asm __volatile("addq\t%1,%%gs:(%0)"
	    :
	    : "r" ((char *)c - (char *)&__pcpu[0]), "r" (inc)
	    : "memory", "cc");

  Here we exploit the fact that %gs always points to the current CPU's static per-CPU data, and that the alignment of the static per-CPU structures matches the alignment of slabs in a UMA_ZONE_PCPU zone. The author of this idea is Konstantin (kib@).

I've put together a simple benchmark: a syscall that solely updates a counter. A number of processes equal to the number of cores is spawned; each process binds itself to a dedicated CPU, calls the syscall 10^6 times and exits. The parent wait(2)s for them all, and I measure the real time of the parent. (A rough sketch of a counter(9) consumer, essentially what the syscall body does, also follows below.)
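To give an idea of the generic UMA_ZONE_PCPU use mentioned above, here is a rough sketch, not code from the branch: the foo_* names, the zone name, the item size passed to uma_zcreate() and the header choices are illustrative assumptions; only UMA_ZONE_PCPU, zpcpu_get() and the base + cpu * sizeof(struct pcpu) layout come from the description above.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/malloc.h>
	#include <sys/pcpu.h>
	#include <sys/smp.h>
	#include <vm/uma.h>

	static uma_zone_t foo_zone;
	static uint64_t *foo_base;	/* base address of the per-CPU copies */

	static void
	foo_init(void)
	{

		/* Item is a single uint64_t; the zone lays out one copy per CPU. */
		foo_zone = uma_zcreate("foo pcpu", sizeof(uint64_t),
		    NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_PCPU);
		foo_base = uma_zalloc(foo_zone, M_WAITOK | M_ZERO);
	}

	static void
	foo_bump(void)
	{

		/* Pin to the current CPU and touch only its private copy. */
		critical_enter();
		*(uint64_t *)zpcpu_get(foo_base) += 1;
		critical_exit();
	}

	static uint64_t
	foo_total(void)
	{
		uint64_t sum;
		int cpu;

		/*
		 * Each CPU's copy lives at base + cpu * sizeof(struct pcpu).
		 * An unlocked read of other CPUs' copies is fine for statistics.
		 */
		sum = 0;
		CPU_FOREACH(cpu)
			sum += *(uint64_t *)((char *)foo_base +
			    cpu * sizeof(struct pcpu));
		return (sum);
	}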
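And for reference, a minimal sketch of a counter(9) consumer, roughly what the benchmark syscall boils down to. The foo_* names are invented, and I assume the wait argument takes the usual M_WAITOK/M_NOWAIT malloc flags; the four counter_u64_*() calls are the API listed above.

	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/counter.h>
	#include <sys/malloc.h>

	static counter_u64_t foo_pkts;

	static void
	foo_attach(void)
	{

		foo_pkts = counter_u64_alloc(M_WAITOK);
	}

	static void
	foo_detach(void)
	{

		counter_u64_free(foo_pkts);
	}

	/* Hot path: per-CPU add, no lock and no bus-locked instruction. */
	static void
	foo_input(void)
	{

		counter_u64_add(foo_pkts, 1);
	}

	/* Slow path: fold all CPUs' private values into a single number. */
	static uint64_t
	foo_stats(void)
	{

		return (counter_u64_fetch(foo_pkts));
	}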
Here are results for my 4-core Xeon E5620:

x new counter(9), resulting counter == mp_ncpus * 10^6
+ racy '+=', resulting counter ~25% less than expected
* atomic(9), resulting counter == mp_ncpus * 10^6

[ministat distribution plot elided]
    N           Min           Max        Median           Avg        Stddev
x  10         17.65         17.68         17.66        17.665  0.0097182532
+  10         18.53         18.56         18.55        18.543   0.011595018
Difference at 95.0% confidence
	0.878 +/- 0.0100517
	4.97028% +/- 0.0569016%
	(Student's t, pooled s = 0.0106979)
*  10         34.15          34.2         34.16        34.171   0.020789955
Difference at 95.0% confidence
	16.506 +/- 0.0152473
	93.439% +/- 0.0863138%
	(Student's t, pooled s = 0.0162275)

So we get a 5% speedup compared to the racy counter and a 93% speedup compared to atomic(9).

If binding of processes to CPUs is omitted, the results are:

x new counter(9), resulting counter == mp_ncpus * 10^6
+ racy '+=', resulting counter ~20% less than expected
* atomic(9), resulting counter == mp_ncpus * 10^6

[ministat distribution plot elided]
    N           Min           Max        Median           Avg        Stddev
x  20         19.04         23.94         21.83       21.3555      1.372836
+  20         20.69         27.33         25.44        24.902     1.4900957
Difference at 95.0% confidence
	3.5465 +/- 0.916971
	16.607% +/- 4.29384%
	(Student's t, pooled s = 1.43267)
*  10         31.97         33.77         32.85        32.819    0.63079227
Difference at 95.0% confidence
	11.4635 +/- 0.940783
	53.6794% +/- 4.40534%
	(Student's t, pooled s = 1.18608)

So we get a 16.6% speedup compared to the racy counter and a 53.7% speedup compared to atomic(9). (Note that the dispersion of results is much higher when binding is omitted. Imperfect scheduling in ULE?)

Results may vary depending on the number of CPUs and the cache architecture, not to mention the actual usage pattern of counter(9). Andrey (ae@) benchmarked the RX side of the networking stack, with the IP and TCP statistics converted to counter(9). He reports that although the maximum pps didn't increase measurably, CPU utilisation dropped from 94% to 45%. Probably something else in the stack prevented receiving more pps, despite the extra CPU cycles available.

The code is in a subversion branch:

http://svn.freebsd.org/base/projects/counters/

If there are no objections during this week, I'd like to merge the code to head.

-- 
Totus tuus, Glebius.