From: "Alexander V. Chernikov" <melifaro@FreeBSD.org>
Date: Fri, 20 Jul 2012 18:57:53 +0400
To: arch@freebsd.org
Cc: Gleb Smirnoff
Subject: Dynamic per-CPU allocator

Hello list!

It seems it is time to discuss the dynamic part of the pcpu allocator. We
already have a great static one, which permits defining per-cpu
counters/structures statically in the source code, including in modules.
(For reference: it uses the DPCPU_ macros and resides in sys/sys/pcpu.h.)

However, this is not enough, since there are many non-singleton objects
that require dynamically allocated per-cpu counters. The networking stack
is definitely a candidate for such an API: interface counters, netgraph
node counters, and global / per-protocol statistics (the existing DPCPU_
macros can probably be used for the latter).

My routing performance tests show that eliminating contested counters can
give a quite significant speed improvement. For example, after removing
the interface counters and IP statistics (~11 counters total), forwarding
speed increases from 2 Mpps (2 million packets/sec) to 3 Mpps. There are
more details about this particular test in
http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032714.html

On the other hand, a PoC ipfw per-cpu counter implementation shows no
observable overhead between enabled and disabled rule counters:
http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032824.html
Preemption is not disabled there (typically either the netisr thread or
the interrupt routine is already CPU-bound). Disabling preemption via
critical_enter() costs us an 80 kpps drop (with one counter):
http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032835.html

It seems there is no reason for precise accounting of the total number of
bytes forwarded (or fast-forwarded). On the other hand, one may want
precise accounting of per-interface bytes/packets.

So, what we need for the networking stack (from my point of view):

1) Ability to allocate a single pcpu counter (various ng* nodes)
2) Ability to allocate arbitrary structures (per-VNET protocol statistics)
3) Ability to allocate either a contiguous linear pool of objects or to do
   UMA-like per-object allocation (per-interface counters)
4) Ability to use either fast (non-protected) or precise updating

What others do: I've found nothing related in OpenSolaris or DragonFly
(maybe someone knows better?). A good overview of the Linux API:
http://www.makelinux.net/ldd3/chp-8-sect-5

Proposed API (not even a draft, just something to start the discussion
with). We already have the DPCPU_ macros for "static" data (a usage
sketch follows below), but I'm not sure we can keep the same names for
dynamic data.
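
For reference, today's static usage looks roughly like this (a sketch from
memory; sys/sys/pcpu.h and sys/smp.h are authoritative, and "pkts_seen" is
a made-up example counter):

#include <sys/param.h>
#include <sys/pcpu.h>
#include <sys/smp.h>

/* One instance of the counter per CPU, created at compile/link time. */
DPCPU_DEFINE(uint64_t, pkts_seen);

static void
pkts_seen_inc(void)
{

	/*
	 * Update this CPU's copy: no lock, no cache-line bouncing.
	 * Without a critical section the thread may migrate
	 * mid-update, so rare lost updates are possible; this is
	 * the "fast" mode discussed above.
	 */
	DPCPU_SET(pkts_seen, DPCPU_GET(pkts_seen) + 1);
}

static uint64_t
pkts_seen_total(void)
{
	uint64_t total;
	int cpu;

	/* Sum all per-CPU copies when reporting. */
	total = 0;
	CPU_FOREACH(cpu)
		total += DPCPU_ID_GET(cpu, pkts_seen);
	return (total);
}

This only works for objects whose number is known at compile time, which
is exactly what breaks down for interfaces, netgraph nodes and other
runtime-created objects. Note also that the static macros already occupy
the DPCPU_GET/DPCPU_SET names, with different signatures than what is
proposed below.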
We can add:

1)

* DPCPU_ALLOC_CNTR()
* DPCPU_FREE_CNTR()

(not sure whether the existing DPCPU_ macros can be reused here)

2 + 3)

/*
 * Allocate a structure (or several items) of total size "size" with
 * the given alignment "align" and malloc flags "flags".
 * Returns:
 * an array of pointers (mp_maxid or MAXCPU entries) to per-cpu data:
 *
 * +----------------------------------------------
 * | pcpu0    pcpu1    pcpu2    pcpu3   ..pcpuN..
 * +----------------------------------------------
 *      |        |        |        |
 *      v        v        v        v
 *  +------+ +------+ +------+ +------+
 *  |      | |      | |      | |      |
 *  | data | | data | | data | | data |
 *  |      | |      | |      | |      |
 *  +------+ +------+ +------+ +------+
 */
(void *) DPCPU_ALLOC(size_t size, int align, int flags)
DPCPU_FREE(void *)

/*
 * Returns a typed pointer to this CPU's data block.
 * Disables preemption.
 */
type *DPCPU_GET(void *, type)

/*
 * Enables preemption again.
 */
DPCPU_PUT(void *)

/*
 * Returns a typed pointer to this CPU's data block without
 * disabling preemption.
 */
DPCPU_GET_FAST(void *, type)

DPCPU_PUT_FAST(void *)    /* No-op */

/*
 * Get a remote CPU's value.
 */
DPCPU_GET_REMOTE(void *, type, index)    /* Use CPU_FOREACH for summaries */
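
To make the intended usage concrete, here is a hypothetical sketch of
per-interface counters built on the calls above (none of these macros
exist yet; struct if_pcpu_stats and the if_stats_* helpers are invented
for illustration):

#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/smp.h>

/* Hypothetical per-interface statistics block, one copy per CPU. */
struct if_pcpu_stats {
	uint64_t	ipackets;
	uint64_t	ibytes;
};

/* On interface attach: allocate one cache-line-aligned block per CPU. */
static void *
if_stats_alloc(void)
{

	return (DPCPU_ALLOC(sizeof(struct if_pcpu_stats),
	    CACHE_LINE_SIZE, M_WAITOK | M_ZERO));
}

/* Fast path, per packet: update this CPU's copy, preemption enabled. */
static void
if_stats_input(void *stats, u_int plen)
{
	struct if_pcpu_stats *s;

	s = DPCPU_GET_FAST(stats, struct if_pcpu_stats);
	s->ipackets++;
	s->ibytes += plen;
	DPCPU_PUT_FAST(stats);		/* no-op, kept for symmetry */
}

/* Slow path, e.g. statistics readout: sum all per-CPU copies. */
static uint64_t
if_stats_ipackets(void *stats)
{
	struct if_pcpu_stats *s;
	uint64_t total;
	int cpu;

	total = 0;
	CPU_FOREACH(cpu) {
		s = DPCPU_GET_REMOTE(stats, struct if_pcpu_stats, cpu);
		total += s->ipackets;
	}
	return (total);
}

On detach the block would be released with DPCPU_FREE(). Whether the fast
path is good enough is exactly the fast vs. precise choice from point 4:
the PoC numbers above suggest it is for ipfw-style counters, while
precise accounting would use the DPCPU_GET()/DPCPU_PUT() pair and pay the
critical_enter() cost.

-- 
WBR, Alexander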