Date:      Fri, 20 Jul 2012 18:57:53 +0400
From:      "Alexander V. Chernikov" <melifaro@FreeBSD.org>
To:        arch@freebsd.org
Cc:        Gleb Smirnoff <glebius@FreeBSD.org>
Subject:   Dynamic Per-cpu allocator
Message-ID:  <500971F1.5070907@FreeBSD.org>

Hello list!

It seems it is time to discuss the dynamic part of the pcpu allocator.

We already have a great static one, which allows per-cpu
counters/structures to be defined statically in the source code,
including in modules.
(For reference, it uses the DPCPU_ macros and resides in sys/sys/pcpu.h.)

However, this is not enough, since there are many non-singleton objects 
that require dynamic allocation of per-cpu counters.

The networking stack is definitely a candidate for using such an API
(interface counters, netgraph node counters, global / per-protocol
statistics (it seems the existing DPCPU_ macros can be used for the
latter)).

My routing performance tests show that eliminating contended counters 
can give a quite significant speed improvement. For example, after 
removing interface counters and IP statistics (~ 11 counters total), 
forwarding speed increases from 2MPPS (2 million packets/sec) to 3MPPS.

There are more details about this particular test in
http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032714.html
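
To make the contention argument concrete, here is a minimal userland
sketch (not kernel code, and not taken from the test above). Each CPU
gets its own counter slot, aligned to an assumed 64-byte cache line, so
an update never touches a line owned by another CPU. All names
(sketch_counter, SKETCH_NCPU, SKETCH_CACHELINE) are hypothetical, and the
CPU index is passed in explicitly instead of being read from curcpu.

```c
#include <stdint.h>

#define SKETCH_NCPU      4	/* hypothetical CPU count for this sketch */
#define SKETCH_CACHELINE 64	/* assumed cache line size in bytes */

/* One slot per CPU; the alignment also pads each slot to a full line. */
struct sketch_slot {
	_Alignas(SKETCH_CACHELINE) uint64_t value;
};

struct sketch_counter {
	struct sketch_slot slot[SKETCH_NCPU];
};

/* Update path: touches only the slot that belongs to "cpu". */
static inline void
sketch_counter_add(struct sketch_counter *c, int cpu, uint64_t n)
{
	c->slot[cpu].value += n;
}

/*
 * Read path: sums all slots. The result is only approximate while
 * other CPUs are still updating their slots.
 */
static inline uint64_t
sketch_counter_fetch(const struct sketch_counter *c)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < SKETCH_NCPU; cpu++)
		sum += c->slot[cpu].value;
	return (sum);
}
```

The cost moves from the hot update path to the rare read path, which has
to walk all slots.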

On the other hand, a PoC implementation of per-cpu counters in ipfw shows
no observable overhead between enabled and disabled rule counters:

http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032824.html

Preemption is not disabled there (typically either the netisr thread or
the isr routine is already cpu-bound).

Disabling preemption via critical_enter() costs us an 80kpps drop (with
one counter):
http://lists.freebsd.org/pipermail/freebsd-net/2012-July/032835.html

There seems to be no reason for precise accounting of the total number 
of bytes forwarded (or fastforwarded). On the other hand, one may want 
to account interface bytes/packets.

So, what do we need for the networking stack (from my point of view):

1) Ability to allocate a single pcpu counter (various ng* nodes)
2) Ability to allocate arbitrary structures (per-VNET protocol statistics)
3) Ability to allocate either a contiguous linear pool for objects or to
use uma-like allocation (per-interface counters)
4) Ability to use either fast (non-protected) or precise updating
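
The difference between the two update modes in 4) can be sketched in
userland as follows. The fast path is a plain load/add/store, which can
lose an update if the thread is preempted between the load and the
store. For the precise path the kernel would use critical_enter() or
something similar; that is not available in userland, so this sketch
substitutes a C11 atomic add as a stand-in. The sketch_* names are
hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

struct sketch_acct {
	uint64_t         fast;		/* plain load/add/store */
	_Atomic uint64_t precise;	/* atomic read-modify-write */
};

static inline void
sketch_acct_init(struct sketch_acct *a)
{
	a->fast = 0;
	atomic_init(&a->precise, 0);
}

/* Fast update: cheapest, but an update may be lost under preemption. */
static inline void
sketch_add_fast(struct sketch_acct *a, uint64_t n)
{
	a->fast += n;
}

/* Precise update: never loses an increment (userland stand-in). */
static inline void
sketch_add_precise(struct sketch_acct *a, uint64_t n)
{
	atomic_fetch_add_explicit(&a->precise, n, memory_order_relaxed);
}

static inline uint64_t
sketch_read_precise(struct sketch_acct *a)
{
	return (atomic_load_explicit(&a->precise, memory_order_relaxed));
}
```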


What others do:
I've found nothing related in OpenSolaris or DragonFly (maybe someone
knows better?)
A good overview of the Linux API: http://www.makelinux.net/ldd3/chp-8-sect-5


Proposed API (not even a draft, just to start discussion with something):

We already have the DPCPU_ macros for "static" data, but I'm not sure if
we can keep the same names for dynamic data.

We can add
1)
* DPCPU_ALLOC_CNTR()
* DPCPU_FREE_CNTR()
(not sure if the existing DPCPU_ macros can be used)

2 + 3)
/*
  * Allocate a structure (or several items) of total size "size" with
  * the given alignment "align" and malloc flags "flags".
  * Returns:
  * array of pointers (of mp_maxid or MAXCPU size) to per-cpu data:

   +-------------------------------------------
   |pcpu0    pcpu1    pcpu2    pcpu3   .pcpuN..
   +-------------------------------------------
      +        +        +        +
   +------+ +------+ +------+ +------+
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   |      | |      | |      | |      |
   +------+ +------+ +------+ +------+
  */

(void *) DPCPU_ALLOC(size_t size, int align, int flags)
DPCPU_FREE(void *)
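
A minimal userland sketch of the layout in the diagram above, assuming
nothing about the eventual kernel implementation: one table of pointers
and one separately allocated, aligned block per CPU. The names
sketch_pcpu_alloc(), sketch_pcpu_free() and SKETCH_NCPU are hypothetical,
the malloc flags argument is left out, and C11 aligned_alloc() stands in
for whatever backing allocator the kernel would use.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_NCPU 4	/* hypothetical; stands in for mp_maxid or MAXCPU */

/*
 * Allocate one zeroed block of at least "size" bytes per CPU, each
 * aligned to "align" (a power of two), and return the table of per-CPU
 * pointers. Returns NULL on bad arguments or allocation failure.
 */
static void **
sketch_pcpu_alloc(size_t size, size_t align)
{
	void **tbl;
	size_t rounded;
	int cpu;

	if (size == 0 || align == 0 || (align & (align - 1)) != 0)
		return (NULL);
	if (align < sizeof(void *))
		align = sizeof(void *);
	/* aligned_alloc() wants the size to be a multiple of the alignment. */
	rounded = (size + align - 1) / align * align;

	tbl = calloc(SKETCH_NCPU, sizeof(*tbl));
	if (tbl == NULL)
		return (NULL);
	for (cpu = 0; cpu < SKETCH_NCPU; cpu++) {
		tbl[cpu] = aligned_alloc(align, rounded);
		if (tbl[cpu] == NULL) {
			/* Undo the blocks allocated so far. */
			while (--cpu >= 0)
				free(tbl[cpu]);
			free(tbl);
			return (NULL);
		}
		memset(tbl[cpu], 0, rounded);
	}
	return (tbl);
}

/* Release every per-CPU block and then the pointer table itself. */
static void
sketch_pcpu_free(void **tbl)
{
	if (tbl == NULL)
		return;
	for (int cpu = 0; cpu < SKETCH_NCPU; cpu++)
		free(tbl[cpu]);
	free(tbl);
}
```

Separate blocks keep the data of different CPUs on different cache
lines as long as "align" is at least the cache line size.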

/*
  * Returns typed pointer to per-cpu data block.
  * Disables preemption
*/
type *DPCPU_GET(void *, type)

/*
  * Enables preemption again
  */
DPCPU_PUT(void *)


/*
  * Returns typed pointer to per-cpu data block without
  * disabling preemption
  */
DPCPU_GET_FAST(void *, type)

DPCPU_PUT_FAST(void *) /* No-op */

/*
  * Get remote cpu value
  */
DPCPU_GET_REMOTE(void *, type, index)

/* Use CPU_FOREACH for summary */
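
To show how the accessors above could fit together, here is a userland
sketch, again only as a starting point for discussion. The SKETCH_*
macros mirror the proposed DPCPU_* names; sketch_curcpu and
sketch_crit_depth are hypothetical stand-ins for curcpu and the
critical section nesting level, and struct sketch_ifstat is an invented
example of per-interface counters. The plain loop in
sketch_total_packets() plays the role of CPU_FOREACH.

```c
#include <stdint.h>

#define SKETCH_NCPU 4			/* hypothetical CPU count */

struct sketch_ifstat {			/* hypothetical per-interface counters */
	uint64_t packets;
	uint64_t bytes;
};

static int sketch_curcpu;		/* stand-in for curcpu */
static int sketch_crit_depth;		/* stand-in for critical nesting */

static void sketch_critical_enter(void) { sketch_crit_depth++; }
static void sketch_critical_exit(void)  { sketch_crit_depth--; }

/* Precise access: preemption is "disabled" between GET and PUT. */
#define SKETCH_GET(tbl, type) \
	(sketch_critical_enter(), (type *)(tbl)[sketch_curcpu])
#define SKETCH_PUT(tbl)			sketch_critical_exit()

/* Fast access: no protection, PUT is a no-op. */
#define SKETCH_GET_FAST(tbl, type)	((type *)(tbl)[sketch_curcpu])
#define SKETCH_PUT_FAST(tbl)		((void)(tbl))

/* Access to the block that belongs to another CPU. */
#define SKETCH_GET_REMOTE(tbl, type, cpu)	((type *)(tbl)[(cpu)])

/* Summary: visit every CPU and add up one field. */
static uint64_t
sketch_total_packets(void **tbl)
{
	uint64_t sum = 0;

	for (int cpu = 0; cpu < SKETCH_NCPU; cpu++)
		sum += SKETCH_GET_REMOTE(tbl, struct sketch_ifstat,
		    cpu)->packets;
	return (sum);
}
```

A caller would bracket each update with SKETCH_GET()/SKETCH_PUT() (or
the _FAST pair) and use sketch_total_packets() only on the reporting
path.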


-- 
WBR, Alexander



