Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 31 Oct 2000 00:39:26 +0000 (GMT)
From:      Terry Lambert <tlambert@primenet.com>
To:        mjacob@feral.com
Cc:        dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
Subject:   Re: MP: per-CPU mbuf allocation lists
Message-ID:  <200010310039.RAA29844@usr05.primenet.com>
In-Reply-To: <Pine.BSF.4.21.0010301412440.32006-100000@beppo.feral.com> from "Matthew Jacob" at Oct 30, 2000 02:14:03 PM

next in thread | previous in thread | raw e-mail | index | archive | help
> > >  	I recently wrote an initial "scratch pad" design for per-CPU mbuf
> > >  lists (in the MP case). The design consists simply of introducing
> > >  these "fast" lists for each CPU and populating them with mbufs on bootup.
> > >  Allocations from these lists would not need to be protected with a mutex
> > >  as each CPU has its own. The general mmbfree list remains, and remains
> > >  protected with a mutex, in case the per-CPU list is empty.
> > 
> >    I have only one question - is the lock overhead really so high that this
> > is needed?
> 
> If you know that you can also pre-busdma wrap these lists (which is required
> for full alpha support, and may(?) be for ia64), yes, this makes sense to me
> (at least). I had a friend at Sun not speak to me for years because I didn't
> do this for the Solaris DKI/DDI.

I really, really like per CPU resource pools.  It's one of my SMP
"hot buttons".

I suggest reading the technical white papers at:

	http://www.sequent.com/software/operatingsys/dynix.html

I also recommend "UNIX for Modern Architectures", and chapters
12 and 15 of the Vahalia book:

	UNIX internals: the new frontiers
	Uresh Vahalia
	Prentice Hall
	ISBN: 0-13-101908-2

Chaper 15 has an interesting point about LWPs in SVR4/MP, where
kernel mappings use lazy TLB shootdown, but user mappings use
an immediate shootdown policy, to insure that a mapping going
away in one LWP goes away in other LWPs.

Bonwick, who wrote the Solaris SLAB allocator, indicates in his
paper about the allocator acknowledges that the allocator would
benefit from per-CPU caches, like those in Dynix.

See also:

http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html

Unfortunately, the Winter 1993 proceedings, which has the Dynix
paper, are not online.  The authors of the paper have an insane
number of SMP related patents, including a neat one on "A substantially
zero overhead mutual exclusion apparatus..." that looks like soft
updates applied to concurrency.

Per CPU resource pools with high and low watermarking were part
of the Dynix design.  Here is a passage from Vahalia, chapter 12,
about it (my comments and contextual clues are bracketed):

    12.9 A Hierarchical Allocator for Multiprocessors

    Memory allocation for a shared-memory multiprocessor raises
    some additional concerns.  Data structures such as free lists
    and allocation bitmaps used by traditional systems are not
    multiprocessor-safe and must be protected by locks.  In large,
    parallel systems, this results in heavy contention for these
    locks, and CPUs frequently stall while waiting for these locks
    to be released.

    One solution to this problem is implemeneted in Dynix, a
    multiprocessor UNIX variant for the Sequent S2000 machines.
    It uses a hierarchical allocation scheme that supports the
    System V programming interface.  The sequent multiprocessors
    are used in large on-line transaction-processing environments,
    and the allocator performs well under that load.

    Figure 12-9 [omitted] describes the design of the allocator.
    The lowest (per-CPU) layer allows the fastest operations,
    while the highest (coalesce-to-page) layer is for the time-
    consuming coalescing process.  There is also (not shown) a
    coalesce-to-vmblock layer, which manages page allocation
    within large (4MB-sized) chunks of memory.

[the diagram has a middle global layer, from which the per-CPU
 caches are refilled]

    The per-CPU layer manages one set of power-of-two pools for
    each processor.  These pools are insulated from the other
    processors, and hence can be accessed without acquiring
    global locks.  Allocation and release are fast in most cases,
    as only the local freelist is involved.

    Whenever the per-CPU freelist becomes empty, it can be
    replenished from the global layer, which maintains its own
    power-of-two pools.  Likewise, excess buffers in the per-CPU
    cache can be returned to the global free list.  As an
    optimization, buffers are moved between these two layers in
    target-sized groups (three buffers per move in the case
    shown in Figure 12-9), preventing unnecessary linked-list
    operations.

    To accomplish this, the per-CPU layer maintains two free
    lists, main and aux.  Allocation and release primarily
    use the main free list.  When this becomes empty, the
    buffers on aux are moved to main, and the aux list is
    replenished from the global layer.  Likewise, when the main
    list overflows (size exceeds target), it is moved to aux,
    and the buffers on aux are returned to the global layer.
    This way the global layer is accessed at most once per
    target-number of accesses.  The value of target is a
    tunable parameter.  Increasing target reduces the number
    of global accesses, but ties up more buffers in per-CPU
    caches.

...

    12.9.1 Analysis

    The Dynix algorithm provides efficient memory allocation
    for shared memory multiprocessors.  It supports the standard
    System V interface, and allows memory to be exchanged between
    the allocator and the paging system.  The per-CPU caches
    reduce the contention on the global lock, and the dual free
    lists provide a fast exchange of buffers between the per-CPU
    and global layers.

    It is interesting to contrast the Dynix coalescing approach
    with that of the Mach zone-based allocator.  The Mach
    algorithm employs a mark-and-sweep method, linearly scanning
    the entire pool each time.  This is computationally expensive,
    and hence is relegated to a separate background task.  In
    Dynix, each time blocks are released [from the global layer]
    to the coalesce-to-page layer,the per page data structures are
    updated to accont for them.  When all the buffers in a page
    are freed, the page can be returned to the paging system.
    This happens in the foreground, as part of the processing of
    release operations.  The incremental cost for each release
    operation is small; hence it does not lead to unbounded worst
    case performance [unlike the Mach allocator].


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200010310039.RAA29844>