Date: Tue, 31 Oct 2000 00:39:26 +0000 (GMT)
From: Terry Lambert <tlambert@primenet.com>
To: mjacob@feral.com
Cc: dg@root.com (David Greenman), bmilekic@dsuper.net (Bosko Milekic), freebsd-net@FreeBSD.ORG, freebsd-arch@FreeBSD.ORG
Subject: Re: MP: per-CPU mbuf allocation lists
Message-ID: <200010310039.RAA29844@usr05.primenet.com>
In-Reply-To: <Pine.BSF.4.21.0010301412440.32006-100000@beppo.feral.com> from "Matthew Jacob" at Oct 30, 2000 02:14:03 PM
> > > I recently wrote an initial "scratch pad" design for per-CPU mbuf
> > > lists (in the MP case). The design consists simply of introducing
> > > these "fast" lists for each CPU and populating them with mbufs on
> > > bootup. Allocations from these lists would not need to be protected
> > > with a mutex as each CPU has its own. The general mmbfree list
> > > remains, and remains protected with a mutex, in case the per-CPU
> > > list is empty.
> >
> > I have only one question - is the lock overhead really so high that
> > this is needed?
>
> If you know that you can also pre-busdma wrap these lists (which is
> required for full alpha support, and may(?) be for ia64), yes, this
> makes sense to me (at least). I had a friend at Sun not speak to me
> for years because I didn't do this for the Solaris DKI/DDI.

I really, really like per CPU resource pools. It's one of my SMP
"hot buttons".

I suggest reading the technical white papers at:

	http://www.sequent.com/software/operatingsys/dynix.html

I also recommend "UNIX for Modern Architectures", and chapters 12 and
15 of the Vahalia book:

	UNIX Internals: The New Frontiers
	Uresh Vahalia
	Prentice Hall
	ISBN: 0-13-101908-2

Chapter 15 has an interesting point about LWPs in SVR4/MP, where
kernel mappings use lazy TLB shootdown, but user mappings use an
immediate shootdown policy, to ensure that a mapping going away in one
LWP goes away in the other LWPs.

Bonwick, who wrote the Solaris SLAB allocator, acknowledges in his
paper that the allocator would benefit from per-CPU caches, like those
in Dynix. See also:

	http://www.usenix.org/publications/library/proceedings/bos94/bonwick.html

Unfortunately, the Winter 1993 proceedings, which contain the Dynix
paper, are not online. The authors of that paper hold an insane number
of SMP related patents, including a neat one on "A substantially zero
overhead mutual exclusion apparatus..." that looks like soft updates
applied to concurrency.

Per CPU resource pools with high and low watermarking were part of the
Dynix design.
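To make the idea concrete, here is a minimal sketch of the sort of
thing Bosko is describing: a per-CPU free list backed by a mutex
protected global pool. The names, types, watermarks, and primitives
are purely illustrative, not the actual mbuf code.

/*
 * Minimal sketch of a per-CPU free list with a mutex-protected global
 * pool as fallback.  All names and sizes here are illustrative.
 */
#include <pthread.h>
#include <stddef.h>

#define NCPU		4
#define LOW_WATER	16	/* refill batch size from the global pool */
#define HIGH_WATER	128	/* drain back to global above this */

struct obj {
	struct obj	*next;
};

struct pcpu_cache {
	struct obj	*head;	/* private to one CPU: no lock needed */
	int		 count;
};

static struct pcpu_cache pcpu[NCPU];
static struct obj	*global_head;	/* protected by global_mtx */
static pthread_mutex_t	 global_mtx = PTHREAD_MUTEX_INITIALIZER;

/*
 * Stand-in for the kernel's "which CPU am I on".  A real implementation
 * must also keep the thread from migrating (or being preempted) while
 * it manipulates its per-CPU list.
 */
static int
curcpu(void)
{
	return (0);
}

struct obj *
obj_alloc(void)
{
	struct pcpu_cache *pc = &pcpu[curcpu()];
	struct obj *o;

	if (pc->head == NULL) {
		/* Slow path: pull a batch from the global pool. */
		pthread_mutex_lock(&global_mtx);
		while (pc->count < LOW_WATER && global_head != NULL) {
			o = global_head;
			global_head = o->next;
			o->next = pc->head;
			pc->head = o;
			pc->count++;
		}
		pthread_mutex_unlock(&global_mtx);
		if (pc->head == NULL)
			return (NULL);	/* global pool exhausted too */
	}

	/* Fast path: no mutex, the list belongs to this CPU alone. */
	o = pc->head;
	pc->head = o->next;
	pc->count--;
	return (o);
}

void
obj_free(struct obj *o)
{
	struct pcpu_cache *pc = &pcpu[curcpu()];

	o->next = pc->head;
	pc->head = o;
	pc->count++;

	if (pc->count > HIGH_WATER) {
		/* Drain a batch so idle CPUs don't hoard memory. */
		pthread_mutex_lock(&global_mtx);
		while (pc->count > LOW_WATER) {
			o = pc->head;
			pc->head = o->next;
			pc->count--;
			o->next = global_head;
			global_head = o;
		}
		pthread_mutex_unlock(&global_mtx);
	}
}

The point of the watermarks is that the global mutex is taken at most
once per batch of LOW_WATER allocations or (HIGH_WATER - LOW_WATER)
frees, instead of on every single operation.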
Here is a passage from Vahalia, chapter 12, describing the full Dynix
design (my comments and contextual clues are bracketed):

12.9 A Hierarchical Allocator for Multiprocessors

Memory allocation for a shared-memory multiprocessor raises some
additional concerns. Data structures such as free lists and allocation
bitmaps used by traditional systems are not multiprocessor-safe and
must be protected by locks. In large, parallel systems, this results
in heavy contention for these locks, and CPUs frequently stall while
waiting for these locks to be released.

One solution to this problem is implemented in Dynix, a multiprocessor
UNIX variant for the Sequent S2000 machines. It uses a hierarchical
allocation scheme that supports the System V programming interface.
The Sequent multiprocessors are used in large on-line
transaction-processing environments, and the allocator performs well
under that load.

Figure 12-9 [omitted] describes the design of the allocator. The
lowest (per-CPU) layer allows the fastest operations, while the
highest (coalesce-to-page) layer is for the time-consuming coalescing
process. There is also (not shown) a coalesce-to-vmblock layer, which
manages page allocation within large (4MB-sized) chunks of memory.
[the diagram has a middle global layer, from which the per-CPU caches
are refilled]

The per-CPU layer manages one set of power-of-two pools for each
processor. These pools are insulated from the other processors, and
hence can be accessed without acquiring global locks. Allocation and
release are fast in most cases, as only the local freelist is
involved. Whenever the per-CPU freelist becomes empty, it can be
replenished from the global layer, which maintains its own
power-of-two pools. Likewise, excess buffers in the per-CPU cache can
be returned to the global free list.

As an optimization, buffers are moved between these two layers in
target-sized groups (three buffers per move in the case shown in
Figure 12-9), preventing unnecessary linked-list operations. To
accomplish this, the per-CPU layer maintains two free lists, main and
aux. Allocation and release primarily use the main free list. When
this becomes empty, the buffers on aux are moved to main, and the aux
list is replenished from the global layer. Likewise, when the main
list overflows (size exceeds target), it is moved to aux, and the
buffers on aux are returned to the global layer. This way the global
layer is accessed at most once per target number of accesses. The
value of target is a tunable parameter. Increasing target reduces the
number of global accesses, but ties up more buffers in per-CPU caches.

...

12.9.1 Analysis

The Dynix algorithm provides efficient memory allocation for
shared-memory multiprocessors. It supports the standard System V
interface, and allows memory to be exchanged between the allocator and
the paging system. The per-CPU caches reduce the contention on the
global lock, and the dual free lists provide a fast exchange of
buffers between the per-CPU and global layers.

It is interesting to contrast the Dynix coalescing approach with that
of the Mach zone-based allocator. The Mach algorithm employs a
mark-and-sweep method, linearly scanning the entire pool each time.
This is computationally expensive, and hence is relegated to a
separate background task. In Dynix, each time blocks are released
[from the global layer] to the coalesce-to-page layer, the per-page
data structures are updated to account for them. When all the buffers
in a page are freed, the page can be returned to the paging system.
This happens in the foreground, as part of the processing of release
operations. The incremental cost for each release operation is small;
hence it does not lead to unbounded worst-case performance [unlike the
Mach allocator].
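The main/aux trick is worth seeing in code. A minimal sketch of one
per-CPU power-of-two pool follows; the global layer is elided (assume
global_get_batch() returns a chain of exactly 'target' buffers, or
NULL, under the global layer's own lock, and global_put_batch() takes
a whole chain back). All names are mine, not Dynix's.

/*
 * Sketch of Dynix-style dual free lists (main/aux) for one per-CPU
 * power-of-two pool.
 */
#include <stddef.h>

struct buf {
	struct buf	*next;
};

struct pcpu_pool {
	struct buf	*main;		/* alloc/free fast paths use this */
	int		 main_count;
	struct buf	*aux;		/* empty, or a full batch */
	int		 aux_count;
	int		 target;	/* tunable batch size (3 in Fig. 12-9) */
};

extern struct buf	*global_get_batch(int target);
extern void		 global_put_batch(struct buf *chain, int count);

struct buf *
pool_alloc(struct pcpu_pool *p)
{
	struct buf *b;

	if (p->main == NULL) {
		if (p->aux != NULL) {
			/* Swap: aux becomes main... */
			p->main = p->aux;
			p->main_count = p->aux_count;
			/* ...and aux is refilled from the global layer. */
			p->aux = global_get_batch(p->target);
			p->aux_count = (p->aux != NULL) ? p->target : 0;
		} else {
			p->main = global_get_batch(p->target);
			p->main_count = (p->main != NULL) ? p->target : 0;
		}
		if (p->main == NULL)
			return (NULL);	/* global layer dry */
	}
	b = p->main;
	p->main = b->next;
	p->main_count--;
	return (b);
}

void
pool_free(struct pcpu_pool *p, struct buf *b)
{
	b->next = p->main;
	p->main = b;
	p->main_count++;

	if (p->main_count > p->target) {
		/*
		 * Main overflowed: the old aux batch goes back to the
		 * global layer in one splice, and main becomes the new
		 * aux.  The next target frees build a fresh main list.
		 */
		if (p->aux != NULL)
			global_put_batch(p->aux, p->aux_count);
		p->aux = p->main;
		p->aux_count = p->main_count;
		p->main = NULL;
		p->main_count = 0;
	}
}

Note how the global layer is touched at most once per target
operations, and always with a whole chain at a time, which is the
"preventing unnecessary linked-list operations" point above.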
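And the foreground coalescing amounts to per-page reference counting.
Another illustrative sketch, where page_of() and page_release() stand
in for the real VM bookkeeping:

/*
 * Sketch of the foreground coalesce-to-page bookkeeping: each page
 * keeps a count of how many of its buffers have come home.
 */
struct page_desc {
	int	free_count;	/* buffers from this page now free */
	int	total;		/* buffers originally carved from it */
};

extern struct page_desc	*page_of(void *buf);
extern void		 page_release(struct page_desc *pd);

/* Called each time the global layer releases a buffer downward. */
void
coalesce_release(void *buf)
{
	struct page_desc *pd = page_of(buf);

	/* O(1) incremental work per release; no mark-and-sweep scan. */
	if (++pd->free_count == pd->total)
		page_release(pd);	/* whole page back to the VM system */
}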
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.