Date: Sun, 26 Jan 1997 14:05:45 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: proff@suburbia.net
Cc: hackers@freebsd.org
Subject: Re: SLAB stuff, and applications to current net code (fwd)
Message-ID: <199701262105.OAA02273@phaeton.artisoft.com>
In-Reply-To: <19970126042316.10096.qmail@suburbia.net> from "proff@suburbia.net" at Jan 26, 97 03:23:16 pm
> > I'm going to try feverishly to get the SLAB allocator integrated
> > into Linus's sources over the next two days.  For the most part my
> > incentive is so that people think about it when they design memory
> > object allocation subsystems.
> >
> > For example, even right now, look at the way struct sock's are
> > allocated.  Alan's recent change to add sock_init_data() and the
> > fact that my sources already use SLAB for struct sock sparked this
> > idea.
> >
> > We could in this case just make sock_init_data() the constructor
> > routine for the sock SLAB.  So for a warm SLAB cache this code
> > never gets run, as long as users of sock's are not forgetful and
> > leave a sock in a reasonable state when they free it (i.e., don't
> > leave crap on the receive queue, etc.).
>
> Can anyone inform me what a SLAB allocator is, and if so, would
> FreeBSD benefit from one?

In simple terms, it allocates "slabs" of memory for memory pools of a
particular type, generally in units of pages.  It is basically a
variant of the zone allocator (per MACH).

You can implement memory zoning, per MACH, to tag object persistence
within a SLAB allocator.  Assuming you allocate the kernel space
itself with the SLAB allocator, and the kernel image is assembled on
SLAB boundaries, you can even do things like zone discard, which lets
you throw away initialization code once the system is up and reclaim
the memory for reuse by the system.  One of the reasons I want ELF is
to allow zone discard.

Technically, FreeBSD already does SLAB allocation, or at least its
interface looks like it does: the sys/malloc.h values M_MBUF,
M_SOCKET, M_NAMEI, etc., used by the kernel malloc are all SLAB
identifiers.

---

In a kernel multithreading or SMP environment, you don't *want* a
SLAB allocator... at least, not a pure one, as the base level for
allocation.  You want a global page-pool allocator instead, and then
you implement your allocator on top of the page pool.  There are even
reasons you might want to do this on a simple UP system.

Because the kernel is reentrant, you can reenter the allocator on a
SLAB id.  FreeBSD currently copes with this (badly) in its
fault/interrupt reentrancy case... for example, allocating an mbuf at
interrupt level requires that the mbuf SLAB be guarded against
reentrancy by running the allocation to completion at high SPL, so
that it is never really reentered while the data structure is in an
indeterminate state.  This adds to processing latency in the kernel.

In reality, each context wants its own set of SLABs.  This may be
overkill for exception reentrancy, actually.  The way Microsoft does
this in Windows95 and NT is to divide the interrupt service routines
into "upper" and "lower" halves, with "lower" running in interrupt
mode and "upper" running in non-interrupt mode.  In Windows95 and NT,
you do not do memory allocation at interrupt level; you must
preallocate the memory at driver init, and allocate it again in
"upper" code if the preallocated object is consumed by the "lower"
code.

If you don't want to place this restriction on the use of allocators
(there are several places where the preallocation overhead would be
exorbitant, like FDDI or ATM with large card buffers), then you must
provide a separate set of SLABs per context.

In terms of context SLABs, you *do* want separate SLABs for each
processor context in SMP.
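Going back to the constructor idea in the quoted message for a
moment, here is a minimal user-space sketch of a SLAB-style cache
with a constructor hook.  All the names (slab_cache_create(),
slab_alloc(), and so on) are invented for illustration; this is not
the Solaris kmem interface or anything in the FreeBSD tree.  The
point is that the constructor runs once per object, when a slab is
carved up, so the warm-cache path hands back an already-initialized
object:

	#include <stddef.h>
	#include <stdlib.h>

	struct slab_cache {
		size_t	objsize;	/* size of one object */
		size_t	perslab;	/* objects carved from each slab */
		void	(*ctor)(void *);/* one-time constructor */
		void	**free;		/* stack of free object pointers */
		size_t	nfree, cap;
	};

	static int
	slab_refill(struct slab_cache *c)
	{
		void **nf;
		char *slab;
		size_t i;

		nf = realloc(c->free, (c->cap + c->perslab) * sizeof(*nf));
		if (nf == NULL)
			return (-1);
		c->free = nf;
		c->cap += c->perslab;
		slab = malloc(c->objsize * c->perslab);	/* one new "slab" */
		if (slab == NULL)
			return (-1);
		for (i = 0; i < c->perslab; i++) {
			void *obj = slab + i * c->objsize;

			if (c->ctor != NULL)
				c->ctor(obj);	/* runs ONCE per object */
			c->free[c->nfree++] = obj;
		}
		return (0);
	}

	struct slab_cache *
	slab_cache_create(size_t objsize, size_t perslab,
	    void (*ctor)(void *))
	{
		struct slab_cache *c;

		c = calloc(1, sizeof(*c));
		if (c == NULL)
			return (NULL);
		c->objsize = objsize;
		c->perslab = perslab;
		c->ctor = ctor;
		return (c);
	}

	void *
	slab_alloc(struct slab_cache *c)
	{
		if (c->nfree == 0 && slab_refill(c) == -1)
			return (NULL);
		/* warm-cache path: no constructor, just pop the stack */
		return (c->free[--c->nfree]);
	}

	void
	slab_free(struct slab_cache *c, void *obj)
	{
		/* caller must leave the object in constructed state */
		c->free[c->nfree++] = obj;
	}

With struct sock as the object and sock_init_data() as the ctor, a
warm cache never reruns the initialization, which is exactly the win
described in the quoted message.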
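And here is a single-threaded model of what those per-processor (or
per-context) pools might look like, with low and high water marks
against a shared global pool, and a per-CPU message queue for
cross-CPU frees.  Again, everything here (pool_alloc(), the water
mark numbers, the "home" field standing in for the per-object
allocation map) is invented for illustration; a real kernel would
protect global_pool with a lock or an IPI rendezvous, and would drain
remote_free at IPI synchronization time:

	#include <stdlib.h>

	#define NCPU		4
	#define POOL_LOW	8	/* refill private pool below this */
	#define POOL_HIGH	64	/* drain back to global above this */
	#define POOL_BATCH	16	/* objects moved per refill/drain */

	struct obj {
		struct obj	*next;
		int		home;	/* stand-in for the allocation map */
	};

	/* Global pool: in a kernel, this is the synchronized part. */
	static struct obj *global_pool;

	struct cpu_pool {
		struct obj	*freelist;	/* private; no locks on fast path */
		size_t		nfree;
		struct obj	*remote_free;	/* dealloc messages from peers */
	};
	static struct cpu_pool cpu_pool[NCPU];

	static void
	global_push(struct obj *o)
	{
		o->next = global_pool;	/* kernel: line up behind the IPI */
		global_pool = o;
	}

	static struct obj *
	global_pop(void)
	{
		struct obj *o = global_pool;

		if (o != NULL)
			global_pool = o->next;
		else
			o = malloc(sizeof(*o));	/* model page-pool growth */
		return (o);
	}

	void *
	pool_alloc(int cpu)
	{
		struct cpu_pool *p = &cpu_pool[cpu];
		struct obj *o;
		int i;

		if (p->nfree < POOL_LOW)	/* low water mark: go global */
			for (i = 0; i < POOL_BATCH; i++) {
				if ((o = global_pop()) == NULL)
					break;
				o->next = p->freelist;
				p->freelist = o;
				p->nfree++;
			}
		if ((o = p->freelist) == NULL)
			return (NULL);
		p->freelist = o->next;
		p->nfree--;
		o->home = cpu;
		return (o);
	}

	void
	pool_free(int cpu, void *v)
	{
		struct obj *o = v;
		struct cpu_pool *p = &cpu_pool[o->home];
		int i;

		if (o->home != cpu) {
			/* cross-CPU free: queue a message for the owner */
			o->next = p->remote_free;
			p->remote_free = o;
			return;
		}
		o->next = p->freelist;
		p->freelist = o;
		if (++p->nfree > POOL_HIGH)	/* high water mark: drain */
			for (i = 0; i < POOL_BATCH; i++) {
				o = p->freelist;
				p->freelist = o->next;
				p->nfree--;
				global_push(o);
			}
	}

	void
	pool_drain_remote(int cpu)	/* owner runs this at IPI time */
	{
		struct obj *o;

		while ((o = cpu_pool[cpu].remote_free) != NULL) {
			cpu_pool[cpu].remote_free = o->next;
			pool_free(cpu, o);
		}
	}

The fast path never touches shared state; the line-up behind the IPI
happens only at the water marks, which is where the Dynix numbers
cited below come from.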
The reason you want per-processor SLABs is that with a global SLAB
pool, like in SVR4 or Solaris, you must line up all your processors
behind the IPI in order to synchronize access to the SLAB control
object for the object type you are allocating.

The correct method, first demonstrated in Sequent Dynix, is to have
per-processor page pools.  The synchronization is not required unless
you need to refill the page pool from (low water mark) or drain the
page pool to (high water mark) the global system page pool.  This
method is documented in:

	"UNIX Internals: The New Frontiers"
	Uresh Vahalia, Prentice Hall
	ISBN 0-13-101908-2
	Chapter 12, "Kernel Memory Allocation"
	Section 12.9, "A Hierarchical Allocator for Multiprocessors"

Sequent failed to implement SLAB allocation on top of this page pool
abstraction, and so Vahalia's analysis of it is rather harsh compared
to his analysis of SLAB allocation (covered in Section 12.10).  But
it is incorrect to call the SLAB allocation itself superior.  Vahalia
cites a paper:

	"Efficient Kernel Memory Allocation on Shared Memory
	 Multiprocessors"
	McKenney, P.E. and Slingwine, J.
	Proceedings of USENIX, Winter 1993

which shows the Sequent code to be faster than the McKusick-Karels
algorithm by a factor of three to five on a UP system, and by a
factor of one hundred to one thousand on a 25-processor system.

Clearly, if we considered contexts as owning pools instead of CPUs,
we should expect a three- to nine-times improvement for UP BSD from
having separate contexts for interrupt vs. exception vs. normal
allocations (in place of running the allocations to completion at
high SPL).  This might not have more than a 10% scaled effect on a
heavily interrupting FreeBSD system, but a 10% improvement is an
improvement.

There are a number of issues, like object garbage collection, which
you would handle using cache invalidation at IPI synchronization time
to determine whether a low water mark hit is real or not.  For
instance, I may allocate an mbuf on CPU 1 and pass its address to
CPU 2 for use by a user process context entered in the TCP code.  If
the CPU 2 process then deallocates the mbuf, the cache line
indicating the allocation will not have been invalidated.

Effectively, this means that there must be an allocation map with
each object, and that it must be dynamically scoped.  This lets CPU 2
mark the map entry as invalid, even though CPU 1 did the allocating.
CPU 1 would sync its picture with the global picture at the low water
mark, reclaiming released buffers at that time.

In reality, it's probably better to have a message interface per SLAB
per CPU to propagate deallocation messages... if you didn't do that,
then a deallocation on CPU 1 or CPU 3 could cause a corrupt cache
line to be written.  Rather than fighting between CPUs and being
careful at the cache line level, the IPI synchronization should allow
message delivery at IPI; this will generally be very fast, since the
memory can be preallocated with page attributes so that queue pointer
ownership is toggled at IPI time.

We can go into this in detail if anyone wants to, but the SMP list is
probably a better forum for these issues.

					Regards,
					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.