From: Bosko Milekic <bmilekic@technokratis.com>
To: freebsd-arch@freebsd.org
Subject: Mbuf slab [new allocator]
Date: Thu, 3 May 2001 19:59:04 -0400 (EDT)
Message-ID: <20010503195904.A53281@technokratis.com>

Hello,

Anyone interested in the mbuf subsystem code should probably read this. Others may still read it, but it is somewhat longer than your average Email, so consider this a warning. :-) Also, although I tried my best to cover most issues here, feel free to let me know if I should clarify some points.

Not so long ago, as I'm sure some of you remember, Alfred committed a patch to the mbuf code that collapsed what were then three separate mutex locks - one for the mbuf free list, one for the cluster free list, and one for the m_ext counter free list - into a single mbuf lock covering all three free lists. After an argument with Alfred, I caved (as I sometimes do) and agreed with his assertion that, overall, this would be a good move: we would probably move to a slab allocator for the mbuf subsystem, and when we did, a single lock for all three object types would reduce cache-invalidation ping-ponging from one CPU to the other.
Earlier still, Alfred had written what he calls the "memcache allocator." I think it presently needs work to run (I'm not sure it even compiles in its current state). He put the code here:

	http://people.freebsd.org/~alfred/memcache/

After reading the memcache code, I was generally happy with it. Alfred has provided documentation backing up the overall performance of the slab allocator [that is, not necessarily the performance of `memcache' itself, but rather the concepts that memcache was built on], and after thinking about the whole thing for a while I agreed with Alfred that a slab allocator would be a worthy thing to have for mbuf and mcluster allocations. However, some things clearly needed to be done differently from memcache, and I wanted to keep the flexibility to micro-optimize the mbuf-related allocations in the future.

Also, mbuf-related allocations work differently from `regular' allocations: the status quo is that we do not free VM pages back to the map that we allocate them from. Once we allocate a page, we take the address space from the map, wire it down, and never return the address space. This saves us from having to free back to VM when freeing mbufs - a relatively expensive task - and allows us to keep mbufs and mclusters `cached' on free lists, from which we can allocate quickly. Sure, memcache provides the option of ensuring that a certain (tunable) number of objects stays cached on free lists, but when memcache cannot find a page and needs to allocate one, it relies entirely on the VM code to do the blocking for it, until a page is freed back to the map or until one can be found (depending on the nature of the exhaustion). For the mbuf allocator, we needed to deal with a third type of exhaustion: with mbuf allocations, you have a finite amount of address space which shrinks as the number of allocations grows.
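To make that third kind of exhaustion concrete, here is a minimal userland model (all names and constants here are made up for illustration, not the actual kernel code): address space taken from the map is wired and never returned, so the map can only shrink, and once it is exhausted the allocator's only recourse is to wait for consumed objects to come back to its free lists.

```c
#include <assert.h>

#define MB_MAP_PAGES 4          /* pretend address-space limit of the map */

static int pages_used;          /* pages wired so far; never given back */

/* Model of a "grab a page from the map" request: it succeeds while the
 * map still has address space, and that space is consumed permanently. */
static int
map_page_alloc(void)
{
	if (pages_used < MB_MAP_PAGES) {
		pages_used++;   /* address space is gone for good */
		return (1);
	}
	return (0);             /* map exhausted: the caller must instead
				 * wait for objects freed to the lists */
}
```

Nothing ever decrements pages_used; that is the point - the map is a one-way resource, and blocking for it once it is empty would never wake up.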
So, when you see that you have no more mbufs or mclusters to allocate, you have to deal with the problem in one of two ways:

1 - either you have address space left in your map, in which case you are entitled to a page and may very well block in the VM code until you get it; or

2 - you have already allocated all the pages that you are entitled to (i.e. no more address space), and so your only option is to drain the protocols and block waiting for a consumed object (mbuf or cluster) to become available. This blocking is done in the mbuf code and is implemented with a condition variable.

So clearly, these two conditions needed to be maintained. Furthermore, I wanted to leave open the possibility of implementing some sort of `lazy freeing' of pages, done periodically in the spirit of your classic garbage collector. In short, I wanted memcache with several twists and turns, as is usually required for the mbuf code. So, thanks to Alfred's patience in answering some questions I had, I decided to write an mbuf slab-like allocator. I put a relatively early version of the code here:

	http://people.freebsd.org/~bmilekic/code/mb_slab/

If you're still with me, a few notes regarding the code:

- I split the allocator code out into kern_mbuf.c and left the mbuf utility routines in uipc_mbuf.c.

- The mbuf.h macros are a lot cleaner/smaller; I plan to make them even smaller eventually.

- Matt Dillon had some really neat suggestions for implementing very fast allocations from the per-CPU lists that the allocator maintains. Some complications arose from the fact that we have pre-emption enabled. Work-arounds were also proposed, but I left the optimization out of this code for now. We can deal with it later, and I am interested in perhaps inlining, if size permits, a portion of the `quick allocation' path in the macros once more, but for now I'd rather it remain a function call in all cases.
- The general allocation and free cases are very fast - very comparable to what we presently have. The downside is slightly more code due to tougher list manipulation during allocation, but the upside (along with other things that I don't feel like mentioning right now) is that contention is reduced in MP environments. So it comes out about even [*], if not better, due to other improvements. Unfortunately, I can't really benchmark any of this, so please don't ask (although feel free to try yourself and/or to refer to the documentation available on the Internet discussing the overall performance advantages of slab allocators).

[*] Yeah, this is a very rough inductive prediction, although it is supported by many factors.

- There is more documentation in kern_mbuf.c itself.

- mbstats is broken for now. Re-implementing statistics is relatively easy, and I left it out of this first version.

- I have been running this code for over a week now on a dual-processor CURRENT machine. It has been running fine, although I have not tried to forcibly exhaust the maps to see how it behaves, as I have not yet adjusted kmem_malloc() to work exactly as it needs to for proper blocking to occur. In case you're wondering exactly what kmem_malloc() needs to be able to do: it just needs to mark a map_starved flag once the map it is allocating from has been exhausted, if that map is mbuf-related (just as it presently does for mb_map). The point is that kmem_malloc() shouldn't block itself if the map is exhausted, because it would never wake up - we never free anything back to the map.

- Free lists in `buckets' (you'll get it if you glance at the code) are implemented not by linking mbufs or clusters in a singly-linked fashion, as is done in the current mbuf code. Rather, a pointer array is maintained (like memcache does). This avoids having the thread freeing an object write to the object (and consequently likely invalidate the producer's cache).
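As a rough sketch of the pointer-array idea (the struct and function names here are invented for illustration, not the ones in mb_slab): freeing pushes the object's address into an array inside the bucket, so the freeing thread never writes into the object's own memory.

```c
#include <assert.h>
#include <stddef.h>

#define BUCKET_NOBJS 8          /* objects per bucket; illustrative */

/* Hypothetical bucket: free objects are tracked by a pointer array
 * rather than by threading a list through the objects themselves. */
struct mb_bucket {
	void	*mb_free[BUCKET_NOBJS];
	int	 mb_nfree;
};

/* Freeing writes only into the bucket, never into *obj, so the
 * producer's cached copy of the object is left alone. */
static void
mb_bucket_free(struct mb_bucket *b, void *obj)
{
	b->mb_free[b->mb_nfree++] = obj;
}

static void *
mb_bucket_alloc(struct mb_bucket *b)
{
	return (b->mb_nfree > 0 ? b->mb_free[--b->mb_nfree] : NULL);
}
```

Allocation pops the most recently freed object, which is also the one most likely to still be warm in a cache somewhere.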
Unfortunately, this turns out to suck very much for m_ext counters, as they are 4 bytes each. Having a pointer array for a page of 4-byte objects consumes another entire page on the x86, for instance: a 4096-byte page holds 1024 counters, and 1024 four-byte pointers fill another 4096 bytes. I am thinking of alternatives for m_ext counters because of this problem. In fact, I am toying with the idea of replacing allocatable m_ext counters with something else - perhaps storing the counter in some area of the cluster or whatever other provided m_ext buffer (sf_buf, etc.).

- You'll note that mb_map is subdivided into three submaps: one for mbufs, one for clusters, one for counters. This is done for several reasons. One is to avoid having cluster allocations consume the portion of the map explicitly reserved for mbufs, thus reducing the total number of allocatable mbufs (because we never free that space back to the map; once we have consumed it for clusters, it stays for clusters). Peter Wemm suggested eliminating mb_map and the submaps and making mbuf allocations directly from mb_map's parent map, kmem_map, with the total limited by a sysctl-tunable variable. Alfred suggested having lazy freeing done back to kmem_map on behalf of the mbuf subsystem, when needed, to avoid having mbufs and clusters completely consume kmem_map. I was toying with implementing this as well but ran into some problems. Notably, in order to take the address of an object being freed and retrieve its corresponding `bucket structure' (again, if you glance at the code, this will make sense), I need to be able to produce a unique index from the address of the page in which the object is found.
This is easy to do in the present implementation of mb_slab, as each type of object (mbuf or cluster) has its own map, so I can safely compute (page_address - base_of_map) / PAGE_SIZE and use that to index into a pointer array holding the address of my bucket (memcache does a similar thing, whereas our implementation of kernel malloc() uses a slightly different hashing method). The obvious solution would be to keep a sparse pointer table accounting for every page in kmem_map, but the resulting table would be a little too large for what it's worth.

Well, that's about all I can think of for the moment. I would appreciate feedback from you folks, both on the strictly technical aspects of this Email and on other suggestions you may have regarding the issues presented above. Unfortunately, my final exam period begins next week, and I will likely be unable to do any real work on this until about two weeks from now. After exams, I would like to finish cleaning up this version of the allocator (without Peter's suggestion and the `lazy freeing' implemented yet) and hopefully commit that first (unless serious objections arise, as usual). Once that's done, I would continue working on the latter two, if we can work out the problems, and on whatever else comes up or requires improvement.

Regards,

-- 
Bosko Milekic
bmilekic@technokratis.com