From: Bosko Milekic <bmilekic@technokratis.com>
To: Terry Lambert
Cc: freebsd-current@FreeBSD.ORG, freebsd-alpha@FreeBSD.ORG
Subject: Re: New SMP Mbuf Allocator (PATCH and TESTING request)
Date: Thu, 14 Jun 2001 13:06:54 -0400
Message-ID: <20010614130654.A23642@technokratis.com>
In-Reply-To: <3B286FC1.5C112A5C@mindspring.com>

Hi Terry,

On Thu, Jun 14, 2001 at 01:03:13AM -0700, Terry Lambert wrote:
> A general comment, and then a comment on the patch:
>
> I would like to see this code be optional, until such time as
> benchmarks have proven it to be better than the existing code, or
> have proven it to be worse. Diking out the existing code, and forcing
> it to be diked out by making unnecessary function name changes, seems
> to me to be a problem.

This is going to be a very difficult thing to do. Hear me out. :-)

The present code was designed without SMP in mind and with no provision for ever freeing pages that were once wired down and put to use by the mbuf subsystem, so that the memory could be reclaimed when it is no longer needed. The new code attempts to offer both, but as with almost anything, sacrifices have to be made. I'm hesitant to accept your "benchmark this" offer because it is difficult to accurately quantify overall performance, while it is easy for a benchmark to detect that the new code is in some ways slower than the old (because of the additional complexity) and to conclude "old allocator > new allocator", which is false given the ignorance of most benchmarks. I'll do my best to outline the pros and cons of each allocation technique and give you an overall picture of the tradeoffs and the gains.

- The old allocator has one lock protecting all the free lists. As the number of CPUs grows, so does contention, and consequently the mbuf allocator can become a significant bottleneck when several threads are servicing net-related requests. The new allocator has a lock per CPU, allowing more than one CPU to allocate mbufs and clusters at any one time and splitting the contention in most cases. Additionally, on an SMP system the old allocator will likely have the lock (and list structure, etc.) ping-ponging from data cache to data cache, whereas the new allocator, because of its separate lists and locks, minimizes this effect (see the sketch following these points).
- The old allocator fragments memory and can never reclaim it; it cannot even easily determine whether all the objects in a given page are free. So if we ever decide that it is time to have the allocator free some pages back to the map, we will not be able to implement that easily (if at all). The new allocator has the framework for this sort of thing, so if users continue requesting (as many already have) that FreeBSD free the resources it allocates for mbufs and clusters, we should be able to implement freeing from a kproc relatively easily, and do it only when, say, X bytes are in the pool but only Y bytes are in use (where Y <<<< X). (Why from a kproc? Because freeing back to the map is relatively expensive for a thread that merely wants to free an mbuf or cluster - this I *have* determined from previous profiling; the results are probably archived somewhere on the lists.)

- The old allocator's macros MGET, MCLGET, MFREE, etc. could get away with making an allocation without performing a single function call; a function call was made only by callers using the function equivalents m_get(), m_gethdr(), etc. The new allocator performs one function call in the common allocation case. I don't foresee this being a real problem - if it is any argument at all, NetBSD and OpenBSD have both done this (one function call per allocation) for quite a long time, and in fact so did we before the present allocator was introduced to replace calls to malloc(). As a result, the new allocator shrinks those macros to effectively nothing in terms of size, improving overall performance (less code cache pollution), whereas the old allocator has, needless to say, pretty large macros.

- The common case of allocation and freeing in the new allocator is larger (although realistically, not significantly so) than in the old allocator. This is due to the added complexity of the new allocator and is expected. The increase in size of the common case is basically from "insignificant" to "insignificant + insignificant", particularly given how quickly CPU speed is increasing. Keep in mind that although the common case may take longer to execute in the new allocator, the reduced contention in that same common case may make it faster overall.

- The old allocator keeps all mbufs and clusters mixed in the same heap and has no concept of "producer" and "consumer": one thread may allocate an mbuf and write to it while another thread only frees it, and if the two threads are on different CPUs, the freeing thread will invalidate the allocating CPU's data cache entries for that mbuf, because freeing writes to the mbuf. In the new allocator, mbufs are not written to when they are freed and are always returned to the list from which they were obtained, so cache invalidation is once again minimized.

In conclusion, it is obvious that for strictly UP systems the present allocator may turn out to be faster. But if you look at it that way, the mbuf allocator from _BEFORE_ the whole SMPng effort started was EVEN better for UP systems (no lock code, etc.). On the other hand, the elimination of huge macros may improve overall performance even for UP systems. In the SMP case, the new allocator scales much, much better. Both of these statements follow from the points/comparisons above.
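To make the per-CPU layout described in these points a bit more concrete, here is a minimal sketch of the idea. The structure, field, and function names (mb_pcpu_list, mb_head, mb_alloc_sketch()) are made up for illustration and are not the ones used in the patch; the usual kernel headers (sys/mutex.h, sys/mbuf.h, etc.) are assumed:

    /*
     * Hypothetical per-CPU container: one lock and one free list per CPU.
     * Objects taken from a CPU's list are always returned to that same
     * list, which is what keeps the cache ping-ponging to a minimum.
     */
    struct mb_pcpu_list {
            struct mtx       mb_lock;   /* protects only this CPU's list */
            struct mbuf     *mb_head;   /* head of this CPU's free list */
            u_int            mb_count;  /* objects currently on the list */
    };

    static struct mb_pcpu_list mb_lists[MAXCPU];  /* sized at compile time */

    /* Common-case allocation: one function call, one (per-CPU) lock. */
    static struct mbuf *
    mb_alloc_sketch(void)
    {
            struct mb_pcpu_list *pcl = &mb_lists[PCPU_GET(cpuid)];
            struct mbuf *m;

            mtx_lock(&pcl->mb_lock);
            m = pcl->mb_head;
            if (m != NULL) {
                    pcl->mb_head = m->m_next;
                    pcl->mb_count--;
            }
            mtx_unlock(&pcl->mb_lock);
            return (m);  /* NULL: caller falls back to the general pool */
    }

In the common case each CPU touches only its own lock and list head, which is why contention and cross-CPU cache invalidation drop compared to a single global lock.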
I hope this makes it clear that, depending on the benchmark, we may see completely different results. Keep in mind, though, that the new allocator also offers the possibility of freeing back to the map, a significant advantage over the old one, especially for servers under exceptionally high network loads. (For example, an IRC server may consume a lot of memory for mbufs and clusters at one point and then drop back to normal amounts later. The fact that the pages are never freed and remain wired down means that the system may start swapping simply because of wastage on the mbuf and cluster free lists.)

> Overall, I like the idea of moving towards a Dynix allocator type
> model to support many processors (not just 4) well, but I'm leery of
> many of the changes, particularly in light of the claim that "It has
> been established that we do not benefit from different locks for
> different objects, so we use the same lock, regardless of object
> type".

Well, note that not long ago there were three different locks: one for mbufs, one for clusters, and one for counters. With three separate locks, the fact that callers often do "get mbuf -> get counter -> get cluster" meant that one CPU would do "cache first lock -> cache second lock -> cache third lock", and another CPU would then have to invalidate the first CPU's cache three times (as opposed to only once with a single lock). As a result, ping-ponging from CPU to CPU often occurred. With one lock, this is minimized. In the new allocator, having separate locks is not really useful since we have one lock per CPU anyway, so that comment is not really an attempt at optimization, just a statement of the obvious.

> On to the specific comments... I have addressed them in patch order,
> so that they can be read side-by-side:
>
> --
>
> The "npg" calculation has always been incorrect for large memory
> systems with non-sparse physical maps, and now it is much worse. I
> commented on this before in -arch.
>
> Consider the case of a machine with 4G of physical memory.

Can you please clarify?

> --
>
> Is the policy two line or 4 line license, these days?
>
> --

I don't know. How is this important?

> The MBALLOC_CPU stuff should be more dynamic; it appears that the
> real number of CPUs is known at the time the allocations take place.
>
> --

Yes, but some structures (notably the statistics and the list manager structures) are statically allocated (the stats are exported via sysctl and need this), so we need to know how many CPUs there are before compilation. It doesn't really matter if we don't, though, because then we just end up allocating slightly more than we need (we use MAXCPU in the default case) and get on with it. Note that this does *not* mean we also end up setting up MAXCPU CPU lists; that setup is done at runtime, when we know how many CPUs we really have.

> I think you should spin, and only block if you don't get the spin
> lock after several attempts. The Solaris tradeoff in this regard is
> 10 attempts. Direct blocking on the cv seems a little wasteful.

In the common case, I shouldn't have to block at all. These are per-CPU locks, and blocking only happens when a different CPU is freeing an mbuf to the first CPU's list. I don't think I'm doing anything wrong by using the standard locking primitives provided by the OS. Spinlocks disable local interrupts, and that seems more wasteful than blocking. I could be wrong, but these things are easy to change.
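For reference, the spin-then-block tradeoff Terry describes would look roughly like the sketch below, using the standard mtx_trylock()/mtx_lock() primitives. This only illustrates the suggestion, not what the patch currently does; the attempt count and the mb_lock_container() name are invented here:

    /*
     * Try the per-CPU container lock a few times before blocking on it,
     * along the lines of the Solaris-style tradeoff mentioned above.
     */
    #define MB_SPIN_ATTEMPTS        10

    static void
    mb_lock_container(struct mtx *cnt_lock)
    {
            int i;

            for (i = 0; i < MB_SPIN_ATTEMPTS; i++) {
                    if (mtx_trylock(cnt_lock))
                            return;         /* got the lock without blocking */
            }
            mtx_lock(cnt_lock);             /* give up spinning and block */
    }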
> I'm not sure why you mutex the per-CPU containers? Aren't they, by
> definition, per-CPU, and therefore immune from contention?

I mutex them for two reasons:

1- A thread on another CPU may be freeing to the first CPU's list, if the mbuf it is freeing originated from the first CPU (this is a characteristic of the allocator model).

2- Preemption.

> --
>
> You realize that there is a dependency on MAXFILES, right?
>
> Specifically, there is an mbuf allocated per open socket for the
> tcptmpl (even though it only needs 60 bytes).
>
> ---
>
> If you fix the tcptmpl allocation, a couple of additional issues
> arise:
>
> 1) Since the allocations are in page-units, you will end up with a
>    lot of waste.
>
> 2) Allocation in page units will make it hard to give 60 byte
>    allocations back to the system.
>
> 3) The allocator is not really general enough for this (or for things
>    like TIME_WAIT zombie structs, etc.).
>
> 4) Yeah, this sort of also implies that mbuf cluster headers come
>    from a separate pool, instead of being 256 bytes as well...

Huh? I haven't changed anything in terms of the sizes of allocations. The allocator only allocates mbufs and clusters, which are of fixed size precisely to ease allocating the same object in page units (it is much faster and simpler). I realize that some wastage occurs, but that has been a fact of BSD forever; the new allocator doesn't make it worse.

> --
>
> Why do you lock around things in mb_init()? It's never called twice,
> let alone reentrantly... the code will be done before an AP is
> allowed to run. Alpha has this same restriction, from what I can see
> from the code.
>
> --

Simply because mb_pop_cont() expects it, and because it's the consistent thing to do with respect to the internal interface.

> Dropping the lock in mb_pop_cont() seems wrong; I guess you are
> avoiding sleeping with the lock held?
>
> I don't think you need to worry (per-CPU, again), unless you later
> attempt lazy repopulation; it seems to me that the lock that should
> be held is a global lock, to prevent reentrancy on the global memory
> pool, and the malloc() code does that.
>
> Perhaps I'm missing something here.

Yeah. There have been some discussions as to why this is presently necessary. In the future, it may not be required. But, basically, right now we can have Giant -> mbuf_related_lock and, if I don't drop the lock, also mbuf_related_lock -> Giant, which would be a lock order reversal.

> --
>
> The ability to specify a "how" that implies an unwillingness to block
> implies interrupt allocation; yet you can't support this, since you
> do not preallocate a kernel map (ala the zone allocator). The zone
> allocator is actually very misleading, in that some of its code never
> has the values it implies that it might have; in particular, there
> are functions which are passed parameters which can't really be
> variant, as implied by the use of per zone variables as arguments.

Although I don't quite understand everything you mention in this last paragraph, the `how' parameter implies more than just a willingness to block in the new allocator. It also determines whether or not, during starvation, the system will call the protocol drain routines and whether or not it will attempt to "steal" from other CPUs' lists (roughly as in the sketch below). Feel free to clarify, though, if it's still a concern.
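As a rough outline of what `how' controls during starvation, consider the following sketch. The helper names (mb_alloc_from_local_list(), mb_steal_from_other_lists(), mb_drain_protocols()) are invented for illustration and do not match the patch's internals; M_DONTWAIT and mb_alloc_wait() are the only names here that actually appear in this discussion:

    /*
     * Starvation handling keyed off `how'.  If the caller is unwilling
     * to wait (M_DONTWAIT), fail right away; otherwise raid the other
     * CPUs' lists, ask the protocols to drain, and finally sleep until
     * something is freed.
     */
    struct mbuf *
    mb_alloc_outline(int how)
    {
            struct mbuf *m;

            m = mb_alloc_from_local_list();     /* common case, per-CPU list */
            if (m != NULL || how == M_DONTWAIT)
                    return (m);                 /* may be NULL for M_DONTWAIT */

            m = mb_steal_from_other_lists();    /* "steal" from other CPUs */
            if (m == NULL) {
                    mb_drain_protocols();       /* protocol drain routines */
                    m = mb_alloc_wait();        /* block for a freed object */
            }
            return (m);
    }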
> Ah. I see why all the locks: mb_alloc_wait attempts to be a primitive
> reclaimer. You would do much better with a per-CPU high-watermark
> reclaim at free(), I think, instead of doing this. Really, you are in
> a starvation condition because of your load, and not because someone
> is "hogging already free mbufs", I think. This is probably premature
> optimization.
>
> --
>
> In the starvation case during a free, I don't think you really care,
> except perhaps to adjust the high/low watermarks.
>
> An interesting issue here is that the average TCP window size will
> end up being 16k, which comes out to 4 1-page buckets (2, on Alpha),
> where you end up with multiple buckets and many mbufs (64) per, so
> freeing isn't really an option, if you are sending lots of data
> (serving content). It would be different for a desktop system, since
> the receive interrupt might come to any idle CPU.
>
> --
>
> I like the cleanup via the use of local macros.

Thanks. So do I. :-)

> --
>
> I think the m_getclr -> m_get_clrd change is gratuitous.

m_getclr() isn't used in many places, so it's not that big a deal. The reason I changed it is that it's confusing: m_getclr() sounds like "m_get a cluster."

> --
>
> I like the "only" comment clarifications.
>
> --
>
> I dislike the bloating of one line comments into three lines, with
> the first and last blank.

It happens once or twice, as far as I can see (I could be missing some). It's to maintain consistency; the one-line comment probably should have been a three-line comment to begin with.

> --
>
> You could resolve the statistics problems by keeping them on a
> per-CPU basis, and then aggregating them under a general system lock,
> when necessary (when they were requested).
>
> --
>
> Most of the header definition changes for function declarations are
> gratuitous.

It's a question of alignment and of fixing style: to properly align the function declarations as per style(9), I'm allowed to insert a space for functions that do not return pointers.

> --
>
> Don't really care about netstat stuff...
>
> --
>
> -- Terry

--
Bosko Milekic
bmilekic@technokratis.com

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message