Date: Tue, 2 Jul 2002 13:01:35 -0400
From: Bosko Milekic <bmilekic@unixdaemons.com>
To: Luigi Rizzo <rizzo@icir.org>
Cc: net@FreeBSD.ORG
Subject: Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
Message-ID: <20020702130135.A17370@unixdaemons.com>
In-Reply-To: <20020702085640.B5854@iguana.icir.org>; from rizzo@icir.org on Tue, Jul 02, 2002 at 08:56:40AM -0700
References: <20020702001050.B2250@iguana.icir.org> <20020702080423.A69693@unixdaemons.com> <20020702090354.A78632@unixdaemons.com> <20020702085640.B5854@iguana.icir.org>
On Tue, Jul 02, 2002 at 08:56:40AM -0700, Luigi Rizzo wrote:
>
> Hi,
> you can read the numbers I posted as you like, but in the non-SMP
> case performance is not "seemingly" better, it IS better.

  The reason I said "seemingly" is that I don't think it is your
"grouping" of clusters and mbufs that specifically causes the
performance increase to the extent you showed with your tests.  The
code you posted also avoids an extra malloc() for the cluster's
reference counter.  Initially, the plan was to remove that malloc()
entirely and store the ref. count at the end of the cluster.

  Briefly: I agree that the interface is useful (I already mentioned
this, didn't I?), but I don't think that holding mbufs and clusters
tied together on yet another list is the right approach to take at
this point.  You mention that it's all OK under Giant, but I'd like
to point out that making these sorts of "OK under Giant" assumptions
when introducing new interfaces is taking us a step BACKWARDS,
especially in light of the lockdown work that's being done.  I
propose the following compromise, which I hope you will agree with:

  1) Remove the malloc() for external ref. count allocations;
     instead, store the ref. count at the end of the cluster, as I
     did when I took the original mb_alloc performance measurements.
     This should make allocations and frees visibly faster (I
     measured it: http://people.freebsd.org/~bmilekic/mb_alloc/results.html);

  2) Introduce an interface that will allocate and free:
       (i)  an mbuf with a cluster attached to it;
       (ii) an M_PKTHDR mbuf with a cluster attached to it.
     This interface would wrap the appropriate alloc/free routines,
     but it would take care not to drop the allocation lock between
     the two allocations.  I don't expect this to be too difficult
     to do.  (There's a rough sketch of what I mean further down.)

  I have a little bit of free time now while waiting for KSE III to
settle before taking up lightweight interrupts again, so I'm prepared
to do both of these.

> Before people think differently:
>
>   the code i posted is _perfectly safe_ to use under the assumption
>   that I used -- i.e. that you hold Giant when manipulating the
>   free list -- on both non-SMP (where I tested it, and where it
>   does improve performance) and SMP systems (where I have no idea
>   about performance, but I might have some numbers by tomorrow).

  It is not safe if you use it too often.  The allocator was designed
to allocate; it HAS caches; it doesn't need everyone under the sun to
start keeping their own caches on top of that.

> The above assumption holds in the device drivers i patched, and
> also (at the time of this writing, and I suspect for quite a bit,
> although I and hopefully others are working on this) in all of the
> network stack.
> And of course you can #ifdef the code out for the SMP case if you
> don't trust it, so you won't pessimize the SMP case.
>
> Also:
>
>   I am not violating _any_ interface.
>   I am just holding apart a small set of buffers for future use,
>   which is exactly what each and every network device driver in
>   FreeBSD does when it links them into the receive ring -- there is
>   no guarantee when/if they will be freed, and certainly none of
>   them is freed even if the system runs out of mbufs (unless the
>   network interface is brought down, which I doubt happens).
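  Just to make (2) above concrete, here is a rough, userland-only
sketch of the idea: one lock/unlock pair covers both the mbuf and the
cluster grab, and the cluster's ref. count lives in the cluster's own
last bytes instead of in a malloc()ed counter.  All of the names
(sk_mbuf, sk_getcl(), pool_lock, ...) are made up for illustration --
this is not mb_alloc code, and the real thing would use the per-CPU
caches and the kernel locking primitives:

/*
 * Sketch only -- NOT mb_alloc code.  It shows two things:
 *   (1) the cluster's ref. count stored in the cluster's last bytes,
 *       so no separate malloc() is needed for it, and
 *   (2) an mbuf + cluster handed out under ONE lock/unlock pair.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_CLBYTES	2048				/* cluster size */
#define	SK_CLDATA	(SK_CLBYTES - sizeof(uint32_t))	/* usable bytes; refcnt at the end */

struct sk_mbuf {
	struct sk_mbuf	*sk_next;	/* free-list linkage */
	char		*sk_cldata;	/* attached cluster, or NULL */
	uint32_t	*sk_clref;	/* points into the cluster's tail */
};

static pthread_mutex_t	 pool_lock = PTHREAD_MUTEX_INITIALIZER;
static struct sk_mbuf	*mbuf_free;		/* cached free mbufs */
static char		*clust_free[128];	/* cached free clusters */
static int		 clust_free_cnt;

/* Allocate an mbuf with a cluster attached, under a single lock acquisition. */
struct sk_mbuf *
sk_getcl(void)
{
	struct sk_mbuf *m;
	char *cl;

	pthread_mutex_lock(&pool_lock);
	if (mbuf_free == NULL || clust_free_cnt == 0) {
		pthread_mutex_unlock(&pool_lock);
		return (NULL);			/* caller decides whether to wait */
	}
	m = mbuf_free;
	mbuf_free = m->sk_next;
	cl = clust_free[--clust_free_cnt];
	pthread_mutex_unlock(&pool_lock);	/* the lock is dropped exactly once */

	m->sk_next = NULL;
	m->sk_cldata = cl;
	/* The reference count sits in the last bytes of the cluster itself. */
	m->sk_clref = (uint32_t *)(void *)(cl + SK_CLDATA);
	*m->sk_clref = 1;
	return (m);
}

  The only point here is the locking pattern: the wrapper grabs both
objects while holding the lock once, instead of keeping a third,
combined free list around.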
>
> This said, a detailed answer to your points:
>
> #1 I think your comment on the sleepers and mbuf exhaustion is
>    not correct.
>    I am only talking about holding apart a small set of
>    buffers+clusters, on the order of ~50 blocks, and you can
>    restrict the consumers of the free pool to those who do not want
>    to sleep, so there is no need to special-case the code for the
>    sleepers.
>    Or, if you really want to special-case it, the first sleeper
>    which finds the free pool empty and needs to go to sleep will
>    reset max_pool_size to 0, forcing the same behaviour we have now,
>    with correct handling of wakeups and such.

  Here's what happens: consumers A and B each keep their own "pools"
of mbufs+clusters; consumer C does not.  At some point A and B start
allocating a little from their own pools.  C comes in and allocates,
and eventually we run out of stuff to allocate, so C blocks waiting
for something to be freed.  However, if A and B free to their local
pools instead of through the allocator, C never gets woken up.

  To make matters worse, here's a common scenario: A keeps his own
pool.  A allocates and allocates and allocates from the local pool.
C comes in and allocates "normally" through the allocator.  C cannot
find a cluster and is willing to block, so it does.  A frees.  No
wakeup.  A frees.  No wakeup.  A frees.  No wakeup.  ... maybe D
frees and generates a wakeup, maybe not.  (There's a small sketch of
this missed-wakeup problem further below.)

> #2 given that all I am asking for is defining an interface for this
>    type of optimization, when the time comes that Giant goes, and
>    assuming I am not retired yet, it will still change nothing for
>    the non-SMP case, and for the SMP case we can run our performance
>    tests and if it proves to be useful we can put in whatever
>    is necessary.

  I already told you that I think the interface is OK; I don't think
that keeping yet another free list for this sort of thing is OK.  The
mbufs and clusters will be on their respective free lists anyway, so
there really isn't any point in keeping yet another list around, if
you read the compromise above and assume that the malloc() will
disappear from the common cluster-allocation case.

> #3 easy to fix (assuming this is really a problem) -- just have the
>    free pool organized as a list of pointers to the mbufs, rather
>    than using the m_nextpkt field as the link field.  Because the
>    free list is short and fixed-size, you can even use an array to
>    hold it.

  This is what the existing cache lists do anyway.  Again, there is
no point in doing this AGAIN.

> #4 same answer as #2: if/when... we can easily fix things in just
>    one place.  Until then, we are just pessimizing things for no
>    good reason.
>
> In terms of API, since we both agree that we need one (and we do
> need one, otherwise driver writers will just go ahead and do their
> own, replicating work in each driver, as I am sure I^hthey will
> start doing now), may I suggest that we export two -- one where the
> caller is assumed to already hold a lock on the CPU it is running
> on, and one which also does the necessary locking upfront and calls
> the former afterwards?  This way the former will be usable right
> now and efficiently from all contexts which run under Giant, and
> will be easy to upgrade when/if we need it.
>
> And, I am sorry to say... you raise some valid points, but without
> performance numbers to justify them, it is hard to tell how relevant
> they are in practice.
>
> I have made too many mistakes of this kind myself to believe my
> intuition.
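  For what it's worth, here is the missed-wakeup problem from #1 in
miniature, again as a userland-only sketch with invented names
(alloc_wait_for_buffer(), private_free_buffer(), ...) and pthreads
standing in for the kernel's sleep/wakeup -- it is not the actual
allocator code:

/*
 * Sketch only -- how a private free pool starves a sleeping consumer.
 */
#include <pthread.h>
#include <stddef.h>

#define	POOL_MAX	32

static pthread_mutex_t	alloc_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t	alloc_wait = PTHREAD_COND_INITIALIZER;
static void	*freelist[POOL_MAX];	/* the allocator's own cache */
static int	 freelist_cnt;

/* Consumer C: allocate, sleeping until something is freed TO THE ALLOCATOR. */
void *
alloc_wait_for_buffer(void)
{
	void *p;

	pthread_mutex_lock(&alloc_lock);
	while (freelist_cnt == 0)
		pthread_cond_wait(&alloc_wait, &alloc_lock);	/* C sleeps here */
	p = freelist[--freelist_cnt];
	pthread_mutex_unlock(&alloc_lock);
	return (p);
}

/* The allocator's free path: gives the buffer back AND wakes a sleeper. */
void
alloc_free_buffer(void *p)
{
	pthread_mutex_lock(&alloc_lock);
	if (freelist_cnt < POOL_MAX)		/* overflow handling omitted */
		freelist[freelist_cnt++] = p;
	pthread_cond_signal(&alloc_wait);	/* this is C's wakeup */
	pthread_mutex_unlock(&alloc_lock);
}

/*
 * Consumer A's "free": the buffer goes into A's private pool instead.
 * No allocator lock, no signal -- so C stays asleep even though a
 * buffer is, in effect, free again.
 */
static void	*private_pool[POOL_MAX];
static int	 private_cnt;

void
private_free_buffer(void *p)
{
	if (private_cnt < POOL_MAX)
		private_pool[private_cnt++] = p;	/* no wakeup generated */
	else
		alloc_free_buffer(p);	/* only the overflow reaches the allocator */
}

  Only the overflow path ever reaches alloc_free_buffer(), so the
thread sleeping in alloc_wait_for_buffer() gets its wakeup only if
someone who does not keep a private pool happens to free -- the
"maybe D frees and generates a wakeup, maybe not" case.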
>
> cheers
> luigi
>
> On Tue, Jul 02, 2002 at 09:03:54AM -0400, Bosko Milekic wrote:
>
> > >   There are several problems with your "simple" code that make it
> > > simpler than what's in -CURRENT and in -STABLE and therefore yield
> > > [obviously] seemingly better times.
> > >
> > > 1) sleeping in the case of a system exhausted of mbufs and clusters
> > >    is now broken, because code that frees to this "combined" list
> > >    will not generate wakeups; even if you build this functionality
> > >    in, you would have to code the wakeup for a "special" case where
> > >    one obtains both a cluster and an mbuf; it'll get complicated.
> > >
> > > 2) no per-CPU caches: you bring contention (in -CURRENT) from
> > >    per-CPU back to global, which you probably don't care about now,
> > >    but which will matter when Giant goes completely.
> > >
> > > 3) you violate the write/read producer/consumer model.  In -CURRENT,
> > >    we try not to have the code doing the freeing write to any part
> > >    of the mbuf, thus hopefully avoiding cache invalidation from
> > >    occurring during a free that happens from a thread that doesn't
> > >    usually write to the mbuf at all.
> >
> >   Forgot, also, in -CURRENT:
> >
> > 4) breaks the framework in mb_alloc that groups pages of mbufs and
> >    clusters together, allowing them to be freed in the future if/when
> >    we decide to implement lazy freeing; it would do the same thing if
> >    we moved mbuf allocations to UMA, because grouping mbufs and
> >    clusters together on a separate list would violate that scheme.
> >
> > >   I think that because of (1) and (2), especially, and because the
> > > kernel in -CURRENT is the way it is right now, you're getting better
> > > performance numbers.  Try exhausting the system and see how far you
> > > get.  I suspect you'll eventually just hang (because you'll allocate
> > > a bunch of stuff violating the real interface and then won't be able
> > > to wake up and Do The Right Thing).
> > >
> > >   While, as I said before, I agree that we should have an interface
> > > that allocates both an mbuf and a cluster, I don't think that this
> > > is the right approach.  I believe that the correct approach would be
> > > to couple the mbuf and cluster allocation under a single per-CPU
> > > lock/unlock operation, which isn't too difficult to do.

-- 
Bosko Milekic
bmilekic@unixdaemons.com
bmilekic@FreeBSD.org