Date: Tue, 2 Jul 2002 08:56:40 -0700
From: Luigi Rizzo <rizzo@icir.org>
To: Bosko Milekic <bmilekic@unixdaemons.com>
Cc: net@FreeBSD.ORG
Subject: Re: Mbuf allocator performance (was Should we keep a cache of mbuf+cluster ready for use ?)
Message-ID: <20020702085640.B5854@iguana.icir.org>
In-Reply-To: <20020702090354.A78632@unixdaemons.com>; from bmilekic@unixdaemons.com on Tue, Jul 02, 2002 at 09:03:54AM -0400
References: <20020702001050.B2250@iguana.icir.org> <20020702080423.A69693@unixdaemons.com> <20020702090354.A78632@unixdaemons.com>
Hi,

you can read the numbers I posted as you like, but in the non-SMP case performance is not "seemingly" better, it IS better.

Before people think differently: the code I posted is _perfectly safe_ to use under the assumption that I stated -- i.e. that you hold Giant when manipulating the free list -- on both non-SMP systems (where I tested it, and where it does improve performance) and SMP systems (where I have no idea about performance, but I might have some numbers by tomorrow). That assumption holds in the device drivers I patched, and also (at the time of this writing, and I suspect for quite a bit longer, although I and hopefully others are working on this) in all of the network stack. And of course you can #ifdef the code out for the SMP case if you don't trust it, so you won't pessimize the SMP case.

Also: I am not violating _any_ interface. I am just holding apart a small set of buffers for future use, which is exactly what each and every network device driver in FreeBSD does when it links them into the receive ring -- no guarantee when/if they will be freed, and certainly none of them is freed even if the system runs out of mbufs (unless the network interface is brought down, which I doubt happens).

This said, a detailed answer to your points:

#1 I think your comment on the sleepers and mbuf exhaustion is not correct. I am only talking of holding apart a small set of buffers+clusters, on the order of ~50 blocks, and you can restrict the consumers of the free pool to those who do not want to sleep, so there is no need to special-case the code for the sleepers. Or, if you really want to special-case it, the first sleeper which finds the free pool empty and needs to go to sleep can reset max_pool_size to 0, forcing the same behaviour we have now, with correct handling of wakeup and such.
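For concreteness, the sort of pool I have in mind for #1 can be modelled in user space roughly as follows (an illustrative sketch, not the actual patch: the struct and the names mc_pair/mc_alloc/mc_free are invented, and malloc() stands in for the real two-step mbuf+cluster allocation):

```c
#include <stdlib.h>

#define MC_POOL_MAX 50          /* hold apart ~50 blocks, as discussed */

/* Stand-in for an mbuf with an attached 2KB cluster. */
struct mc_pair {
    struct mc_pair *next;       /* link while the block sits in the pool */
    char *cluster;
};

static struct mc_pair *mc_pool; /* protected by Giant in the real code */
static int mc_pool_len;

/*
 * Fast path for non-sleeping consumers only, so there is no need to
 * special-case sleepers or generate wakeups from this pool.
 */
struct mc_pair *
mc_alloc(void)
{
    struct mc_pair *m = mc_pool;
    if (m != NULL) {
        mc_pool = m->next;
        mc_pool_len--;
        return m;
    }
    /* Pool empty: fall back to the regular two-step allocation. */
    m = malloc(sizeof(*m));
    if (m != NULL)
        m->cluster = malloc(2048);
    return m;
}

void
mc_free(struct mc_pair *m)
{
    if (mc_pool_len < MC_POOL_MAX) {    /* pool is strictly bounded */
        m->next = mc_pool;
        mc_pool = m;
        mc_pool_len++;
    } else {
        free(m->cluster);
        free(m);
    }
}
```

Since the pool is bounded at MC_POOL_MAX, at most ~50 blocks are ever withheld from the system -- no more than a driver's receive ring keeps.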
#2 Given that all I am asking for is defining an interface for this type of optimization: when the time comes that Giant goes, and assuming I am not retired yet, it will still change nothing for the non-SMP case, and for the SMP case we can run our performance tests and, if it proves useful, put in whatever is necessary.

#3 Easy to fix (assuming this is really a problem) -- just organize the free pool as a list of pointers to the mbufs, rather than using the m_nextpkt field as the link field. Because the free list is short and of fixed size, you can even use an array to hold it.

#4 Same answer as #2: if/when... we can easily fix things in just one place. Until then, we are just pessimizing things for no good reason.

In terms of API, since we both agree that we need one (and we do need one, otherwise driver writers will just go ahead and do their own, replicating work in each driver, as I am sure I^hthey will start doing now), may I suggest that we export two: one where the caller is assumed to already hold a lock on the CPU it is running on, and one which also does the necessary locking upfront and then calls the former? This way the former will be usable right now, and efficiently, from all contexts which run under Giant, and easy to upgrade when/if we need it.

And, I am sorry to say... you raise some valid points, but without performance numbers to justify them it is hard to tell how relevant they are in practice. I have made too many mistakes of this kind myself to believe my intuition.

	cheers
	luigi

On Tue, Jul 02, 2002 at 09:03:54AM -0400, Bosko Milekic wrote:
> > There are several problems with your "simple" code that make it simpler
> > than what's in -CURRENT and in -STABLE and therefore yield [obviously]
> > seemingly better times.
> >
> > 1) sleeping in the case of a system exhausted of mbufs and clusters is
> > now broken because code that frees to this "combined" list will not
> > generate wakeups; even if you build this functionality in, you would
> > have to code the wakeup for a "special" case where one obtains both a
> > cluster and an mbuf; it'll get complicated.
> >
> > 2) no per-CPU caches; you bring contention (in -CURRENT) from per-CPU
> > back to global, which you probably don't care about now, but which will
> > matter when Giant goes completely.
> >
> > 3) you violate the write/read producer/consumer model. In -CURRENT, we try
> > to not have the code doing the freeing write to any part of the mbuf,
> > thus hopefully avoiding cache invalidation from occurring during a free
> > that happens from a thread that doesn't usually write to the mbuf at
> > all.
>
>   Forgot, also, in -CURRENT:
>
>   4) breaks the framework in mb_alloc that groups pages of mbufs and
>   clusters together, allowing them to be freed in the future if/when we
>   decide to implement lazy freeing; it would do the same thing if we
>   moved mbuf allocations to UMA, because grouping mbufs and clusters
>   together on a separate list will violate that scheme.
>
> > I think that because of (1) and (2), especially, and because the kernel
> > in -CURRENT is the way it is right now, you're getting better
> > performance numbers. Try exhausting the system and seeing how far you
> > get. I suspect you'll eventually just hang (because you'll allocate a
> > bunch of stuff violating the real interface and then won't be able to
> > wake up and Do The Right Thing).
> >
> > While, as I said before, I agree that we should have an interface that
> > allocates both an mbuf and a cluster, I don't think that this is the
> > right approach. I believe that the correct approach would be to couple
> > the mbuf and cluster allocation under a single per-CPU lock/unlock
> > operation, which isn't too difficult to do.
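The two-entry-point API proposed in the message above (one variant assuming the caller already holds the lock, one that takes the lock and calls the first) might look like this rough user-space model; a pthread mutex stands in for Giant, the names m_getcl_locked()/m_getcl()/m_freecl_locked()/m_freecl() are invented for illustration, and the pool is an array of pointers (per the fix suggested for #3) so the freeing side never writes into the cached buffers:

```c
#include <pthread.h>
#include <stdlib.h>

#define POOL_MAX 50                     /* small, fixed-size pool */

struct mbuf_sketch { char *cluster; };  /* stand-in for mbuf + 2KB cluster */

static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER; /* stands in for Giant */

/* Array of pointers: caching a buffer never touches the buffer itself,
 * so the freeing code does no writes to the cached mbufs. */
static struct mbuf_sketch *pool[POOL_MAX];
static int pool_len;

/* Variant 1: caller already holds the lock (e.g. runs under Giant). */
struct mbuf_sketch *
m_getcl_locked(void)
{
    if (pool_len > 0)
        return pool[--pool_len];        /* fast path: served from the pool */
    struct mbuf_sketch *m = malloc(sizeof(*m));
    if (m != NULL)
        m->cluster = malloc(2048);      /* fall back to the general allocator */
    return m;
}

/* Variant 2: does the locking upfront, then calls the former. */
struct mbuf_sketch *
m_getcl(void)
{
    pthread_mutex_lock(&giant);
    struct mbuf_sketch *m = m_getcl_locked();
    pthread_mutex_unlock(&giant);
    return m;
}

void
m_freecl_locked(struct mbuf_sketch *m)
{
    if (pool_len < POOL_MAX)
        pool[pool_len++] = m;           /* cache it; *m itself is untouched */
    else {
        free(m->cluster);
        free(m);
    }
}

void
m_freecl(struct mbuf_sketch *m)
{
    pthread_mutex_lock(&giant);
    m_freecl_locked(m);
    pthread_mutex_unlock(&giant);
}
```

Note that both the mbuf and the cluster are obtained under a single lock/unlock pair, which is also what the quoted message argues the correct per-CPU approach should do.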
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-net" in the body of the message