Date: Mon, 30 Oct 2000 19:19:19 -0500 (EST)
From: Bosko Milekic <bmilekic@dsuper.net>
To: Mike Smith <msmith@mass.osd.bsdi.com>
Cc: freebsd-net@FreeBSD.ORG
Subject: Re: MP: per-CPU mbuf allocation lists
Message-ID: <Pine.BSF.4.21.0010301822100.31430-100000@jehovah.technokratis.com>
In-Reply-To: <200010302240.e9UMeWF18172@mass.osd.bsdi.com>
I'm going to attempt to respond to all, if not most, of the feedback I've
received on this in this Email, because Mike's Email seems to touch on a
little bit of everything.

On Mon, 30 Oct 2000, Mike Smith wrote:

> Have you by any chance done any profiling to determine whether contention
> for the free mbuf list is actually a performance issue, or is this just
> one of those "hey, this would be cool" design decisions?

First of all, if you're referring to the concept of having per-CPU lists,
it follows from (and is justified by): http://www.hoard.org/
Of course, we're not looking at that sort of general allocator, but the
concept is very similar, and the performance gain is similarly justified.
However, in order to really see a performance gain, several other things
need to be "threaded" as well. I certainly wouldn't waste the little free
time I already have on something like this just because I thought it was
"cool," or rather, if I did, I'm sure that I wouldn't be so selfish as to
propose it to everyone and waste their time (including yours) as well.

I hope this clears up most of the "what's the performance gain?" or "why
would you do this?" issues. If not, I respectfully urge people to respond
to me privately on this and I'll do my best to answer.

[...]

> Starvation of the general list should result in the general list being
> populated, not the fast list. In this case, the general list will remain
> depleted until mbufs are freed, which will blow mb_map out faster than is
> otherwise desirable.

You're right. I've changed this in the revised design, presented below...

> This is a half-hearted "donation" algorithm. It might make sense to
> waste some cycles in the "sleeping" case to lock other cpu's "fast" lists
> and steal mbufs from them...

"Stealing" has been considered but, unfortunately, it introduces some
hysterics into the system which I doubt certain people would go for. The
idea will be to prevent the need for that situation while optimizing
allocation speed. The revised plan below attempts to ensure that the high
and low watermarks are respected and, if they are set properly, there will
not be any problems unless something else is wrong to begin with.

> The watermark-lowering operation is likely to be very infrequent. As
> such, it would hardly hurt for it to scan each of the "fast" lists and
> steal excess mbufs back into the global pool.

I agree, which is why I had initially proposed that it be done from a
kproc... but the low/high watermark system eliminates the need for
repopulating the general list from a kproc... see below.

Here is the revised plan:

- Free lists: each CPU has a set of two lists, and that set is called a
  "fast" list. Not much has changed except that there are two; the first
  will be referred to as "F1" and the second as "F2."

- The general mmbfree list: nothing changed here.

- Allocations: here's how they would work, more or less (a rough C sketch
  follows below):

  1: set mb_ptr = F2->head;
     if (mb_ptr == NULL) {
         set mb_ptr = F1->head;
         if (mb_ptr is still == NULL) {
  2:         try the global mmbfree list; if nothing, go ahead and try to
             allocate from mb_map, place the result on the mmbfree list and
             try again; if there is still nothing, then either:
             (a) wait, if M_WAIT; waiting would work as previously
                 mentioned: if there are threads waiting, the free will
                 happen to mmbfree and the first waiting thread will be
                 awoken, and if that thread at that point still finds
                 nothing on its CPU lists, it will try mmbfree once again.
             (b) ENOBUFS otherwise, or if the wait fails.
  3:     } else the allocation is completed from F1;
     } else the allocation is completed from F2;
     Now, it's simple:
     if (mb_ptr != NULL) {
         Setup the mbuf;
     }

  Notice that everything from 2 to 3 above is what already happens and that
  there is little hysteria introduced, considering that the majority of
  allocations will be done directly from either F1 or F2.
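For concreteness, here is a minimal C sketch of that allocation path. The
names (struct mb_pcpu, mb_alloc(), mmbfree_alloc()) are invented for
illustration only, locking is omitted, and the mmbfree/mb_map fallback is
reduced to a stub that simply fails; this is a sketch of the idea, not the
actual implementation.

    /*
     * Rough sketch of the per-CPU allocation path described above.
     * All names are hypothetical; locking is left out entirely.
     */
    #include <stddef.h>

    struct mbuf_s {
            struct mbuf_s *m_next;
            /* ... header and data would live here ... */
    };

    struct mb_pcpu {
            struct mbuf_s *f1_head;  /* F1: holds up to (lw) mbufs */
            struct mbuf_s *f2_head;  /* F2: holds the excess above (lw) */
            int f1_cnt;
            int f2_cnt;
    };

    /*
     * Stand-in for step 2: the real thing would lock mmbfree, try the
     * global list, try mb_map, and possibly sleep if M_WAIT was passed.
     */
    static struct mbuf_s *
    mmbfree_alloc(int how)
    {
            (void)how;
            return (NULL);          /* i.e. ENOBUFS in this sketch */
    }

    static struct mbuf_s *
    mb_alloc(struct mb_pcpu *cpu, int how)
    {
            struct mbuf_s *m;

            /* Step 1: try F2 (the excess list) first, then F1. */
            m = cpu->f2_head;
            if (m == NULL) {
                    m = cpu->f1_head;
                    if (m == NULL) {
                            /* Step 2: global mmbfree list / mb_map. */
                            return (mmbfree_alloc(how));
                    }
                    /* Step 3: complete the allocation from F1. */
                    cpu->f1_head = m->m_next;
                    cpu->f1_cnt--;
            } else {
                    /* Complete the allocation from F2. */
                    cpu->f2_head = m->m_next;
                    cpu->f2_cnt--;
            }
            m->m_next = NULL;       /* "Setup the mbuf" would go here. */
            return (m);
    }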
- Freeing: freeing is done almost as I previously suggested, with a small
  modification in order to ensure that mmbfree gets repopulated. Here it is,
  more or less (again, a rough C sketch follows below):

  if (someone is sleeping)
      drop the mbuf in the global mmbfree list and issue a wakeup;
  else if (((number in F1) + 1) > (low watermark)) {
      put the mbuf in F2;
      increment the mbuf count in F2;
      if ((mbuf count in F2) > (high_wm - low_wm)) {
          put all of F2 into mmbfree; F2 is now clear;
      }
  } else
      put the mbuf in F1;
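To tie the free path to the same picture, here is a matching C sketch. It
reuses the invented struct mb_pcpu from the allocation sketch above;
mb_waiters, mmbfree_free_wakeup() and mmbfree_free_bulk() are hypothetical
stand-ins for "someone is sleeping", "free to mmbfree and wake up", and
"dump a whole list into mmbfree", and locking is again left out.

    extern int mb_waiters;                           /* threads asleep in M_WAIT */
    extern void mmbfree_free_wakeup(struct mbuf_s *m);
    extern void mmbfree_free_bulk(struct mbuf_s *head, int cnt);

    static void
    mb_free(struct mb_pcpu *cpu, struct mbuf_s *m, int lowat, int hiwat)
    {
            if (mb_waiters > 0) {
                    /* Someone is sleeping: free to mmbfree and wake them. */
                    mmbfree_free_wakeup(m);
                    return;
            }
            if (cpu->f1_cnt + 1 > lowat) {
                    /* F1 already holds (lw) mbufs: overflow into F2. */
                    m->m_next = cpu->f2_head;
                    cpu->f2_head = m;
                    cpu->f2_cnt++;
                    if (cpu->f2_cnt > hiwat - lowat) {
                            /* Hand all of F2 back to mmbfree in one shot. */
                            mmbfree_free_bulk(cpu->f2_head, cpu->f2_cnt);
                            cpu->f2_head = NULL;
                            cpu->f2_cnt = 0;
                    }
            } else {
                    /* Keep it on F1 for fast local reuse. */
                    m->m_next = cpu->f1_head;
                    cpu->f1_head = m;
                    cpu->f1_cnt++;
            }
    }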
What this system does is combine everything from the initial design with
Mike's pointer above (i.e. when allocating from mb_map, allocate to
mmbfree) and one of Alfred's two suggestions, which I'll quote and explain
here:

(1) If you are freeing mbufs and hit the high watermark on your fast list,
    you free (hw - lw) mbufs from your fast list into the general pool.

    Alfred's added suggestion: separate the fast list into a set of two,
    one to keep up to (lw) mbufs and the other to keep the difference, so
    that when it comes time to transfer to mmbfree, the transfer will be
    much faster. The new design (above) takes this into account.

(2) If you are allocating mbufs and have 0 on your fast list, you acquire
    (lw) mbufs from the general pool into your fast list.

    Alfred's added suggestion: keep the mbufs in mmbfree classed as
    "chunks" (of (lw) mbufs each, for example). The new design drops this
    idea completely.

First of all, if we do consider (2), we would have to go through a costly
count of the mbufs in the mmbfree list every time we do the transfer. The
suggestion above says that we should keep the mbufs in chunks to make the
transfer less costly, but it's flawed, because in that case there are two
possibilities:

(a) All "chunks" have the same size, (lw); this is unreasonable because
    (lw) is tunable and can change.

(b) "Chunks" have various sizes depending on when (lw) changes, and one
    would have to combine smaller chunks into larger ones (and vice versa?)
    in those cases; this is unreasonable because it adds too much overhead
    to some of those allocations.

I'd rather not have (2) at all and keep the speed than deal with this case.
Of course, this isn't set in stone, and if anyone still argues that (2)
should be implemented, I'd like to know why it is necessary. (1) alone will
eliminate ping-pong effects.

Regards,
Bosko Milekic
bmilekic@technokratis.com