Date:        Mon, 30 Oct 2000 10:44:57 -0800
From:        Alfred Perlstein <bright@wintelcom.net>
To:          Bosko Milekic <bmilekic@dsuper.net>
Cc:          freebsd-net@FreeBSD.ORG
Subject:     Re: MP: per-CPU mbuf allocation lists
Message-ID:  <20001030104457.E22110@fw.wintelcom.net>
In-Reply-To: <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com>; from bmilekic@dsuper.net on Mon, Oct 30, 2000 at 01:20:52PM -0500
References:  <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com>
* Bosko Milekic <bmilekic@dsuper.net> [001030 10:16] wrote:
>
> [cross-posted to freebsd-arch and freebsd-net, please continue
> discussion on freebsd-net]
>
> Hello,
>
> 	I recently wrote an initial "scratch pad" design for per-CPU mbuf
> lists (in the MP case). The design consists simply of introducing
> these "fast" lists for each CPU and populating them with mbufs at
> bootup. Allocations from these lists would not need to be protected
> with a mutex, as each CPU has its own. The general mmbfree list
> remains, and remains protected with a mutex, in case a per-CPU list
> is empty.
> 	My initial idea was to leave freeing to the general list and have
> a kproc "daemon" periodically populate the "fast" lists. This would
> of course have involved the addition of a mutex for each "fast" list
> as well, in order to ensure synchronization with the kproc. However,
> in the great majority of cases, when the kproc would be sleeping,
> acquiring the fast list's mutex would be very cheap, as waiting for
> it would never be an issue.
> 	Yesterday, Alfred pointed me to the HOARD web page and made
> several suggestions... all worthy of my attention.
> 	The changes I have decided to make to the design will make the
> system work as follows:
>
> - "Fast" list; a per-CPU mbuf list. They contain "w" (for
>   "watermark") mbufs, typically... more on this below.
>
> - The general (already existing) mmbfree list; a mutex-protected,
>   global list, in case the fast list is empty for the given CPU.
>
> - Allocations; all done from the "fast" lists, and all very fast in
>   the general case. If no mbufs are available, the general mmbfree
>   list's lock is acquired and an mbuf is taken from there. If no
>   mbuf is available even from the general list, we let go of the
>   lock, allocate a page from mb_map, and drop the resulting mbufs
>   onto our fast list, from which we grab the one we need. If mb_map
>   is starved, then:
>   (a) if M_NOWAIT, return ENOBUFS;
>   (b) go to sleep; on timeout, return ENOBUFS;
>   (c) no timeout, so we got a wakeup. The wakeup is accompanied by
>       acquisition of the general mmbfree list's mutex. Since we were
>       sleeping, we are assured that there is an mbuf waiting for us
>       on the general mmbfree list, so we grab it and drop the lock
>       (see the "freeing" section on why we know there's one on
>       mmbfree).
>
> - Freeing; first, if someone is sleeping, we grab the mmbfree global
>   list's mutex, drop the mbuf there, and then issue a wakeup. If
>   nobody is sleeping, we proceed as follows:

I like this idea. You could use it as a general way of noting that
there's a shortage: free to the global pool even when you're below
the low watermark that I discuss below...

>   (a) if our fast list does not have over "w" mbufs, put the mbuf on
>       our fast list, and then we're done;
>   (b) since our fast list already has "w" mbufs, acquire the mmbfree
>       mutex and drop the mbuf there.

You want to free in chunks; see below for suggestions.
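Before getting to that, just to make sure we're picturing the same
thing, here is a rough, untested sketch of the allocation path as you
describe it. Every name in it (mb_fast, mb_map_grow, mb_sleep_for_mbuf
and so on) is invented for illustration, and the kernel primitives are
only stubbed:

/*
 * Untested sketch only -- all names are made up, and the mutex and
 * sleep primitives are stubbed so just the shape is visible.
 */
struct mtx;				/* opaque stand-in for the real mutex */

struct mbuf {
	struct mbuf	*m_next;	/* free-list linkage */
};

struct mb_fastlist {
	struct mbuf	*fl_head;	/* only this CPU ever touches it */
	int		fl_count;	/* length, kept for the watermark */
};

extern struct mb_fastlist	mb_fast[];	/* one per CPU */
extern struct mbuf		*mmbfree;	/* global, mutex protected */
extern struct mtx		mmbfree_mtx;

void	mtx_lock(struct mtx *);
void	mtx_unlock(struct mtx *);
int	curcpu(void);
int	mb_map_grow(struct mb_fastlist *);	/* page from mb_map; 0 = ok */
int	mb_sleep_for_mbuf(struct mbuf **);	/* tsleep; 0 = woken up */

#define	M_NOWAIT	1

struct mbuf *
mb_alloc(int how)
{
	struct mb_fastlist *fl = &mb_fast[curcpu()];
	struct mbuf *m;

	/* Common case: our own fast list, no mutex at all. */
	if ((m = fl->fl_head) != NULL) {
		fl->fl_head = m->m_next;
		fl->fl_count--;
		return (m);
	}

	/* Fast list empty: fall back to the global pool. */
	mtx_lock(&mmbfree_mtx);
	if ((m = mmbfree) != NULL) {
		mmbfree = m->m_next;
		mtx_unlock(&mmbfree_mtx);
		return (m);
	}
	mtx_unlock(&mmbfree_mtx);

	/* Global pool dry too: grow our fast list from mb_map. */
	if (mb_map_grow(fl) == 0) {
		m = fl->fl_head;
		fl->fl_head = m->m_next;
		fl->fl_count--;
		return (m);
	}

	/* mb_map starved. */
	if (how == M_NOWAIT)
		return (NULL);		/* caller sees ENOBUFS */

	/*
	 * Sleep.  The free side checks for sleepers and frees to
	 * mmbfree with a wakeup, so on a normal (non-timeout) wakeup
	 * an mbuf is guaranteed to be waiting for us there.
	 */
	if (mb_sleep_for_mbuf(&m) != 0)
		return (NULL);		/* timed out: ENOBUFS */
	return (m);
}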
> Things to note:
>
> - If we're out of mbufs on our fast list, the general mmbfree list
>   has none available either, and mb_map is starved, we will return
>   ENOBUFS even though there may be free mbufs on other CPUs' fast
>   lists. This behavior will usually indicate a wrongly chosen
>   watermark ("w"), and we will have to consider how to inform our
>   users on how to properly select one. I already have some ideas for
>   alternate ways of handling this, but will leave that investigation
>   for later.
>
> - "w" is a tunable watermark. No fast list will ever contain more
>   than "w" mbufs. This presents a small problem. Consider a
>   situation where we initially set w = 500; we have two CPUs; CPU1's
>   fast list eventually gets 450 mbufs and CPU2's gets 345. Consider
>   then that we lower w to 200. Even though all subsequent freeing
>   will be done to the mmbfree list, unless a fast list eventually
>   goes under the 200 mark, we will likely end up sitting with > 200
>   mbufs on each CPU's fast list. The idea I presently have is to
>   have a kproc "garbage collect" the > w mbufs on the CPUs' fast
>   lists and put them back onto the mmbfree general list if it
>   detects that "w" has been lowered.
>
> I'm looking for input. Please feel free to comment with the
> _specifics_ of the system in mind.
>
> Thanks in advance to Alfred, who has already generated input. :-)

Oops, I think I wasn't clear enough: the idea is to have a low AND a
high watermark. Let's say you have hw (high water) at 500 and lw (low
water) at 250. The point being that:

1) if you are freeing mbufs and hit the high watermark on your
   fastlist, you free (hw - lw) mbufs from your fastlist into the
   general pool;

2) if you are allocating mbufs and have 0 on your fastlist, you
   acquire (lw) mbufs from the general pool into your fastlist.

This should avoid a ping-pong effect and at the same time allow the
last problem you spoke about to be addressed better.

More tricks that can be done:

Since you only free from low water to high water, you can avoid a
linked-list traversal for counting: keep all mbufs below your low
watermark on one fastlist, along with a count; when you free mbufs
past your low water, stick them on a separate list and maintain a
count for it as well. When the count on that list becomes greater
than hw - lw, you can dump it into the global list with just a
pointer swap and a bump of the count in the global freelist. A sketch
of this follows below.

You can also keep the mbufs in the global list in a special way, so
that you can do chunk allocations from it, by simply using the
m_nextpkt field of the mbuf to point to the next "chunk" of mbufs
that are hung off of m_next; you can hijack a byte out of the m_data
area to keep the count. That one is sketched below as well.
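In code, the separate-list trick might look something like this
(again untested, reusing the made-up declarations from the sketch
above; mb_lw, mb_hw, and the mb_cpu overflow fields are all invented):

/*
 * Untested sketch of the two-watermark free path.  The overflow list
 * is the "separate list": once it holds more than hw - lw mbufs, it
 * is spliced into the global pool with a single pointer swap and one
 * count bump -- no traversal.
 */
extern int	mb_lw, mb_hw;		/* low/high watermark tunables */
extern int	mmbfree_count;		/* global free-list length */

struct mb_cpu {
	struct mb_fastlist	mc_fast;	/* up to mb_lw mbufs */
	struct mbuf		*mc_over;	/* frees past low water */
	struct mbuf		**mc_overtail;	/* tail's m_next, for splice */
	int			mc_overcount;
};

extern struct mb_cpu	mb_cpu[];	/* one per CPU */

void
mb_free(struct mbuf *m)
{
	struct mb_cpu *mc = &mb_cpu[curcpu()];

	/* (Sleeper/shortage case omitted: those frees go straight to
	 * mmbfree with a wakeup, as in the design above.) */

	if (mc->mc_fast.fl_count < mb_lw) {
		/* Below low water: plain fast-list free, no mutex. */
		m->m_next = mc->mc_fast.fl_head;
		mc->mc_fast.fl_head = m;
		mc->mc_fast.fl_count++;
		return;
	}

	/* Past low water: park it on the overflow list and count. */
	if (mc->mc_overcount == 0)
		mc->mc_overtail = &m->m_next;	/* remember the tail */
	m->m_next = mc->mc_over;
	mc->mc_over = m;
	mc->mc_overcount++;

	if (mc->mc_overcount <= mb_hw - mb_lw)
		return;

	/* Count exceeded hw - lw: one pointer swap dumps the whole
	 * overflow chain into the global pool. */
	mtx_lock(&mmbfree_mtx);
	*mc->mc_overtail = mmbfree;
	mmbfree = mc->mc_over;
	mmbfree_count += mc->mc_overcount;
	mtx_unlock(&mmbfree_mtx);
	mc->mc_over = NULL;
	mc->mc_overcount = 0;
}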
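And the m_nextpkt chunk layout could look roughly like this -- also
untested, with a toy mbuf (here "mbuf2") that has only the fields the
trick needs, since the sketch above stripped the mbuf down to m_next:

/*
 * Untested sketch of a chunked global list: chunks are linked
 * through m_nextpkt, each chunk is a short m_next chain, and the
 * chunk length is stashed in the first byte of the head mbuf's data
 * area ("hijacking a byte out of m_data").
 */
struct mbuf2 {
	struct mbuf2	*m_next;	/* next mbuf inside this chunk */
	struct mbuf2	*m_nextpkt;	/* next chunk on the global list */
	unsigned char	m_dat[128];	/* stand-in for the data area */
};

extern struct mbuf2	*mmbfree2;	/* global list of chunks */

/*
 * Refill a fast list: unlink a whole chunk with one pointer move
 * under the mutex, instead of one mbuf at a time.  Returns NULL and
 * leaves *countp untouched if the global list is empty.
 */
struct mbuf2 *
mb_alloc_chunk(int *countp)
{
	struct mbuf2 *chunk;

	mtx_lock(&mmbfree_mtx);
	if ((chunk = mmbfree2) != NULL) {
		mmbfree2 = chunk->m_nextpkt;
		*countp = chunk->m_dat[0];	/* hijacked count byte */
	}
	mtx_unlock(&mmbfree_mtx);
	return (chunk);
}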
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."