Date: Mon, 30 Oct 2000 10:44:57 -0800
From: Alfred Perlstein <bright@wintelcom.net>
To: Bosko Milekic <bmilekic@dsuper.net>
Cc: freebsd-net@FreeBSD.ORG
Subject: Re: MP: per-CPU mbuf allocation lists
Message-ID: <20001030104457.E22110@fw.wintelcom.net>
In-Reply-To: <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com>; from bmilekic@dsuper.net on Mon, Oct 30, 2000 at 01:20:52PM -0500
References: <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com>
* Bosko Milekic <bmilekic@dsuper.net> [001030 10:16] wrote:
>
> [cross-posted to freebsd-arch and freebsd-net, please continue
> discussion on freebsd-net]
>
> Hello,
>
> I recently wrote an initial "scratch pad" design for per-CPU mbuf
> lists (in the MP case). The design consists simply of introducing
> these "fast" lists for each CPU and populating them with mbufs on bootup.
> Allocations from these lists would not need to be protected with a mutex
> as each CPU has its own. The general mmbfree list remains, and remains
> protected with a mutex, in case the per-CPU list is empty.
> My initial idea was to leave freeing to the general list, and have a
> kproc "daemon" periodically populate the "fast" lists. This would have of
> course involved the addition of a mutex for each "fast" list as well, in
> order to insure synch with the kproc. However, in the great majority of
> cases when the kproc would be sleeping, the acquiring of the mutex for
> the fast list would be very cheap, as waiting for it would never be an
> issue.
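For reference, here is roughly the per-CPU state I picture from your
description; every name below is invented for illustration, this is
just a sketch:

    struct mb_fastlist {
            struct mbuf     *fl_head;   /* singly-linked free mbufs    */
            int              fl_count;  /* length of the fl_head chain */
            int              fl_wm;     /* the watermark "w"           */
            struct mtx       fl_mtx;    /* only for kproc synch; the
                                           owning CPU almost never
                                           waits on it                 */
    };

    static struct mb_fastlist mb_fast[MAXCPU];  /* one per CPU; the
                                                   constant name is
                                                   made up             */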
> Yesterday, Alfred pointed me to the HOARD web page and made several
> suggestions... all worthy of my attention.
> The changes I have decided to make to the design will make the system
> work as follows:
>
> - "Fast" list; a per-CPU mbuf list. They contain "w" (for "watermark")
> number of mbufs, typically... more on this below.
>
> - The general (already existing) mmbfree list; mutex protected, global
> list, in case the fast list is empty for the given CPU.
>
> - Allocations; all done from "fast" lists. All are very fast, in the
> general case. If no mbufs are available, the general mmbfree list's
> lock is acquired, and an mbuf is taken from there. If no mbuf is
> available, even from the general list, we let go of the lock and
> allocate a page from mb_map and drop the mbufs onto our fast list, from
> which we grab the one we need. If mb_map is starved, then:
> (a) if M_NOWAIT, return ENOBUFS
> (b) go to sleep, if timeout, return ENOBUFS
> (c) no timeout, so we got a wakeup; the wakeup comes with the mmbfree
> general list's mutex already acquired. Since we were sleeping, we are
> assured that there is an mbuf waiting for us on the general mmbfree
> list, so we grab it and drop the lock (see the "freeing" section on
> why we know there's one on mmbfree).
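So the allocation path would look roughly like this (sketch only;
cpuid(), mb_map_fill(), mb_alloc_wait() and the mmbfree_* names are
invented stand-ins, not real code):

    struct mbuf *
    mb_alloc(int how)
    {
            struct mb_fastlist *fl = &mb_fast[cpuid()];
            struct mbuf *m;

            if ((m = fl->fl_head) != NULL) {    /* common case: no mutex */
                    fl->fl_head = m->m_next;
                    fl->fl_count--;
                    return (m);
            }
            mtx_enter(&mmbfree_mtx, MTX_DEF);   /* fast list is empty */
            if ((m = mmbfree_head) != NULL) {
                    mmbfree_head = m->m_next;
                    mtx_exit(&mmbfree_mtx, MTX_DEF);
                    return (m);
            }
            mtx_exit(&mmbfree_mtx, MTX_DEF);    /* drop lock around mb_map */
            if (mb_map_fill(fl) == 0)           /* new page -> our fast list */
                    return (mb_alloc(how));     /* retry, now non-empty */
            if (how == M_NOWAIT)
                    return (NULL);              /* caller sees ENOBUFS */
            /*
             * Sleep.  A freeing CPU wakes us up with the mmbfree mutex
             * held and an mbuf guaranteed to be on the general list,
             * which is your case (c).
             */
            return (mb_alloc_wait());
    }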
>
> - Freeing; First, if someone is sleeping, we grab the mmbfree global
> list mutex and drop the mbuf there, and then issue a wakeup. If nobody
> is sleeping, then we proceed as follows:
I like this idea; you could use it as a general way of noting that
there's a shortage, and free to the global pool even when you're below
the low watermark that I discuss below...
> (a) if our fast list does not have over "w" mbufs, put the mbuf on
> our fast list and then we're done
> (b) since our fast list already has "w" mbufs, acquire the mmbfree
> mutex and drop the mbuf there.
You want to free in chunks, see below for suggestions.
> Things to note:
>
> - note that if we're out of mbufs on our fast list, and the general
> mmbfree list has none available either, and mb_map is starved, even
> though there may be free mbufs on other CPU's fast lists, we will
> return ENOBUFS. This behavior will usually indicate a wrongly chosen
> watermark ("w"), and we will have to consider how to advise users on
> properly selecting a watermark. I already have some ideas for
> alternate situations/ways of handling this, but will leave that
> investigation for later.
>
> - "w" is a tunable watermark. No fast list will ever contain more than
> "w" mbufs. This presents a small problem. Consider a situation where
> we initially set w = 500; consider we have two CPUs; consider CPU1's
> fast list eventually gets 450 mbufs, and CPU2's fast list gets 345.
> Consider then that we decide to set w = 200; Even though all
> subsequent freeing will be done to the mmbfree list, unless we
> eventually go under the 200 mark for our free list, we will likely
> end up sitting with > 200 mbufs on each CPU's fast list. The idea I
> presently have is to have a kproc "garbage collect" > w mbufs on the
> CPUs' fast lists and put them back onto the mmbfree general list, if
> it detects that "w" has been lowered.
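The kproc loop for that could be as dumb as this (again a sketch, with
invented names; ncpus stands for however you count CPUs):

    static void
    mb_gc_kproc(void *arg)
    {
            struct mb_fastlist *fl;
            struct mbuf *m;
            int cpu;

            for (;;) {
                    for (cpu = 0; cpu < ncpus; cpu++) {
                            fl = &mb_fast[cpu];
                            mtx_enter(&fl->fl_mtx, MTX_DEF);
                            while (fl->fl_count > fl->fl_wm) {
                                    m = fl->fl_head;    /* unlink surplus */
                                    fl->fl_head = m->m_next;
                                    fl->fl_count--;
                                    mtx_enter(&mmbfree_mtx, MTX_DEF);
                                    m->m_next = mmbfree_head;
                                    mmbfree_head = m;
                                    mtx_exit(&mmbfree_mtx, MTX_DEF);
                            }
                            mtx_exit(&fl->fl_mtx, MTX_DEF);
                    }
                    tsleep(&mb_fast, PVM, "mbgc", hz);  /* ~once a second */
            }
    }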
>
> I'm looking for input. Please feel free to comment with the _specifics_
> of the system in mind.
>
> Thanks in advance to Alfred who has already generated input. :-)
Oops, I think I wasn't clear enough. The idea is to have a low AND a high
watermark; let's say you have hw (high water) at 500 and lw (low water)
at 250. The point being that:
1) if you are freeing mbufs and hit the high watermark on your fastlist,
you free (hw - lw) mbufs from your fastlist into the general pool
2) if you are allocating mbufs and have 0 on your fastlist, you acquire
(lw) mbufs from the general pool into your fastlist
This should avoid a ping-pong effect and at the same time allow the
last problem you spoke about to be addressed better.
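In code, the free side becomes roughly this (sketch; mb_sleepers,
mb_flush() and the hw/lw globals are invented names):

    void
    mb_free(struct mbuf *m)
    {
            struct mb_fastlist *fl = &mb_fast[cpuid()];

            if (mb_sleepers > 0) {              /* shortage: go global */
                    mtx_enter(&mmbfree_mtx, MTX_DEF);
                    m->m_next = mmbfree_head;
                    mmbfree_head = m;
                    wakeup(&mmbfree_head);      /* case (c) on the
                                                   allocation side     */
                    mtx_exit(&mmbfree_mtx, MTX_DEF);
                    return;
            }
            m->m_next = fl->fl_head;            /* common case: no mutex */
            fl->fl_head = m;
            if (++fl->fl_count >= mb_hiwat)     /* hit hw: dump a batch */
                    mb_flush(fl, mb_hiwat - mb_lowat);
            /* the allocation side, symmetrically, refills lw at a time */
    }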
More tricks that can be done:
Since you only free from the low watermark up to the high watermark,
you can avoid a linked-list traversal for counting: keep all mbufs
below your low watermark on one fastlist along with a count; when you
free mbufs past your low watermark, stick them on a separate list and
maintain a count for that list too. When the count on that list becomes
greater than hw - lw, you can dump it into the global list with just a
pointer swap and a bump of the global freelist's count.
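i.e. give each fastlist a second "overflow" list, something like this
(sketch; fl_over, fl_otail and fl_ocnt would be new fields on the
fastlist struct from before):

    /* Called instead of the plain free once the fastlist holds lw mbufs. */
    static void
    mb_free_past_lw(struct mb_fastlist *fl, struct mbuf *m)
    {
            m->m_next = fl->fl_over;            /* push on overflow list */
            if (fl->fl_over == NULL)
                    fl->fl_otail = m;           /* remember the tail */
            fl->fl_over = m;
            if (++fl->fl_ocnt < mb_hiwat - mb_lowat)
                    return;
            mtx_enter(&mmbfree_mtx, MTX_DEF);
            fl->fl_otail->m_next = mmbfree_head; /* the pointer swap...  */
            mmbfree_head = fl->fl_over;
            mmbfree_cnt += fl->fl_ocnt;          /* ...and the count bump */
            mtx_exit(&mmbfree_mtx, MTX_DEF);
            fl->fl_over = fl->fl_otail = NULL;
            fl->fl_ocnt = 0;
    }

Note that nothing here ever walks a list, which is the whole point.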
You can also keep the mbufs in the global list in a special way so
that you can do chunk allocations from it: simply use the m_nextpkt
field of the mbuf to point to the next "chunk" of mbufs that are hung
off of m_next, and hijack a byte out of the m_data area to keep the
count.
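A refill then looks like this (sketch; assumes each chunk was built lw
mbufs long with its length stored in its first data byte):

    /*
     * mmbfree_head -> [chunk] --m_nextpkt--> [chunk] --m_nextpkt--> ...
     *                    |                      |
     *                 m_next chain           m_next chain
     */
    static void
    mb_refill(struct mb_fastlist *fl)
    {
            struct mbuf *chunk;

            mtx_enter(&mmbfree_mtx, MTX_DEF);
            if ((chunk = mmbfree_head) == NULL) {
                    mtx_exit(&mmbfree_mtx, MTX_DEF);
                    return;
            }
            mmbfree_head = chunk->m_nextpkt;    /* unhook a whole chunk */
            mtx_exit(&mmbfree_mtx, MTX_DEF);
            chunk->m_nextpkt = NULL;
            fl->fl_head = chunk;                /* chunk is an m_next chain */
            fl->fl_count = *mtod(chunk, u_char *); /* the hijacked byte */
    }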
--
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."
