Date:      Mon, 30 Oct 2000 10:44:57 -0800
From:      Alfred Perlstein <bright@wintelcom.net>
To:        Bosko Milekic <bmilekic@dsuper.net>
Cc:        freebsd-net@FreeBSD.ORG
Subject:   Re: MP: per-CPU mbuf allocation lists
Message-ID:  <20001030104457.E22110@fw.wintelcom.net>
In-Reply-To: <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com>; from bmilekic@dsuper.net on Mon, Oct 30, 2000 at 01:20:52PM -0500
References:  <Pine.BSF.4.21.0010301256580.30271-100000@jehovah.technokratis.com>

* Bosko Milekic <bmilekic@dsuper.net> [001030 10:16] wrote:
> 
>   [cross-posted to freebsd-arch and freebsd-net, please continue
>   discussion on freebsd-net]
> 
>   Hello,
> 
>   	I recently wrote an initial "scratch pad" design for per-CPU mbuf
>   lists (in the MP case). The design consists simply of introducing
>   these "fast" lists for each CPU and populating them with mbufs on bootup.
>   Allocations from these lists would not need to be protected with a mutex
>   as each CPU has its own. The general mmbfree list remains, and remains
>   protected with a mutex, in case the per-CPU list is empty.
>   	My initial idea was to leave freeing to the general list, and have a
>   kproc "daemon" periodically populate the "fast" lists. This would have of
>   course involved the addition of a mutex for each "fast" list as well, in
>   order to ensure synch with the kproc. However, in the great majority of
>   cases when the kproc would be sleeping, the acquiring of the mutex for
>   the fast list would be very cheap, as waiting for it would never be an
>   issue.
>   	Yesterday, Alfred pointed me to the HOARD web page and made several
>   suggestions... all worthy of my attention.
>   	The changes I have decided to make to the design will make the system
>   work as follows:
> 
>   - "Fast" list; a per-CPU mbuf list. They contain "w" (for "watermark")
>     number of mbufs, typically... more on this below.
>     
>   - The general (already existing) mmbfree list; mutex protected, global
>     list, in case the fast list is empty for the given CPU.
>     
>   - Allocations; all done from "fast" lists. All are very fast, in the
>     general case. If no mbufs are available, the general mmbfree list's
>     lock is acquired, and an mbuf is made from there. If no mbuf is
>     available, even from the general list, we let go of the lock and
>     allocate a page from mb_map and drop the mbufs onto our fast list, from
>     which we grab the one we need. If mb_map is starved, then:
> 	(a) if M_NOWAIT, return ENOBUFS
> 	(b) go to sleep, if timeout, return ENOBUFS
> 	(c) no timeout, so we got a wakeup; the wakeup came with the
> 	mmbfree general list's lock already held. Since we were sleeping, we
> 	are ensured that there is an mbuf waiting for us on the general
> 	mmbfree list, so we grab it and drop the lock (see the "freeing"
> 	section on why we know there's one on mmbfree).
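The allocation fallback chain you describe (fast list -> general mmbfree
list -> fresh page from mb_map) could be sketched roughly like this in
user-space C.  All the names here (fast_head, gen_head, page_alloc) are
invented for illustration, locking and M_WAIT sleeping are omitted, and
the page size is shrunk to a handful of mbufs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct mbuf { struct mbuf *m_next; };

static struct mbuf *fast_head;   /* this CPU's fast list: no lock needed */
static struct mbuf *gen_head;    /* global mmbfree: mutex-protected in-kernel */

/* Stand-in for carving a page out of mb_map into fresh mbufs. */
static struct mbuf *page_alloc(int n)
{
    struct mbuf *head = NULL;
    while (n-- > 0) {
        struct mbuf *m = malloc(sizeof(*m));
        if (m == NULL)
            return head;         /* mb_map starved: partial chain or NULL */
        m->m_next = head;
        head = m;
    }
    return head;
}

struct mbuf *mb_alloc(void)
{
    struct mbuf *m;

    if (fast_head != NULL) {     /* 1. common case: lock-free pop */
        m = fast_head;
        fast_head = m->m_next;
        return m;
    }
    if (gen_head != NULL) {      /* 2. fast list empty: general list
                                       (mmbfree mutex held here in-kernel) */
        m = gen_head;
        gen_head = m->m_next;
        return m;
    }
    /* 3. both empty: grab a "page", refill our fast list, take one. */
    fast_head = page_alloc(4);
    if (fast_head == NULL)
        return NULL;             /* M_NOWAIT case: caller sees ENOBUFS */
    m = fast_head;
    fast_head = m->m_next;
    return m;
}
```

The point of ordering the checks this way is that the only price paid in
the common case is the unlocked fast-list pop.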
> 
>    - Freeing; First, if someone is sleeping, we grab the mmbfree global
>      list mutex and drop the mbuf there, and then issue a wakeup. If nobody
>      is sleeping, then we proceed as follows:

I like this idea; you could use it as a general way of noting that
there's a shortage, freeing to the global pool even when you're below
the low watermark that I discuss below...

>      	(a) if our fast list does not have over "w" mbufs, put the mbuf on
> 	our fast list and then we're done
> 	(b) since our fast list already has "w" mbufs, acquire the mmbfree
> 	mutex and drop the mbuf there.

You want to free in chunks, see below for suggestions.
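For reference, the freeing path as you describe it (sleepers first, then
the fast list up to "w", then the global list) might look like the
following single-threaded sketch.  The sleeper flag and all identifiers
are stand-ins for the real mutex/sleep/wakeup machinery:

```c
#include <assert.h>
#include <stddef.h>

struct mbuf { struct mbuf *m_next; };

enum { W = 4 };                  /* illustrative watermark */

static struct mbuf *fast_head;   /* this CPU's fast list */
static int          fast_cnt;
static struct mbuf *gen_head;    /* global mmbfree list */
static int          sleepers;    /* allocators blocked waiting for mbufs */

void mb_free(struct mbuf *m)
{
    if (sleepers > 0) {          /* someone is waiting: grab the mmbfree
                                    mutex, free there, issue a wakeup */
        m->m_next = gen_head;
        gen_head = m;
        return;
    }
    if (fast_cnt < W) {          /* room below the watermark: keep local */
        m->m_next = fast_head;
        fast_head = m;
        fast_cnt++;
        return;
    }
    m->m_next = gen_head;        /* fast list at "w": global list */
    gen_head = m;
}
```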

>   Things to note:
>   
>     - note that if we're out of mbufs on our fast list, and the general
> 	mmbfree list has none available either, and mb_map is starved, even
> 	though there may be free mbufs on other CPU's fast lists, we will
> 	return ENOBUFS. This behavior will usually be an indication of a
> 	wrongly chosen watermark ("w") and we will have to consider how to
> 	inform our users on how to properly select a watermark. I already
> 	have some ideas for alternate situations/ways of handling this, but
> 	will leave this investigation for later.
>   
>     - "w" is a tunable watermark. No fast list will ever contain more than
> 	"w" mbufs. This presents a small problem. Consider a situation where
> 	we initially set w = 500; consider we have two CPUs; consider CPU1's
> 	fast list eventually gets 450 mbufs, and CPU2's fast list gets 345.
> 	Consider then that we decide to set w = 200; Even though all
> 	subsequent freeing will be done to the mmbfree list, unless we
> 	eventually go under the 200 mark for our free list, we will likely
> 	end up sitting with > 200 mbufs on each CPU's fast list. The idea I
> 	presently have is to have a kproc "garbage collect" > w mbufs on the
> 	CPUs' fast lists and put them back onto the mmbfree general list, if
> 	it detects that "w" has been lowered.
> 	
>   I'm looking for input. Please feel free to comment with the _specifics_
>   of the system in mind.
> 
>   Thanks in advance to Alfred who has already generated input. :-)

Oops, I think I wasn't clear enough: the idea is to have a low AND a high
watermark.  Let's say you have hw (high water) at 500 and lw (low water)
at 250.  The point being that:

1) if you are freeing mbufs and hit the high watermark on your fast list,
   you free (hw - lw) mbufs from your fast list into the general pool
2) if you are allocating mbufs and have 0 on your fast list, you acquire
   (lw) mbufs from the general pool into your fast list.

This should avoid a ping-pong effect and at the same time allow the
last problem you spoke about to be addressed better.
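In counter form, the two rules come out to something like the sketch
below, with the illustrative hw = 500 / lw = 250 values from above.
Plain counters stand in for the actual lists, and the names are made up:

```c
#include <assert.h>

enum { HW = 500, LW = 250 };

static int fast_cnt;          /* mbufs on this CPU's fast list */
static int gen_cnt = 1000;    /* mbufs on the global mmbfree list */

/* Freeing: when the fast list reaches hw, push hw - lw back in bulk. */
void mb_free_one(void)
{
    fast_cnt++;
    if (fast_cnt >= HW) {
        fast_cnt -= HW - LW;  /* bulk transfer to the global pool */
        gen_cnt += HW - LW;
    }
}

/* Allocating: when the fast list is empty, pull lw from the pool. */
int mb_alloc_one(void)
{
    if (fast_cnt == 0) {
        int take = gen_cnt < LW ? gen_cnt : LW;
        if (take == 0)
            return -1;        /* ENOBUFS in the real code */
        gen_cnt -= take;
        fast_cnt += take;
    }
    fast_cnt--;
    return 0;
}
```

Because both transfers are in (hw - lw)- or lw-sized chunks, a workload
that frees and allocates near a watermark doesn't bounce single mbufs
back and forth across the global mutex.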

More tricks that can be done:

Since you only free from low water up to high water, you can avoid a
linked-list traversal for counting: keep the mbufs below your low
watermark on one fast list, and when you free mbufs past low water,
stick them on a separate list and maintain a count.  When the count on
that second list becomes greater than hw - lw, you can dump it into the
global list with just a pointer swap and a bump of the global
freelist's count.
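A sketch of that two-list trick, with tiny watermarks so the demo stays
short.  Tracking the overflow list's tail makes the final splice a
constant-time pointer swap; every identifier here is invented for the
example:

```c
#include <assert.h>
#include <stddef.h>

struct mbuf { struct mbuf *m_next; };

enum { HW = 8, LW = 4 };         /* tiny illustrative watermarks */

static struct mbuf *fast_head;   /* at most LW mbufs */
static int          fast_cnt;
static struct mbuf *over_head;   /* frees beyond the low watermark */
static struct mbuf *over_tail;
static int          over_cnt;
static struct mbuf *gen_head;    /* global mmbfree list */
static int          gen_cnt;

void mb_free(struct mbuf *m)
{
    if (fast_cnt < LW) {         /* below lw: keep it local */
        m->m_next = fast_head;
        fast_head = m;
        fast_cnt++;
        return;
    }
    m->m_next = over_head;       /* past lw: separate overflow list */
    over_head = m;
    if (over_tail == NULL)
        over_tail = m;
    if (++over_cnt > HW - LW) {  /* dump the whole list: pointer swap */
        over_tail->m_next = gen_head;
        gen_head = over_head;
        gen_cnt += over_cnt;     /* count was maintained, no traversal */
        over_head = over_tail = NULL;
        over_cnt = 0;
    }
}
```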

You can also keep the mbufs in the global list in a special way that
allows chunk allocations: use the m_nextpkt field of an mbuf to point
to the next "chunk" of mbufs hung off m_next, and hijack a byte of
m_data to keep the count.
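That chunked layout might look like the following.  The m_next/m_nextpkt
field names follow struct mbuf, everything else is invented, and the
"count byte in m_data" part is left out to keep the sketch minimal:

```c
#include <assert.h>
#include <stddef.h>

struct mbuf {
    struct mbuf *m_next;      /* next mbuf within a chunk */
    struct mbuf *m_nextpkt;   /* next chunk on the global list */
};

static struct mbuf *gen_chunks;   /* global list, kept as chunks */

/* Push a whole m_next-linked chain onto the global list as one chunk. */
void chunk_free(struct mbuf *chain)
{
    chain->m_nextpkt = gen_chunks;
    gen_chunks = chain;
}

/* Pop a whole chunk; the caller gets back an m_next-linked chain. */
struct mbuf *chunk_alloc(void)
{
    struct mbuf *c = gen_chunks;
    if (c != NULL) {
        gen_chunks = c->m_nextpkt;
        c->m_nextpkt = NULL;
    }
    return c;
}
```

A fast-list refill then moves lw mbufs with one chunk_alloc() instead of
lw individual list operations under the global mutex.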

-- 
-Alfred Perlstein - [bright@wintelcom.net|alfred@freebsd.org]
"I have the heart of a child; I keep it in a jar on my desk."

