From owner-freebsd-net  Mon Oct 30 16:14:53 2000
Delivered-To: freebsd-net@freebsd.org
Received: from field.videotron.net (field.videotron.net [205.151.222.108])
	by hub.freebsd.org (Postfix) with ESMTP id 3246C37B479
	for <freebsd-net@FreeBSD.ORG>; Mon, 30 Oct 2000 16:14:48 -0800 (PST)
Received: from modemcable213.3-201-24.mtl.mc.videotron.ca ([24.201.3.213])
 by field.videotron.net (Sun Internet Mail Server sims.3.5.1999.12.14.10.29.p8)
 with ESMTP id <0G390027VQ0MEK@field.videotron.net> for freebsd-net@FreeBSD.ORG; Mon, 30 Oct 2000 19:14:46 -0500 (EST)
Date: Mon, 30 Oct 2000 19:19:19 -0500 (EST)
From: Bosko Milekic <bmilekic@dsuper.net>
Subject: Re: MP: per-CPU mbuf allocation lists
In-reply-to: <200010302240.e9UMeWF18172@mass.osd.bsdi.com>
X-Sender: bmilekic@jehovah.technokratis.com
To: Mike Smith <msmith@mass.osd.bsdi.com>
Cc: freebsd-net@FreeBSD.ORG
Message-id: <Pine.BSF.4.21.0010301822100.31430-100000@jehovah.technokratis.com>
MIME-version: 1.0
Content-type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-freebsd-net@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org


	I'm going to attempt to respond to all, if not most, of the feedback
  I've received on this in this Email, because Mike's Email seems to touch
  on a little bit of everything.

On Mon, 30 Oct 2000, Mike Smith wrote:

> Have you by any chance done any profiling to determine whether contention 
> for the free mbuf list is actually a performance issue, or is this just 
> one of those "hey, this would be cool" design decisions?

	First of all, if you're referring to the concept of having per-CPU
  lists, it follows from (and is justified by): http://www.hoard.org/
  Of course, we're not looking at that sort of general allocator, but the
  concept is very similar, and the performance gain is similarily
  justified.
  	However, in order to really be able to see a performance gain,
  several other things need to be "threaded." I certainly wouldn't waste
  the little free time I already have in doing something like this just
  because I thought it was "cool," or rather, if I did, I'm sure that I
  wouldn't be so selfish as to propose it to everyone and waste their time
  (including yours) as well.
	I hope this clears up most of the "what's the performance gain?" or
  "why would you do this?" issues. If not, I respectfully urge people to
  respond to me privately on this and I'll do my best to answer.

[...]
> Starvation of the general list should result in the general list being 
> populated, not the fast list.  In this case, the general list will remain 
> depleted until mbufs are freed, which will blow mb_map out faster than is 
> otherwise desirable.

	You're right. I've changed this in the revised design, presented
  below...

> This is a half-hearted "donation" algorithm.  It might make sense to 
> waste some cycles in the "sleeping" case to lock other cpu's "fast" lists 
> and steal mbufs from them...

	"Stealing" has been considered, but unfortunately, it introduces
  some hysterics into the system which I doubt certain people would go for.
  The idea will be to prevent the need for that situation, while optimizing
  allocation speed. The revised plan below attempts to insure that high and
  low watermarks are followed and, if set properly, there will not be any
  problems unless something else is wrong to begin with.

> The watermark-lowering operation is likely to be very infrequent.  As 
> such, it would hardly hurt for it to scan each of the "fast" lists and 
> steal excess mbufs back into the global pool.

	I agree, which is why I had initially proposed that it be done from a
  kproc... but the low/high watermark system eliminates the need for
  repopulating the general list from a kproc... see below.

  Here is the revised plan:

  - free lists; each CPU has a set of two lists, and that set is called a
    "fast" list. Not much changed except that there are two; the first will
    be referred to as "F1" and the second, "F2."
 
  - the general mmbfree list; nothing changed here.

  - Allocations; here's how they would work, more or less:
1:     set mb_ptr = F2->head;
	 if (mb_ptr == NULL) {
	     set mb_ptr = F1->head;
	     if (mb_ptr is still == NULL) {
2:	        try global mmbfree list, if nothing, then go ahead and try
		  to allocate from mb_map, place on the mmbfree list and try
		  again and, if still nothing, then either:
		   
		   (a) wait, if M_WAIT; waiting would work as previously
		   mentionned, if there are people waiting, then the free will
		   happen to mmbfree and the first waiting thread will be
		   awoken, and if that thread at that point still finds nothing
		   on its CPU lists, then it will try mmbfree once again.
		   
		   (b) ENOBUFS otherwise, or if wait fails.
3:         } else
	        the allocation is completed from F2;
	} else
	   the allocation is completed from F1;

      now, it's simple:
	
	if (mb_ptr != NULL) {
	    Setup the mbuf;
	}

   Notice that everything from 2 to 3 above is what already happens and
   that there is little hysteria introduced, when considering that the
   majority of allocations will be done directly from either F1 or F2.

  - Freeing; freeing is almost done as I previously suggested, with a
    little bit of a modification, in order to insure that mmbfree gets
    repopulated. Here it is, more or less:
      if (someone sleeping) 
	   drop mbuf in mmbfree global list and issue wakeup;
	else if (((number in F1) + 1) > (low watermark)) {
		put mbuf in F2;
		increment mbuf count in F2;
		if ((mbuf count in F2) > (high_wm - low_wm)) {
			put all of F2 into mmbfree, now F2 is clear;
		}
	} else
		put mbuf in F1;

  What this system does is combine everything from the initial design with
  Mike's pointer above (i.e. when allocating from mb_map, allocate to mmbfree)
  and one of Alfred's two suggestions, which I'll quote and explain why
  here:

  (1) if you are freeing mbufs and hit the high watermark on your fast
  list, you free into the general pool (hw - lw) mbufs from your fast list.

     Alfred's added suggestion: Seperate fast list into a set of two, one
     to keep up to (lw) mbufs, and the other to keep the difference, so
     that when it comes time to transfer to mmbfree, it will be much
     faster.

  The new design (above) takes this into account.

  (2) If you are allocating mbufs and have 0 on your fastlist you aquire
  (lw) mbufs from the general pool into your fast list.

     Alfred's added suggestion: Keep mbufs in mmbfree classed as "chunks"
     (i.e. of (lw) number of mbufs, for example).

   The new design drops this idea completely. First of all, let's say that
   we do consider (2), then we would have to go through a costly count of
   the mbufs in the mmbfree list every time we're doing the transfer. The
   suggestion above says that we should keep the mbufs in chunks, to make
   the transfer less costly, but it's flawed, because in that case, there
   are two possibilities:
   
      (1) all "chunks" have same size, (lw); this is unreasonable because
	(lw) is tunable and can change.

	(2) "chunks" have various sizes depending on when (lw) changes, one
	should be combining smaller chunks into larger ones (and vice versa?)
	in those cases; this is unreasonable because it will add too much
	overhead into some of those allocations. I'd rather deal with not
	having (2) at all and keep the speed than deal with this case.

	Of course, this isn't set in stone, and I'd like to know from those
  who will argue, if any, that (2) should still be implemented, why it is
  necessary. (1) alone will eliminate ping-pong effects.

  Regards,
  Bosko Milekic
  bmilekic@technokratis.com


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-net" in the body of the message