Date:      Tue, 18 Feb 2003 13:48:36 -0500
From:      Bosko Milekic <bmilekic@unixdaemons.com>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        freebsd-arch@FreeBSD.ORG
Subject:   Re: mb_alloc cache balancer / garbage collector
Message-ID:  <20030218134836.A70583@unixdaemons.com>
In-Reply-To: <200302181757.h1IHvjaC051829@apollo.backplane.com>; from dillon@apollo.backplane.com on Tue, Feb 18, 2003 at 09:57:45AM -0800
References:  <200302171742.h1HHgSOq097182@apollo.backplane.com> <20030217154127.A66206@unixdaemons.com> <200302180000.h1I00bvl000432@apollo.backplane.com> <20030217192418.A67144@unixdaemons.com> <20030217192952.A67225@unixdaemons.com> <200302180101.h1I11AWr001132@apollo.backplane.com> <20030217203306.A67720@unixdaemons.com> <200302180458.h1I4wQiA048763@apollo.backplane.com> <20030218093946.A69621@unixdaemons.com> <200302181757.h1IHvjaC051829@apollo.backplane.com>


On Tue, Feb 18, 2003 at 09:57:45AM -0800, Matthew Dillon wrote:
> :   I've looked at integrating these with the general all-purpose system
> :   allocator (UMA).  I ran into several issues that are not, to my
> :   knowledge, easily solved without ripping into UMA pretty badly.  I've
> :   mentioned these before.  One of the issues is the keep-cache-lock
> :   across grouped (m_getcl(), m_getm()) allocations and grouped
> :   de-allocations (m_freem()) for as long as possible.  The other issue
> :   has to do with keeping the common allocation and free cases down to
> :   one function call.  Further, the mbuf code does special things like
> :   call drain routines when completely exhausted and although I'm not
> :   100% certain, I can almost guarantee that making sure these work
> :   right with UMA is going to take a lot of ripping into it.  I'd like
> :   to avoid ripping into a general-purpose allocator that I think needs
> :   to have less rather than more application-specific complexities.
> 
>     Let's separate out the pure efficiency issues from special feature
>     support.  The cache locking issue is really just an efficiency issue,
>     easily solved with a little work on UMA.  Something like this for
>     example:
> 
> 	void **uma_lock = NULL;
> 
> 	/*
> 	 * Use of *uma_lock is entirely under the control of UMA.  It
> 	 * can release it, block, and reobtain it; release it and obtain
> 	 * another lock; or not use it at all (leave it NULL).  The only 
> 	 * requirement is that you call uma_cache_unlock(&uma_lock) 
> 	 * after you are through and that you not block in between UMA 
> 	 * operations.
> 	 */
> 	uma_cache_free(&uma_lock, ...) ... etc
> 	uma_cache_alloc(&uma_lock, ...) ... etc
> 
> 	uma_cache_unlock(&uma_lock);
> 
>     Which would allow UMA to maintain a lock through a set of operations,
>     at its sole discretion.  If the lock were made a real mutex then we
>     could even allow the caller to block in between UMA operations by
>     msleep()ing on it.  I've used this trick on a more global basis on
>     embedded systems... the 'uma_lock' equivalent actually winds up being
>     part of the task structure allowing it to be used universally by
>     multiple subsystems.  (which, by the way, would allow one to get
>     rid of the mutex argument to msleep() if it were done that way
>     in FreeBSD).

  It's not quite that simple.  You would also have to teach it how to
  drop the lock if one of the allocations fails (or if it has to go to
  another cache) and how to tell the caller that it has done that.
  That means that you'd be introducing more modifications to the API and
  making it more complicated than it should be (see the
  MBP_PERSIST{,ENT} implementation for the mbuf allocator).
  In most cases, you don't need to do the grouped-cache-lock thing,
  which is why I think that it's not worth complicating UMA just so the
  mbuf code can use it.  The fact that the mbuf code uses it is due to
  the way the mbuf object itself works.  That is, there are situations
  in which you only allocate an mbuf, and situations where you need both
  the mbuf and the cluster.  You want both situations to be fast and
  effectively cost one lock/unlock in the common case.
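
  To make the point concrete, the common case we're optimizing for looks
  roughly like this (a user-space illustration with hypothetical names,
  not the actual mb_alloc code; the real allocator arranges this with the
  MBP_PERSIST{,ENT} flags mentioned above):

	#include <pthread.h>

	struct pcpu_cache {
		pthread_mutex_t	lock;		/* per-CPU cache lock */
		void		*mbufs[32];	/* cached mbufs */
		void		*clusters[32];	/* cached clusters */
		int		nmbufs;
		int		nclusters;
	};

	/*
	 * One lock acquire/release covers both the mbuf and the cluster
	 * pulled from the same per-CPU cache.
	 */
	static int
	getcl_example(struct pcpu_cache *cp, void **mp, void **clp)
	{
		pthread_mutex_lock(&cp->lock);		/* one acquire... */
		if (cp->nmbufs == 0 || cp->nclusters == 0) {
			/* Would fall back to the global cache here. */
			pthread_mutex_unlock(&cp->lock);
			return (-1);
		}
		*mp = cp->mbufs[--cp->nmbufs];
		*clp = cp->clusters[--cp->nclusters];
		pthread_mutex_unlock(&cp->lock);	/* ...one release */
		return (0);
	}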

>     The mbuf draining issue is more of an issue.

  It is.  So is keeping the common case down to one function call
  without removing the generality of UMA.  I have to keep bringing this
  one up; if we suddenly start to increase the number of function calls
  required to allocate (and CONFIGURE) an mbuf, then we'll also be
  quadrupling the number of function calls needed to allocate an mbuf
  _and_ a cluster (and CONFIGURE them).  This influences overall
  performance more than one may think.
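
  To illustrate, here is a kernel-side sketch of the two allocation paths
  (the interfaces are shown approximately as they exist in the mbuf code
  today; treat this as a sketch, not a quote from the tree):

	#include <sys/param.h>
	#include <sys/mbuf.h>

	/* Combined path: one call, one cache lock/unlock inside. */
	static struct mbuf *
	alloc_packet_combined(void)
	{
		return (m_getcl(M_DONTWAIT, MT_DATA, M_PKTHDR));
	}

	/* Split path: separate calls, each may lock the cache on its own. */
	static struct mbuf *
	alloc_packet_split(void)
	{
		struct mbuf *m;

		m = m_gethdr(M_DONTWAIT, MT_DATA);
		if (m == NULL)
			return (NULL);
		m_clget(m, M_DONTWAIT);
		if ((m->m_flags & M_EXT) == 0) {
			m_free(m);
			return (NULL);
		}
		return (m);
	}

  Every extra call in the per-object path shows up multiplied in the
  mbuf-plus-cluster path, which is where the quadrupling above comes from.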

  There's also the reference counting issue.  We've been through this
  before, actually, on more than one occasion.

> :  Yes, you're right, but the difference is that in most cases, with the
> :  kproc, you'll minimize the cost of most of the allocations and frees
> :  because the kproc will have done the moving in less time.
> 
>     Why would the kproc minimize the cost of the allocations? 
> 
>     Try to estimate the efficiency of the following three methods:
> 
>     * The kproc allocating 200 mbufs per scheduled wakeup and the
>       client then making 200 allocations via the local cpu cache.
> 
>       (2 Context switches for every 200 allocations)
> 
>     * The client making 200 allocations via the local cpu cache,
>       the local cpu cache running out, and the allocator doing a bulk
>       allocation of 20 mbufs at a time.
> 
>       (1 VM/global mutex interaction for every 20 allocations).

  Actually, it's more than that.  There are supporting structures
  required for every bucket-worth you allocate.  So you need to allocate
  those supporting structures as well.
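
  Something along these lines (a hypothetical layout, just to show where
  the extra allocations come from; this is not the actual bucket
  structure):

	#include <stdlib.h>

	#define	OBJS_PER_BUCKET	20

	struct bucket {
		void	*owner;			/* cache this bucket feeds */
		void	*slots[OBJS_PER_BUCKET];/* the cached objects */
		int	nfree;
	};

	/* A bulk refill pays for the header as well as the objects. */
	static struct bucket *
	bucket_setup_example(void *objs[], void *owner)
	{
		struct bucket *b;
		int i;

		b = malloc(sizeof(*b));		/* the supporting structure */
		if (b == NULL)
			return (NULL);
		b->owner = owner;
		for (i = 0; i < OBJS_PER_BUCKET; i++)
			b->slots[i] = objs[i];
		b->nfree = OBJS_PER_BUCKET;
		return (b);
	}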

>     * The kproc uses idle cycles to pre-allocate N mbufs in the per-cpu
>       cache(s).
> 
>       (potentially no overhead if idle cycles are available)
> 
> 
>     I would argue that the kproc method only beats the on-the-fly
>     method if the system has lots of idle cycles for the kproc to run in.
>     Under heavy loads, the on-the-fly method is going to win hands down
>     (in my opinion).  Under light loads we shouldn't care if we are
>     slightly less efficient, since we would become more efficient as the
>     load increases.
> 
>     Consider the tuning you would have to do under heavy loads to minimize
>     the number of kproc wakeups.  And, also, note that if your goal is 
>     for the kproc to never have to wakeup then you are talking about a
>     situation where the on-the-fly mechanism would equivalently not have
>     to resort to the global cache.  The on-the-fly mechanism is trivially
>     tunable, the kproc mechanism is not.
> 
> :  I understand that the pageout daemon probably employs an algorithm
> :  that can get de-stabilized by large shifting of memory from one
> :  subsystem to another.  However, my argument is that the effect of
> :  moving slightly larger chunks of memory for network buffers is more
> :  good than bad.  There are more common cases than there are corner
> :  cases and for the corner cases I think that I could work out a decent
> :  recovery mechanism (the kproc could be temporarily 'turned off,' for
> :  example).
> 
>     I agree as long as the phrase is 'slightly larger chunks...'.  But
>     that same argument applies to on-the-fly allocation from the global
>     cache, and as I point out above when you have a kproc you still have
>     to decide how long (how much latency) to allow that kproc to
>     introduce, which limits how many mbufs it should try to allocate from
>     the global cache, right?
> 
> :  Here's what I think I'll do in order to get what I'm sure we both want
> :  immediately without slowing down progress.  I'm going to implement the
> :  on-the-fly freeing to VM case (this is pretty trivial).  I'll present
> :  that and we can get that into the tree (so that we can at least
> :  recover resources following network spikes).  I'll keep the kproc code
> :  here and try to tune it to demonstrate eventually that it does the
> :  right thing and that corner cases are minimized.  I'll also try
> :  varying the number of objects per bucket, especially in the cluster
> :  case, and see where we go from there.  Keep in mind that because this
> :  is a specific network-buffer allocator, we may be able to get away
> :  with moving larger chunks of objects from a kproc without necessarily
> :  incurring all the bad effects of general-purpose allocation systems.
> :...
> :  It's an interesting corner case, but instead of completely trashing
> :  the kproc idea (which does gain us something in common cases by
> :  minimizing interactions with VM), I'll see if I can tune it to react
> :  properly.  I'll look at what kind of gains we can get from more
> :  conservative moves from the kproc vis-a-vis larger buckets.  It's easy
> :  to tune these things without ripping anything else apart, specifically
> :  because network buffers are allocated in their own special way.
> :  
> :  Matt, thanks for still reading the lists and remaining concerned.
> :
> :-- 
> :Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org
> 
>     I think it is well worth you implementing both and making them 
>     switchable with sysctl's (simply by adjusting two different sets of
>     hysteresis levels, for example).  Then you can test both under load
>     to see if the kproc is worth it.  It might well turn out that the 
>     kproc is a good idea but that on-the-fly allocation and deallocation
>     is necessary to handle degenerate situations.  Or it might turn out
>     that the kproc creates more problems than it solves.  Or it might turn
>     out that the on-the-fly allocation and deallocation code is so close
>     to the kproc code in regards to efficiency that there is no real 
>     reason to have the kproc.  Or it might turn out that the kproc's best
>     use is to recover memory after the machine has finished doing some
>     real hard networking work and is now becoming more idle.
> 
>     Obviously my opinion is heavily weighted towards on-the-fly.  At
>     the same time, I see no reason why you can't develop your kproc idea and
>     even commit it.  You are, after all, the person who is taking the
>     time to work on it.

  Hmmmmmm... both!  The ideal situation would be to have the kproc run
  in not-too-loaded situations but, once the load gets high, to recover
  through the on-the-fly code.  The problem then shifts to determining
  when we're "not-too-loaded" (admittedly, this is not as easy as it
  sounds, since "load" is not purely defined by the state of the network
  buffers).  Barring other disagreements, I'll implement the on-the-fly
  case, commit that, and take it from there, because at this stage it's
  extremely important for me, at the very least, to have the system free
  resources back after a spike.
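
  For the record, the on-the-fly freeing I have in mind looks roughly
  like this (hypothetical names and watermark, not the diff I'll post):

	#include <pthread.h>

	#define	BUCKET_SIZE	20

	struct cache {
		pthread_mutex_t	lock;
		void		*objs[1024];
		int		count;
	};

	static int cache_hiwat = 512;	/* high watermark; could become a sysctl knob */

	/* Hypothetical: hand one bucket-worth back to the global cache/VM. */
	void	release_bucket_to_vm(struct cache *cp);

	/*
	 * Free path with on-the-fly recovery: if we're over the high
	 * watermark, give memory back right here instead of waiting for
	 * a kproc wakeup.
	 */
	void
	cache_free_example(struct cache *cp, void *obj)
	{
		pthread_mutex_lock(&cp->lock);
		cp->objs[cp->count++] = obj;
		if (cp->count >= cache_hiwat + BUCKET_SIZE)
			release_bucket_to_vm(cp); /* drops count by BUCKET_SIZE */
		pthread_mutex_unlock(&cp->lock);
	}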

  With that said, does anyone disagree with this approach?

> 					-Matt
> 					Matthew Dillon 
> 					<dillon@backplane.com>

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org

