Date:      Tue, 18 Feb 2003 09:39:46 -0500
From:      Bosko Milekic <bmilekic@unixdaemons.com>
To:        Matthew Dillon <dillon@apollo.backplane.com>
Cc:        freebsd-arch@FreeBSD.ORG
Subject:   Re: mb_alloc cache balancer / garbage collector
Message-ID:  <20030218093946.A69621@unixdaemons.com>
In-Reply-To: <200302180458.h1I4wQiA048763@apollo.backplane.com>; from dillon@apollo.backplane.com on Mon, Feb 17, 2003 at 08:58:26PM -0800
References:  <15952.62746.260872.18687@grasshopper.cs.duke.edu> <20030217095842.D64558@unixdaemons.com> <200302171742.h1HHgSOq097182@apollo.backplane.com> <20030217154127.A66206@unixdaemons.com> <200302180000.h1I00bvl000432@apollo.backplane.com> <20030217192418.A67144@unixdaemons.com> <20030217192952.A67225@unixdaemons.com> <200302180101.h1I11AWr001132@apollo.backplane.com> <20030217203306.A67720@unixdaemons.com> <200302180458.h1I4wQiA048763@apollo.backplane.com>


On Mon, Feb 17, 2003 at 08:58:26PM -0800, Matthew Dillon wrote:
> :  You can't just 'free' a bunch of mbufs back to the VM.  You free them
> :  wherever you got them from (usually your pcpu cache).  If you exceed
> :  mbuf_limit on your pcpu cache you'll migrate a bucket over to the
> :  global cache, which is what you want.  However if your global cache
> :  becomes too 'blown up' as you say, then you may want to recover the
> :  unused physical pages.  Doing that directly from the free has several
> :  disadvantages.  It can be expensive in more ways than one: for one,
> :  the VM call itself is extra overhead.  Second, freeing a page
> :  sometimes means traversing the cache until you hit a page's worth of
> :  free mbufs, so even though you may really need to free a page you'll
> :  never actually get to freeing it unless you start traversing the
> :  list of buckets in the cache; and that's expensive for a simple
> :  free, common case or not.
> 
>     Remember you are talking about two memory subsystems here.  There
>     was a suggestion a little while back in the thread that a better
>     solution might be to integrate the mbuf allocator with UMA.  That's
>     really my main point.  Use UMA and solve the global cache -> global
>     VM issue in UMA.

   I've looked at integrating these with the general all-purpose system
   allocator (UMA).  I ran into several issues that are not, to my
   knowledge, easily solved without ripping into UMA pretty badly.  I've
   mentioned these before.  One issue is keeping the cache lock held
   across grouped allocations (m_getcl(), m_getm()) and grouped
   de-allocations (m_freem()) for as long as possible.  Another is
   keeping the common allocation and free cases down to one function
   call.  Further, the mbuf code does special things like calling drain
   routines when completely exhausted and, although I'm not 100%
   certain, I can almost guarantee that making these work right with
   UMA would take a lot of ripping into it.  I'd like to avoid ripping
   into a general-purpose allocator that I think needs fewer, not more,
   application-specific complexities.
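
   To make the grouped-allocation point concrete, here's a rough
   userland sketch of the pattern: the per-CPU cache lock is taken once
   and held across both object grabs, which is hard to express through a
   general-purpose allocator's one-object-at-a-time interface.  All
   names are made up for illustration; this is not the actual mb_alloc
   code.

```c
#include <stddef.h>
#include <pthread.h>

struct obj { struct obj *next; };

struct pcpu_cache {
    pthread_mutex_t lock;
    struct obj *mbufs;      /* free list of mbufs    */
    struct obj *clusters;   /* free list of clusters */
};

/* Pop one object from a free list; caller holds the cache lock. */
static struct obj *cache_pop(struct obj **head) {
    struct obj *o = *head;
    if (o != NULL)
        *head = o->next;
    return o;
}

/* m_getcl()-style grouped allocation: one lock round-trip covers both
 * the mbuf and the cluster.  Returns 0 on success, -1 if either list
 * is empty (any object already taken is put back, so nothing leaks). */
int alloc_mbuf_cluster(struct pcpu_cache *c, struct obj **m, struct obj **cl) {
    pthread_mutex_lock(&c->lock);
    *m = cache_pop(&c->mbufs);
    *cl = cache_pop(&c->clusters);
    if (*m == NULL || *cl == NULL) {
        if (*m)  { (*m)->next = c->mbufs;     c->mbufs = *m;     *m = NULL; }
        if (*cl) { (*cl)->next = c->clusters; c->clusters = *cl; *cl = NULL; }
        pthread_mutex_unlock(&c->lock);
        return -1;
    }
    pthread_mutex_unlock(&c->lock);
    return 0;
}

/* Tiny self-check: seed a cache with one mbuf and one cluster; the
 * first grouped allocation succeeds, the second fails cleanly. */
int demo_grouped_alloc(void) {
    static struct obj m0, c0;
    struct pcpu_cache c = { PTHREAD_MUTEX_INITIALIZER, &m0, &c0 };
    struct obj *m, *cl;
    int ok = alloc_mbuf_cluster(&c, &m, &cl) == 0;
    int empty = alloc_mbuf_cluster(&c, &m, &cl) == -1;
    return ok && empty;
}
```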

>     I have to disagree with your idea of 'expense'.  At the point where
>     freeing things on-the-fly becomes 'too expensive' your kernel thread
>     will *already* be overloaded and messing up the system in other ways.
> 
>     Here's an example:  Let's say we have an extreme mbuf load.  Not so
>     much in allocations, but in the *rate* of allocation and the *rate*
>     of freeing.  Now let's say you hit a hysteresis point.  With the
>     thread idea you wake up your thread and continue on your merry way.
>     You are assuming that your thread will be able to handle it.  But
>     this may not be true.
> 
>     Now let's say you are doing things on the fly and hit the hysteresis
>     point.  What will happen now is rather simple:  Once you go over the
>     upper bound you need to free mbufs until you hit the lower bound.
>     You want to free more than one at a time for efficiency, but you *don't*
>     need to free all the mbufs at once.  What you do is simply
>     free, say, 5 mbufs at a time for every call to free an mbuf until
>     the levels drop to the lower bound.  In other words, latency can be
>     fully controlled with an on-the-fly solution because it is fully
>     self-pacing.
> 
>     Now let's go back and look at the thread.  Let's say something gets
>     unbalanced and you hit your upper bound again, and start the thread
>     going.  How many mbufs is the thread going to free at once?  Is it
>     going to free the entire wad required to get back to the lower bound?
>     How will this affect the latency of other processes?  Of the pageout
>     daemon, for example, or even of user processes which until your thread
>     started running were doing a fair job draining the TCP and UDP
>     buffers they've been processing.  Unlike the on-the-fly method you
>     can't really 'pace' the thread, because of the huge overhead in
>     going to sleep every few milliseconds versus the overhead of freeing
>     the mbufs.
> 
>     In other words, the question becomes:  How do you intend to control
>     the latency your thread is now causing in the system?  I can pace
>     the on-the-fly method trivially... in like four lines of code.  How
>     do you solve the same problem with your thread?  It isn't as simple
>     as giving it a fixed priority that is less than X and greater than Y.
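
  For what it's worth, the self-pacing scheme you describe really is
  only a few lines.  A rough sketch (the watermarks, the batch size of
  5, and all names here are made up for illustration):

```c
#include <stdbool.h>

#define WM_LOW   64
#define WM_HIGH 128
#define BATCH     5

static int cached;           /* mbufs sitting in the cache            */
static int released_to_vm;   /* objects handed back to VM (counter)   */
static bool draining;        /* true between upper and lower bound    */

/* Called on every mbuf free.  The freed mbuf always goes to the cache
 * first; the extra work is bounded by BATCH, so per-free latency stays
 * constant no matter how far over the upper bound we are. */
void mbuf_free_one(void) {
    cached++;
    if (cached > WM_HIGH)
        draining = true;
    if (draining) {
        int n = cached - WM_LOW;
        if (n > BATCH)
            n = BATCH;
        cached -= n;
        released_to_vm += n;
        if (cached <= WM_LOW)
            draining = false;
    }
}

/* Self-check: across 500 frees the cache never ends a call above the
 * upper bound, and some memory has been returned to VM. */
int demo_pacing(void) {
    for (int i = 0; i < 500; i++) {
        mbuf_free_one();
        if (cached > WM_HIGH)
            return 0;
    }
    return released_to_vm > 0 && cached <= WM_HIGH;
}
```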
> 
> :  By doing the freeing from the kproc context you're not interfering
> :  with parallel allocations but you're also not taking longer than it
> :  takes to just cache the data being freed for the free case.  That's a
> :  big advantage.  By having the kproc also fill the pcpu caches
> 
>     I disagree with this.  I don't see how the thread can possibly
>     make a difference vis-à-vis parallel allocations.  They work
>     approximately the same either way.  In making this statement you 
>     are assuming that your thread is getting cpu cycles that magically
>     don't interfere with anything else going on in the system.  I
>     don't think you can make this statement without some more analysis.
> 
>     If you agree that dynamically adjusting the hysteresis points 
>     results in fewer thread wakeups, those same adjustments will also
>     result in fewer 'extra' on-the-fly actions.
>
> :  according to the configurable watermarks, you're ensuring that a
> :  certain number of objects are cached and ready for immediate
> :  allocations, again without taking longer than it takes to just
> :  retrieve the object being allocated from the cache.
> 
>     This is far from certain.  You are again assuming that your thread
>     is able to operate in a fixed period of time, without interfering
>     with other things going on (like user processes which are draining
>     TCP buffers and freeing mbufs back to the caches) to provide
>     this assurance.

  Yes, you're right, but the difference is that in most cases, with the
  kproc, you'll minimize the cost of most of the allocations and frees
  because the kproc will already have done the moving ahead of time.

  However, you seem to bring up a good corner-case example.  I still
  think that for network buffer allocations, with well-tuned watermarks,
  this situation won't be encountered often and, when it is, it can be
  remedied by careful adjusting of the watermarks.  Sure, adjusting the
  watermarks would influence the on-the-fly case as well.  But the
  kproc case has other advantages for the common case that the
  corner case you bring up ignores.  Notably, in the common case
  (when you don't have huge sweep-frees followed by huge sweep
  allocations going on) the kproc minimizes the number of times the main
  alloc/free code has to go to VM.
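
  Roughly, a single pass of the balancer kproc would look like this
  sketch: wake up, compare each cache against its watermarks, and move
  whole buckets between the pcpu caches, the global cache, and VM, so
  the common-case alloc/free path never has to talk to VM itself.  The
  names, bucket size, and watermarks are illustrative, not the real
  code:

```c
#define BUCKET       32   /* objects moved per transfer     */
#define PCPU_LOW     64   /* refill a pcpu cache below this */
#define GLOBAL_HIGH 256   /* return pages to VM above this  */

struct caches {
    int pcpu;     /* objects in a (single, for the sketch) pcpu cache */
    int global;   /* objects in the global cache                      */
    int vm;       /* objects' worth of pages handed back to VM        */
};

/* One balancer pass; returns the number of bucket moves performed. */
int balancer_pass(struct caches *c) {
    int moves = 0;
    /* Refill the pcpu cache from the global cache up to its low mark. */
    while (c->pcpu < PCPU_LOW && c->global >= BUCKET) {
        c->global -= BUCKET;
        c->pcpu   += BUCKET;
        moves++;
    }
    /* Give pages back to VM if the global cache is blown up. */
    while (c->global > GLOBAL_HIGH) {
        c->global -= BUCKET;
        c->vm     += BUCKET;
        moves++;
    }
    return moves;
}
```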

  I understand that the pageout daemon probably employs an algorithm
  that can get destabilized by large shifts of memory from one
  subsystem to another.  However, my argument is that the effect of
  moving slightly larger chunks of memory for network buffers does more
  good than harm.  There are more common cases than corner cases, and
  for the corner cases I think I could work out a decent recovery
  mechanism (the kproc could be temporarily 'turned off,' for
  example).

  Here's what I think I'll do in order to get what I'm sure we both want
  immediately without slowing down progress.  I'm going to implement the
  on-the-fly freeing to VM case (this is pretty trivial).  I'll present
  that and we can get that into the tree (so that we can at least
  recover resources following network spikes).  I'll keep the kproc code
  here and try to tune it to demonstrate eventually that it does the
  right thing and that corner cases are minimized.  I'll also try
  varying the number of objects per bucket, especially in the cluster
  case, and see where we go from there.  Keep in mind that because this
  is a specific network-buffer allocator, we may be able to get away
  with moving larger chunks of objects from a kproc without necessarily
  incurring all the bad effects of general-purpose allocation systems.
  
> :  Perhaps I can address your concerns if you give me a specific example
> :  where you think the daemon is doing a bad thing, then I can work on
> :  fixing that.  I think for corner cases it would even make sense to
> :  explicitly lower the watermarks (thus forcing the daemon to drain the
> :  caches) directly from the VM, if that's really determined to be an
> :  issue.
> :...
> :-- 
> :Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org
> 
> 
>     Well, Julian's example seemed pretty good, but it's not actually what
>     I am worried about the most.  What I am worried about the most is an
>     effect I saw on BEST Internet's heavily loaded machines quite often,
>     especially the old Challenge L's.  The effect I am worried about is
>     when system disk and/or network and/or cpu load becomes high enough
>     to create artificial slowdowns in apparently unrelated processes.
>     These slowdowns then lead to an increase in buffered data (like TCP
>     data) and processes completing their work less quickly, leading to
>     more processes as new connections come into the machine, and the
>     whole thing spiraling out of control.
> 
>     The advantage of doing things on the fly is that you can 'smooth the
>     curve'.  That is, you approach the point of unusability rather
>     than fall over a cliff where suddenly the machine is dead.  It took
>     an insane amount of effort to make the pageout daemon work that
>     way, and I'm afraid that your little process will require at least
>     as much work to achieve the same result.

  It's an interesting corner case, but instead of completely trashing
  the kproc idea (which does gain us something in common cases by
  minimizing interactions with VM), I'll see if I can tune it to react
  properly.  I'll look at what kind of gains we can get from more
  conservative moves by the kproc vis-à-vis larger buckets.  It's easy
  to tune these things without ripping anything else apart, specifically
  because network buffers are allocated in their own special way.
  
> 					-Matt
> 					Matthew Dillon 
> 					<dillon@backplane.com>

  Matt, thanks for still reading the lists and remaining concerned.

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org
