From: Bosko Milekic <bmilekic@unixdaemons.com>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: freebsd-arch@FreeBSD.ORG
Date: Mon, 17 Feb 2003 19:24:18 -0500
Subject: Re: mb_alloc cache balancer / garbage collector

On Mon, Feb 17, 2003 at 04:00:37PM -0800, Matthew Dillon wrote:

> : What the daemon does is replenish the per-CPU caches (if necessary) in
> : one shot without imposing the overhead on the allocation path.  That
> : is, it'll move a bunch of buckets over to the per-CPU caches if they
> : are under-populated; doing that from the main allocation path is
> : theoretically possible but tends to produce high spiking in latency.
> : So what the daemon basically is, is a compromise between doing it in
> : the allocation/free path on-the-fly, and doing it from a parallel
> : thread.  Additionally, the daemon will empty part of the global cache
> :...
>
> Hmm.  Well, you can also replenish the per-CPU caches in bulk on the
> fly.  You simply pull in more than one buffer and you will reap the
> same overhead benefits in the allocation path.  If you depend on a
> thread to do this then you can create a situation where a chronic
> buffer shortage in the per-cpu cache can occur if the thread doesn't
> get cpu quickly enough, resulting in non-optimal operation.  In other
> words, while it may seem you are saving latency in the critical path
> (the network trying to allocate a buffer), I think you might actually
> be creating a situation where instead of latency you wind up with a
> critical shortage.

  Hmm, not quite.  You'd need to look at the code; there is no shortage
situation created here.  As I said, the model I employ is not a purely
balance-everything-from-the-daemon model.  It is a compromise.  In
other words, if you can't get an object from the per-CPU cache, you'll
try to get an object from the global cache.  If you can get an object
from the global cache, you'll take it and move a bucket of objects
from the global cache to the per-CPU cache for future use.  If you
can't get an object from the global cache either, it's OK, you'll
allocate from VM.
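
  In rough pseudo-C, that allocation path amounts to something like the
sketch below.  Keep in mind this is only to illustrate the point: the
identifiers are made up and are not the actual mb_alloc names.

    #include <stddef.h>             /* just for NULL in this sketch */

    struct bucket;                  /* roughly PAGE_SIZE worth of objects */
    struct pcpu_cache;              /* this CPU's cache of buckets */
    struct gen_cache;               /* general (global) cache */

    void           *pcpu_cache_get(struct pcpu_cache *);
    struct bucket  *gen_cache_get_bucket(struct gen_cache *);
    void           *bucket_take_one(struct bucket *);
    void            pcpu_cache_add_bucket(struct pcpu_cache *, struct bucket *);
    void           *vm_alloc_bucket_and_take_one(struct pcpu_cache *);

    void *
    mb_alloc_sketch(struct pcpu_cache *pcpu, struct gen_cache *gen)
    {
            struct bucket *bkt;
            void *obj;

            /* 1. Common case: an object is sitting in the per-CPU cache. */
            if ((obj = pcpu_cache_get(pcpu)) != NULL)
                    return (obj);

            /*
             * 2. Per-CPU cache empty: take one bucket from the general
             *    cache, keep one object from it, and leave the rest of
             *    the bucket in the per-CPU cache for future allocations.
             */
            if ((bkt = gen_cache_get_bucket(gen)) != NULL) {
                    obj = bucket_take_one(bkt);
                    pcpu_cache_add_bucket(pcpu, bkt);
                    return (obj);
            }

            /*
             * 3. General cache empty too: fall back to the VM for a
             *    single bucket (ultimately kmem_malloc()); any bulk
             *    refilling is left to the daemon, off the critical path.
             */
            return (vm_alloc_bucket_and_take_one(pcpu));
    }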
  The difference comes in the free case, where you'll free the object
to the bucket, wherever the bucket is sitting (usually this will be
your per-CPU cache, but in the non-common case it may be the global
cache).  You'll never flush any of the caches back to the VM, or move
anything more than a bucket of objects between caches, in the
allocation/deallocation paths.  The daemon takes care of that when it
can.  So, you don't have a resource shortage situation no matter what.

> I don't think VM interaction is that big a deal.  The VM system has a
> notion of a 'shortage' and a 'severe shortage'.  When you are
> allocating mbufs from the global VM system into the per-cpu cache you
> simply allocate up to some hysteresis point into the cache, or until
> the VM system gets low (but not severely low) on memory.  The
> hysteresis does not have to be much to reap the benefits and mitigate
> the overhead of the global mutex(es)... just 5 or 10 mbufs would
> mitigate global mutex overhead to the point where it becomes
> irrelevant.

  I already pretty much do this.  If I really need to, I *will*
_allocate_ up to a bucket of mbufs or clusters from VM.  A "bucket"
right now is PAGE_SIZE worth of objects, but that's modifiable.

> By creating a thread you are introducing more moving parts, and like
> a physical system these moving parts are going to interact with each
> other.  Remember, the VM system is *already* trying to ensure that
> enough free pages exist in the system.  If you have a second thread
> eating memory in large globs it is far more likely that you will
> destabilize the pageout daemon and create an oscillation between the
> two threads (pageout daemon and your balancer).  This might not turn
> up in benchmarks (which tend to focus on just one subsystem), but it
> could lead to some pretty nasty degenerate cases under heavy general
> loads.  I think it is far better to let the VM system do its job and
> pull the mbufs in on-the-fly in smaller chunks which are less likely
> to destabilize the pageout daemon.

  This will not happen in the common case.  The one exception is if
your caches are not balanced or are too low.  Assuming that the
watermarks are tuned properly, you should always have about the
average of the watermarks in your caches; if you don't, all the daemon
will do is replenish them to that value.  Once that's done, it won't
do any more replenishing unless you go low again.  Further, if you
spike and then return to normal, the free code will end up moving
buckets of objects back to the general cache, and the daemon will only
free back to the VM from the global cache; again, it won't free
everything, just enough to bring the general cache's object count back
to the average of the watermarks.  So, you can still allocate from the
VM in your allocation paths if you need to, but instead of wasting
time allocating a bunch of buckets, setting up your free object lists,
etc., etc., you'll only allocate one bucket and let the daemon do the
rest.  Also, keep in mind that the maps for mbufs and clusters are
finite, so no matter what you do, you're not going to be able to go
beyond the size of those maps.  The corner cases you're probably
thinking of are those where the rest of the system is strapped for
memory and your mbuf daemon may be holding on to too much.
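
  To make the watermark behaviour concrete, one pass of the daemon
amounts to roughly the following.  Again, the names are made up for
illustration; the "target" here is the average of the low and high
watermarks mentioned above.

    struct bucket;

    struct pcpu_cache {
            int     obj_count;      /* objects cached on this CPU */
            /* ... */
    };

    struct gen_cache {
            int     obj_count;      /* objects in the general cache */
            int     low_wmark;      /* tunable low watermark */
            int     high_wmark;     /* tunable high watermark */
            /* ... */
    };

    int             gen_cache_empty(struct gen_cache *);
    void            move_bucket_to_pcpu(struct gen_cache *, struct pcpu_cache *);
    struct bucket  *gen_cache_remove_bucket(struct gen_cache *);
    void            free_bucket_to_vm(struct bucket *);  /* ends in kmem_free() */

    void
    mb_balance_pass(struct pcpu_cache *pcpu, struct gen_cache *gen)
    {
            int target = (gen->low_wmark + gen->high_wmark) / 2;

            /*
             * Replenish an under-populated per-CPU cache from the
             * general cache, one bucket at a time, up to the target.
             * This is the bulk work the allocation path never has to
             * do itself.
             */
            while (pcpu->obj_count < target && !gen_cache_empty(gen))
                    move_bucket_to_pcpu(gen, pcpu);

            /*
             * After a spike, the free path will have pushed extra
             * buckets back into the general cache; release the excess
             * to the VM, but only down to the target, never below it.
             */
            while (gen->obj_count > target)
                    free_bucket_to_vm(gen_cache_remove_bucket(gen));
    }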
  The thing is that the daemon should not be over-allocating large
chunks unless the caches are really low anyway (you can set the low
watermark, keep that in mind).  Further, in the extreme case, you
could even have the VM system wake up the daemon to drain ALL the
caches in seriously extreme situations (but those are really corner
cases, in which case you're probably screwed anyway).

> This can be exacerbated (made even worse) if your balancing thread is
> given a high priority.  So you have the potential to starve the mbuf
> system if the balancing thread is too LOW a priority, and the
> potential to destabilize the VM system if the balancing thread is too
> HIGH a priority.
>
> Also, it seems to me that VM overheads are better addressed in the
> UMA subsystem, not in a leaf allocation subsystem.

  Again, this is not a leaf-allocation subsystem any more than the UMA
allocator is.  Both interface directly with kmem_malloc/kmem_free.

> -Matt

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org

"If we open a quarrel between the past and the present, we shall find
 that we have lost the future."