From owner-freebsd-arch Mon Feb 17 12:41:55 2003
Date: Mon, 17 Feb 2003 15:41:27 -0500
From: Bosko Milekic
To: Matthew Dillon
Cc: Andrew Gallatin, freebsd-arch@FreeBSD.ORG
Subject: Re: mb_alloc cache balancer / garbage collector
Message-ID: <20030217154127.A66206@unixdaemons.com>
References: <20030216213552.A63109@unixdaemons.com> <15952.62746.260872.18687@grasshopper.cs.duke.edu> <20030217095842.D64558@unixdaemons.com> <200302171742.h1HHgSOq097182@apollo.backplane.com>
In-Reply-To: <200302171742.h1HHgSOq097182@apollo.backplane.com>; from dillon@apollo.backplane.com on Mon, Feb 17, 2003 at 09:42:28AM -0800

On Mon, Feb 17, 2003 at 09:42:28AM -0800, Matthew Dillon wrote:
> The work looks great, but I have to say I have grave reservations
> over any sort of thread-based cleanup / balancing code for a memory
> allocation subsystem.  The only advantage that I can see is that you
> get good L1 cache effects, but that is counterbalanced by a number of
> severe disadvantages which have taken a long time to clear up in other
> subsystems which use separate threads (pageout daemon, buf daemon,
> syncer, pmap garbage collection from the pageout daemon, etc).  Most of
> these daemons have very good reasons for needing a thread, but I can't
> think of any reason why a straight memory allocator would *require*
> a thread.
>
> Wouldn't it be easier and more scaleable to implement the hysteresis on
> the fly?  It sounds like it ought to be simple... you have a sysctl
> to set the per-cpu free cache size and hysteresis (for example, 32[8],
> aka upon reaching 32 free 32 - 8 = 24 to the global cache, keeping 8).
> Overflow goes into a global pool.  Active systems do not usually
> bounce from 0 to the maximum number of mbufs and back again, over
> and over again.  Instead they tend to have smaller swings and 'drift'
> towards the edges, so per-cpu hysteresis should not have to exceed
> 10% of the total available buffer space in order to reap the maximum
> locality of reference and mutex benefit.  Even in a very heavily loaded
> system I would expect something like 128[64] to be sufficient.  This
> sort of hysteresis could be implemented trivially in the main mbuf
> freeing code without any need for a thread and would have the same
> performance / L1 cache characteristics.  Additionally, on-the-fly
> hysteresis would be able to handle extreme situations that a thread
> could not (such as extreme swings), and on-the-fly hysteresis can
> scale in severe or extreme situations while a thread cannot.

  The allocator does do some on-the-fly hysteresis for the per-CPU
caches.  It will move a bucket over to the global cache if the pcpu
cache has gone above the high watermark and we're freeing.  It's
pretty easy to teach it to move more than a single bucket, too, if we
find that it's worthwhile.  Perhaps tuning the watermark code to do
this more efficiently is worth looking at.
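  Just to make that watermark behavior concrete, here's a toy sketch
of the kind of on-the-fly hysteresis we're both describing (32[8] in
your notation).  None of these names are the real mb_alloc interfaces;
it's just a single-CPU model of "free until the high watermark, then
drain down to the low watermark into the global cache":

/*
 * Toy model of on-the-fly watermark hysteresis.  All names here
 * (pcpu, gen, mb_free_one, ...) are made up for illustration; they
 * are not the actual mb_alloc interfaces.
 */
#include <stdio.h>

#define HIWAT   32      /* when a free brings us up to HIWAT ...          */
#define LOWAT   8       /* ... drain down to LOWAT into the global cache  */

struct cache {
    int nfree;                  /* buffers currently sitting in this cache */
};

static struct cache pcpu = { 0 };   /* one per-CPU cache (single CPU here) */
static struct cache gen  = { 0 };   /* the global ("general") cache */

/*
 * Free one buffer back to the per-CPU cache, migrating the excess to
 * the global cache once the high watermark is hit.  In the real
 * allocator this would move whole buckets under the per-CPU and
 * global cache locks rather than just bumping counters.
 */
static void
mb_free_one(void)
{
    pcpu.nfree++;
    if (pcpu.nfree >= HIWAT) {
        int excess = pcpu.nfree - LOWAT;

        gen.nfree += excess;
        pcpu.nfree = LOWAT;
        printf("drained %d to global cache (pcpu=%d, global=%d)\n",
            excess, pcpu.nfree, gen.nfree);
    }
}

int
main(void)
{
    int i;

    /* Simulate a burst of frees and watch the hysteresis kick in. */
    for (i = 0; i < 100; i++)
        mb_free_one();
    printf("final: pcpu=%d, global=%d\n", pcpu.nfree, gen.nfree);
    return (0);
}

  The interesting part is everything that sketch leaves out: the
locking, deciding how many buckets to move, and deciding when to give
pages back to the VM.  That's where the daemon comes in.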
  What the daemon does is replenish the per-CPU caches (if necessary)
in one shot, without imposing the overhead on the allocation path.
That is, it'll move a bunch of buckets over to the per-CPU caches if
they are under-populated; doing that from the main allocation path is
theoretically possible, but tends to produce high spikes in latency.
So the daemon is basically a compromise between doing everything in
the allocation/free path on the fly and doing everything from a
parallel thread.  Additionally, the daemon will empty part of the
global cache back to the VM if it needs to - that process is
relatively expensive and also produces irregularities in performance,
particularly if you decide to do it in the main free path.

  One of the things I really wanted to focus on was significantly
minimizing VM interactions during network buffer allocations and
frees.  Once you start minimizing the number of times you go back to
the VM by using watermarks, you inevitably increase the number of
checks you have to do in the regular free case.  If you then minimize
the number of checks and computations used to decide when to flush to
the VM and when not to, you often end up flushing too often.  So it's
a tradeoff, really.  Perhaps you'll eventually converge on some
"reasonable" compromise, but doing it from a thread scheduled in
parallel is even easier, as long as (as you say) there are no
complicated issues arising from the fact that you suddenly have a
daemon running in parallel and modifying the behavior of your
allocations.  In this case, though, the allocator was designed with
the idea that freeing and balancing would be implemented from a kproc
scheduled in parallel anyway, so I hope that those complexities are a
non-issue.

  So, in summary, the daemon here is not the only thing doing the
balancing; it's a "compromise," if you will, of both models.

  As for "extreme swings," I would tend to think it's the contrary.
Say you have a huge spike in allocations: with the current model
you'll be able to handle it even in the extreme case, and you'll be
able to recover via the kproc.  If you have a series of huge spikes,
this model may in fact work out better for you because, due to
scheduling, you may defer all attempts to balance the caches until
the end of the spike; you may actually avoid ping-ponging buckets
from cache to cache because you won't be relying on the spike data
for long-term balancing.

  Anyway, all this is pretty theoretical talk.  My intention is to
tune this thing and further evaluate its performance based on the
requirements of real-life applications.

> The same argument could also be applied to UMA, btw.
>
>                                       -Matt
>                                       Matthew Dillon
>

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org

"If we open a quarrel between the past and the present, we shall find
 that we have lost the future."

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message