From owner-freebsd-current  Mon Nov 25 13: 2:44 2002
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 9354237B401; Mon, 25 Nov 2002 13:02:41 -0800 (PST)
Received: from tesla.distributel.net (nat.MTL.distributel.NET [66.38.181.24])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id AE45543EA9; Mon, 25 Nov 2002 13:02:40 -0800 (PST)
	(envelope-from bmilekic@unixdaemons.com)
Received: (from bmilekic@localhost)
	by tesla.distributel.net (8.11.6/8.11.6) id gAPL1Ms75728;
	Mon, 25 Nov 2002 16:01:22 -0500 (EST)
	(envelope-from bmilekic@unixdaemons.com)
Date: Mon, 25 Nov 2002 16:01:22 -0500
From: Bosko Milekic <bmilekic@unixdaemons.com>
To: Andrew Gallatin <gallatin@cs.duke.edu>
Cc: Julian Elischer <julian@elischer.org>,
	Robert Watson <rwatson@freebsd.org>, Luigi Rizzo <rizzo@icir.org>,
	current@freebsd.org
Subject: Re: mbuf header bloat ?
Message-ID: <20021125160122.A75673@unixdaemons.com>
References: <15840.8629.324788.887872@grasshopper.cs.duke.edu> <Pine.BSF.4.21.0211232306410.28833-100000@InterJet.elischer.org> <15841.17237.826666.653505@grasshopper.cs.duke.edu> <20021125130005.A75177@unixdaemons.com> <15842.27547.385354.151541@grasshopper.cs.duke.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.2.5.1i
In-Reply-To: <15842.27547.385354.151541@grasshopper.cs.duke.edu>; from gallatin@cs.duke.edu on Mon, Nov 25, 2002 at 01:27:39PM -0500
Sender: owner-freebsd-current@FreeBSD.ORG
Precedence: bulk
List-ID: <freebsd-current.FreeBSD.ORG>
List-Archive: <http://docs.freebsd.org/mail/> (Web Archive)
List-Help: <mailto:majordomo@FreeBSD.ORG?subject=help> (List Instructions)
List-Subscribe: <mailto:majordomo@FreeBSD.ORG?subject=subscribe%20freebsd-current>
List-Unsubscribe: <mailto:majordomo@FreeBSD.ORG?subject=unsubscribe%20freebsd-current>
X-Loop: FreeBSD.ORG


On Mon, Nov 25, 2002 at 01:27:39PM -0500, Andrew Gallatin wrote:
> Bosko Milekic writes:
>  > On Sun, Nov 24, 2002 at 04:23:33PM -0500, Andrew Gallatin wrote:
>  > > If we're going to nitpick the mbuf system, a much, much worse problem
>  > > is that you cannot allocate an mbuf chain w/o holding Giant, which
>  > > stems from the mbuf system eventually calling kmem_malloc().  This
>  > > effectively prevents any network driver from being giant-free.  When
>  > > mbufs are low, mb_alloc() calls mb_pop_cont().  This, in turn, calls
>  > > kmem_malloc() which requires Giant...
>  > 
>  >   This is not entirely true.  You can allocate an mbuf chain without
>  >   holding Giant if the caches are well populated - and they should be
>  >   in the common/general case.  You can in fact modify the allocator to
>  >   just not do a kmem_malloc() if called with M_DONTWAIT, but I don't
>  >   think you'd want to do this at this point.
>
> Unfortunately, the common case is not good enough when figuring out
> locking for network drivers and the uncommon case forces all network
> drivers under Giant.  I was thinking about doing what you describe,
> but my misconception about ref counters being malloced was holding me

  Firstly, it should be noted that the behavior of calling kmem_malloc()
  when its caches are empty is an old property that has been carried
  over from the original allocator - in other words, it is not something
  that I arbitrarily introduced.

  With that said, we may want to think about changing it or at least
  introducing a third case that would modify the semantics so that Giant
  would never be called.  For instance, in addition to M_TRYWAIT and
  M_DONTWAIT, you could invent M_CACHEONLY that would only attempt a
  quick cache allocation.  However, I'm unsure as to the advantages this
  would bring - if any.  Why is it so much of a problem if your driver
  wants to grab Giant?  Are you afraid of lock order reversals somewhere
  down the line?  Perhaps there is a misconception that we can clear up
  here before any work is done.
 
> back.  We'll also need a kproc that can wake up every now and then to
> expand the pool if allocations at interrupt time failed.   Or do you
> already have a mechanism for that?

  The intended mechanism is the kproc and when the allocator was first
  designed and written, I had taken that into account although I have
  not implemented the kproc yet.  The idea was to have a kproc maintain
  a balance of the caches without interfering with the general
  allocation cases.  The way this would be accomplished is actually
  fairly elegant, although we would have to be careful when doing the
  implementation:

  (i) Have the kproc make sure that if we're under a low watermark, do a
   block allocation from the mbuf and/or cluster maps and populate the
   general global cache (thereby not interfering with simultaneous
   frees to the per-CPU caches).

  (ii) Have the kproc check if we're above a high watermark on the
   general global cache and if so free some pages back to VM without
   interfering with simultaneous allocations on the per-CPU caches.

  (iii) Have the kproc do some bookkeeping and auto-tune its high and
   low watermarks.

  The key words in all three goals above is "not interfering with
  simultaneous allocations/frees."  This is a big advantage and is one
  of the reasons that the mbuf allocator exists.  What it means is that
  you can muck with the VM and indirectly populate the mbuf and cluster
  caches without interfering with the common {mbuf,cluster} free and
  allocation case.  Graphically:

  [ VM ] <--> [ global cache ] <--> [ per-CPU caches ]
          ^                      ^
          |                      |
      [ only             ]    [ common case allocations and ]
      [ done by          ]    [  frees                      ]
      [ kproc (to        ]
      [ be written)      ]
      [ and when         ]
      [ certain types of ]
      [ allocations fail ]	

  (I hope that came out OK).  As you can see, the concept is very simple
  and with the current infrastructure should be fairly easy to
  implement.  If you're wondering, I was going to do is as a 5.1 feature
  at some point because I've been so swamped with other things right now
  and so I do not foresee being able to do this in time for 5.0.

>  > > The mbuf system calls malloc in other ways too.  The first time you
>  > > use a cluster, m_ext.ext_ref_cnt is malloc()'ed, and malloc is called
>  > > when the mbuf map is expanded...   I assume malloc will eventually
>  > > call kmem_malloc(), leading to the same locking problems.
>  > 
>  >   Actually, that has been changed a while ago.  The ref_cnt for clusters
>  >   is no longer malloc()'d.  A contiguous space is used from which
>  >   cluster ref counters are "allocated" (just like in the old/original
>  >   design).  This modification was made a while ago as an optimisation.
> 
> Cool.  The following code confused me into making the above statement.  Is it
> out of date? 
> 
> #define _mext_init_ref(m, ref) do {                                     \
>         (m)->m_ext.ref_cnt = ((ref) == NULL) ?                          \
>             malloc(sizeof(u_int), M_MBUF, M_NOWAIT) : (u_int *)(ref);   \
>         if ((m)->m_ext.ref_cnt != NULL) {                               \
>                 *((m)->m_ext.ref_cnt) = 0;                              \
>                 MEXT_ADD_REF((m));                                      \
>         }                                                               \
> } while (0)

  It is not out of date.  The code means:

  "If you've given me a counter then I'll use it otherwise I'll try to
  allocate one with malloc()."

  And the cluster allocation code will provide a counter by indexing
  into a linear contiguous page array (note that this is also why
  clusters are required to come from a seperate contiguous address map -
  otherwise this wouldn't work).

> Thanks,
> 
> Drew

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message