From: Bosko Milekic <bmilekic@technokratis.com>
To: freebsd-arch@freebsd.org
Subject: Mbuf slab [new allocator]
Date: Thu, 3 May 2001 19:59:04 -0400 (EDT)
Message-ID: <20010503195904.A53281@technokratis.com>

Hello,

Anyone interested in the mbuf subsystem code should probably read this. Others may still read it, but it is somewhat longer than your average Email, so consider this a warning. :-) Also, although I tried my best to cover most issues here, feel free to let me know if I should clarify some points.

Not so long ago, as I'm sure some of you remember, Alfred committed a patch to the mbuf code that collapsed what were then three separate mutex locks - one for the mbuf free list, one for the cluster free list, and one for the m_ext counter free list - into a single mbuf lock covering all three free lists. After an argument with Alfred, I caved (as I sometimes do) and agreed with his assertion that, overall, this would be a good move: we would probably move to a slab allocator for the mbuf subsystem, and when we did, a single lock for all three object types would reduce cache-invalidation ping-ponging from one CPU to the other.
Earlier still, Alfred had written what he calls the "memcache allocator." I think it presently needs work to run (I'm not sure it even compiles in its current state). He put the code here:

	http://people.freebsd.org/~alfred/memcache/

After reading the memcache code, I was generally happy with it. Alfred has provided documentation backing up the overall performance of the slab allocator [that is, not necessarily the performance of `memcache' itself, but rather the concepts that memcache was built on], and after thinking about the whole thing for a while I agreed with Alfred that a slab allocator would be a worthy thing to have for mbuf and mcluster allocations. However, some things clearly needed to be done differently from memcache, and I wanted to keep the flexibility to micro-optimize the mbuf-related allocations in the future.

Also, mbuf-related allocations work differently from `regular' allocations: the status quo is that we do not free VM pages back to the map that we allocate them from. Once we allocate a page, we take the address space from the map, wire it down, and never return the address space. This saves us from having to free back to VM when freeing mbufs - a relatively expensive task - and allows us to keep mbufs and mclusters `cached' on free lists, from which we can allocate quickly. Sure, memcache provides the option of ensuring that a certain (tunable) number of objects stays cached on free lists, but when memcache cannot find a page and needs to allocate one, it relies entirely on the VM code to do the blocking for it, until a page is freed back to the map or until one can be found (depending on the nature of the exhaustion). For the mbuf allocator, we needed to deal with a third type of exhaustion: with mbuf allocations, you have a finite amount of address space which shrinks as the number of allocations grows.
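To make that third kind of exhaustion concrete, here is a minimal userland model (all names and constants here are made up for illustration, not the actual kernel code): address space taken from the map is wired and never returned, so the map can only shrink, and once it is exhausted the allocator's only recourse is to wait for consumed objects to come back to its free lists.

```c
#include <assert.h>

#define MB_MAP_PAGES 4          /* pretend address-space limit of the map */

static int pages_used;          /* pages wired so far; never given back */

/* Model of a "grab a page from the map" request: it succeeds while the
 * map still has address space, and that space is consumed permanently. */
static int
map_page_alloc(void)
{
	if (pages_used < MB_MAP_PAGES) {
		pages_used++;   /* address space is gone for good */
		return (1);
	}
	return (0);             /* map exhausted: the caller must instead
				 * wait for objects freed to the lists */
}
```

Nothing ever decrements pages_used; that is the point - the map is a one-way resource, and blocking for it once it is empty would never wake up.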
So, when you see that you have no more mbufs or mclusters to allocate, you have to deal with the problem in one of two ways:

1 - either you have address space left in your map, in which case you are entitled to a page and may very well block in the VM code until you get it; or

2 - you have already allocated all the pages that you are entitled to (i.e. no more address space), and so your only option is to drain the protocols and block waiting for a consumed object (mbuf or cluster) to become available. This blocking is done in the mbuf code and is implemented with a condition variable.

So clearly, these two conditions needed to be maintained. Furthermore, I wanted to leave open the possibility of implementing some sort of `lazy freeing' of pages, done periodically in the spirit of your classic garbage collector. In short, I wanted memcache with several twists and turns, as is usually required for the mbuf code. So, thanks to Alfred's patience in answering some questions I had, I decided to write an mbuf slab-like allocator. I put a relatively early version of the code here:

	http://people.freebsd.org/~bmilekic/code/mb_slab/

If you're still with me, a few notes regarding the code:

- I split the allocator code out into kern_mbuf.c and left the mbuf utility routines in uipc_mbuf.c.

- The mbuf.h macros are a lot cleaner/smaller; I plan to make them even smaller eventually.

- Matt Dillon had some really neat suggestions for implementing very fast allocations from the per-CPU lists that the allocator maintains. Some complications arose from the fact that we have pre-emption enabled. Work-arounds were also proposed, but I left the optimization out of this code for now. We can deal with it later, and I am interested in perhaps inlining, if size permits, a portion of the `quick allocation' path in the macros once more, but for now I'd rather it remain a function call in all cases.
- The general allocation and free cases are very fast - very comparable to what we presently have. The downside is slightly more code due to tougher list manipulation during allocation, but the upside (along with other things that I don't feel like mentioning right now) is that contention is reduced in MP environments. So it comes out about even [*], if not better, due to other improvements. Unfortunately, I can't really benchmark any of this, so please don't ask (although feel free to try yourself and/or to refer to the documentation available on the Internet discussing the overall performance advantages of slab allocators).

[*] Yeah, this is a very rough inductive prediction, although it is supported by many factors.

- There is more documentation in kern_mbuf.c itself.

- mbstats is broken for now. Re-implementing statistics is relatively easy, and I left it out of this first version.

- I have been running this code for over a week now on a dual-processor CURRENT machine. It has been running fine, although I have not tried to forcibly exhaust the maps to see how it behaves, as I have not yet adjusted kmem_malloc() to work exactly as it needs to for proper blocking to occur. In case you're wondering exactly what kmem_malloc() needs to be able to do: it just needs to mark a map_starved flag once the map it is allocating from has been exhausted, if that map is mbuf-related (just as it presently does for mb_map). The point is that kmem_malloc() shouldn't block itself if the map is exhausted, because it would never wake up - we never free anything back to the map.

- Free lists in `buckets' (you'll get it if you glance at the code) are implemented not by linking mbufs or clusters in a singly-linked fashion, as is done in the current mbuf code. Rather, a pointer array is maintained (like memcache does). This avoids having the thread freeing an object write to the object (and consequently likely invalidate the producer's cache).
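As a rough sketch of the pointer-array idea (the struct and function names here are invented for illustration, not the ones in mb_slab): freeing pushes the object's address into an array inside the bucket, so the freeing thread never writes into the object's own memory.

```c
#include <assert.h>
#include <stddef.h>

#define BUCKET_NOBJS 8          /* objects per bucket; illustrative */

/* Hypothetical bucket: free objects are tracked by a pointer array
 * rather than by threading a list through the objects themselves. */
struct mb_bucket {
	void	*mb_free[BUCKET_NOBJS];
	int	 mb_nfree;
};

/* Freeing writes only into the bucket, never into *obj, so the
 * producer's cached copy of the object is left alone. */
static void
mb_bucket_free(struct mb_bucket *b, void *obj)
{
	b->mb_free[b->mb_nfree++] = obj;
}

static void *
mb_bucket_alloc(struct mb_bucket *b)
{
	return (b->mb_nfree > 0 ? b->mb_free[--b->mb_nfree] : NULL);
}
```

Allocation pops the most recently freed object, which is also the one most likely to still be warm in a cache somewhere.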
Unfortunately, this turns out to suck very much for m_ext counters, as they are 4 bytes each. Having a pointer array for a page of 4-byte objects consumes another entire page on the x86, for instance: a 4096-byte page holds 1024 counters, and 1024 four-byte pointers fill another 4096 bytes. I am thinking of alternatives for m_ext counters because of this problem. In fact, I am toying with the idea of replacing allocatable m_ext counters with something else - perhaps storing the counter in some area of the cluster or whatever other provided m_ext buffer (sf_buf, etc.).

- You'll note that mb_map is subdivided into three submaps: one for mbufs, one for clusters, one for counters. This is done for several reasons. One is to avoid having cluster allocations consume the portion of the map explicitly reserved for mbufs, thus reducing the total number of allocatable mbufs (because we never free that space back to the map; once we have consumed it for clusters, it stays for clusters). Peter Wemm suggested eliminating mb_map and the submaps and making mbuf allocations directly from mb_map's parent map, kmem_map, with the total limited by a sysctl-tunable variable. Alfred suggested having lazy freeing done back to kmem_map on behalf of the mbuf subsystem, when needed, to avoid having mbufs and clusters completely consume kmem_map. I was toying with implementing this as well but ran into some problems. Notably, in order to take the address of an object being freed and retrieve its corresponding `bucket structure' (again, if you glance at the code, this will make sense), I need to be able to produce a unique index from the address of the page in which the object is found.
This is easy to do in the present implementation of mb_slab, as each type of object (mbuf or cluster) has its own map, so I can safely compute (page_address - base_of_map) / PAGE_SIZE and use that to index into a pointer array holding the address of my bucket (memcache does a similar thing, whereas our implementation of kernel malloc() uses a slightly different hashing method). The obvious solution would be to keep a sparse pointer table accounting for every page in kmem_map, but the resulting table would be a little too large for what it's worth.

Well, that's about all I can think of for the moment. I would appreciate feedback from you folks, both on the strictly technical aspects of this Email and on other suggestions you may have regarding the issues presented above. Unfortunately, my final exam period begins next week, and I will likely be unable to do any real work on this until about two weeks from now. After exams, I would like to finish cleaning up this version of the allocator (without Peter's suggestion and the `lazy freeing' implemented yet) and hopefully commit that first (unless serious objections arise, as usual). Once that's done, I would continue working on the latter two, if we can work out the problems, and on whatever else comes up or requires improvement.

Regards,

-- 
Bosko Milekic
bmilekic@technokratis.com