Date: Wed, 27 Feb 2002 02:03:39 -0500 (EST)
From: Jeff Roberson <jroberson@chesapeake.net>
To: arch@freebsd.org
Subject: Slab allocator

I have patches available that implement a slab allocator. It was mostly inspired by the Solaris allocator of the same name, although I have deviated somewhat from their implementation. I will describe some of the high level features here.

Firstly, it has a zone-like interface, where objects of the same type/size are allocated from the same zone. This enables object caching, so users of the interface may depend on the state of an object upon allocation and skip potentially expensive initialization; this is sometimes referred to as type stable storage. (A short usage sketch appears further down in this mail.)

The design also allows memory to be freed back to the system on demand. Currently this is done by the pageout daemon: all zones are scanned for wholly free pages and may release them. I implemented a 20 second working set algorithm so that each zone tries to keep enough free items to satisfy the last 20 seconds worth of load. This stops zones from releasing memory they will just ask for again. The feature can be disabled via a zone creation time flag for objects which depend on what some folks call type stable storage; this is common among objects with a generation count. I alternately call this 'broken' and 'address stable storage'. I intend to fix the objects that rely on this behavior if this implementation is accepted.

There are also per-cpu queues of items, with a per-cpu lock. This allows for very efficient allocation and provides near linear performance as the number of cpus increases. I do still depend on Giant to talk to the back end page supplier (kmem_alloc, etc.). Once the VM is locked, the allocator will not require Giant at all.

Using this implementation I have replaced malloc with a wrapper that calls into the slab allocator. There is a zone for every malloc size, which allows us to use non-power-of-two malloc sizes and could yield significant memory savings. This approach also automatically gives us a fine-grained locked malloc. (A rough sketch of the wrapper idea is at the end of this mail.)

I would eventually like to pull other allocators into uma (the slab allocator). We could get rid of some of the kernel submaps and provide a much more dynamic supply of various resources. Two things I had in mind were pbufs and mbufs, which could easily come from uma. This gives us the ability to redistribute memory to wherever it is needed, rather than locking it in a particular place once it's there.

I'm sure you're all wondering about performance. At one point uma was much faster than the standard system, but then I got around to finishing it. ;-) At this point I see virtually no difference in kernel compile times compared to the original kernel.
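To give a rough idea of what using a zone looks like, here is a small usage sketch. It is illustrative only: the names follow the uma_ prefix mentioned above, but the real prototypes are the ones in uma.h from the tarball, not necessarily these, and the usual kernel headers are assumed.

    /*
     * Illustrative sketch only -- see uma.h in the tarball for the
     * real prototypes.  A zone is created once for an object type;
     * allocations then come from that zone, and freed items are
     * cached so their initialization can be reused.
     */
    struct foo {
            int     f_magic;
            /* ... real fields ... */
    };
    #define FOO_MAGIC       0x464f4f

    static uma_zone_t foo_zone;

    static void
    foo_init(void *mem, int size)
    {
            struct foo *f = mem;

            /* Expensive setup happens here, not on every allocation. */
            bzero(f, size);
            f->f_magic = FOO_MAGIC;
    }

    void
    foo_setup(void)
    {
            foo_zone = uma_zcreate("foo", sizeof(struct foo),
                NULL, NULL, foo_init, NULL, 0, 0);
    }

    struct foo *
    foo_alloc(void)
    {
            return (uma_zalloc(foo_zone, M_WAITOK));
    }

    void
    foo_free(struct foo *f)
    {
            uma_zfree(foo_zone, f);
    }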
Once more object initializers are implemented, performance should only improve. On workloads that cause heavy paging I have noticed considerable improvements due to the release of pages that were previously permanent. I will get some numbers on this soon; I have old statistics, but too much has changed for me to post them.

There are a few things that need to be fixed right now. For one, the zone statistics don't reflect the items that are in the per-cpu queues. I'm thinking about clean ways to collect this without locking every zone and per-cpu queue when someone calls sysctl. The other problem is the per-cpu buckets: they are a fixed size right now. I need to define several zones for the buckets to come from and a way to manage growing/shrinking the buckets.

There are a few things that I would really like comments on:

1) Should I keep the uma_ prefixes on exported functions/types?

2) How much of the malloc_type stats should I keep? They either require atomic ops or a lock in their current state. Also, non-power-of-two malloc sizes break their usage tracking.

3) Should I rename the files to vm_zone.c, vm_zone.h, etc.?

Since you've read this far, I'll let you know where the patch is. ;-)

http://www.chesapeake.net/~jroberson/uma.tar

This includes a patch to the base system that converts several previous vm_zone users to uma users, and it also provides a vm_zone wrapper for those that haven't been converted. I did this to minimize the diffs so it would be easier to review. The tarball also has vm/uma*, which you need to extract into your sys/ directory.

Any feedback is appreciated. I'd like to know what people expect from this before it is committable.

Jeff

PS Sorry for the long-winded email. :-)
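P.P.S. To make the malloc wrapper idea a little more concrete, here is the rough shape of it. This is only an illustration with made-up size classes and names, not the code in the patch:

    /*
     * Rough illustration only -- not the code in the patch.  A malloc
     * front end with one zone per size class; classes need not be
     * powers of two, and each zone brings its own per-cpu caching and
     * fine-grained locking.
     */
    static const int kmem_sizes[] = {
            16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024,
            2048, 4096
    };
    #define KMEM_NSIZES     (sizeof(kmem_sizes) / sizeof(kmem_sizes[0]))
    static uma_zone_t kmem_zones[KMEM_NSIZES];      /* created at boot */

    void *
    malloc(unsigned long size, struct malloc_type *type, int flags)
    {
            u_int i;

            /* malloc_type statistics are omitted in this sketch. */
            for (i = 0; i < KMEM_NSIZES; i++)
                    if (size <= (unsigned long)kmem_sizes[i])
                            return (uma_zalloc(kmem_zones[i], flags));
            /*
             * Requests larger than the biggest zone would fall back to
             * a page-level allocation; that path is omitted here.
             */
            return (NULL);
    }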