Date: Wed, 27 Feb 2002 02:03:39 -0500 (EST)
From: Jeff Roberson <jroberson@chesapeake.net>
To: arch@freebsd.org
Subject: Slab allocator

I have patches available that implement a slab allocator. It was mostly inspired by the Solaris allocator of the same name, although I have deviated somewhat from their implementation. I will describe some of the high level features here.

Firstly, it has a zone-like interface, where objects of the same type/size are allocated from the same zone. This enables object caching, so users of the interface may depend on the state of an object upon allocation and skip potentially expensive initialization; this is sometimes referred to as type stable storage. (A short usage sketch appears further down in this mail.)

The design also allows memory to be freed back to the system on demand. Currently this is done by the pageout daemon: all zones are scanned for wholly free pages and may release them. I implemented a 20 second working set algorithm so that each zone tries to keep enough free items to satisfy the last 20 seconds worth of load. This stops zones from releasing memory they will just ask for again. The feature can be disabled via a zone creation time flag for objects which depend on what some folks call type stable storage; this is common among objects with a generation count. I alternately call this 'broken' and 'address stable storage'. I intend to fix the objects that rely on this behavior if this implementation is accepted.

There are also per-cpu queues of items, with a per-cpu lock. This allows for very efficient allocation and provides near linear performance as the number of cpus increases. I do still depend on Giant to talk to the back end page supplier (kmem_alloc, etc.). Once the VM is locked, the allocator will not require Giant at all.

Using this implementation I have replaced malloc with a wrapper that calls into the slab allocator. There is a zone for every malloc size, which allows us to use non-power-of-two malloc sizes and could yield significant memory savings. This approach also automatically gives us a fine-grained locked malloc. (A rough sketch of the wrapper idea is at the end of this mail.)

I would eventually like to pull other allocators into uma (the slab allocator). We could get rid of some of the kernel submaps and provide a much more dynamic supply of various resources. Two things I had in mind were pbufs and mbufs, which could easily come from uma. This gives us the ability to redistribute memory to wherever it is needed, rather than locking it in a particular place once it's there.

I'm sure you're all wondering about performance. At one point uma was much faster than the standard system, but then I got around to finishing it. ;-) At this point I see virtually no difference in kernel compile times compared to the original kernel.
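To give a rough idea of what using a zone looks like, here is a small usage sketch. It is illustrative only: the names follow the uma_ prefix mentioned above, but the real prototypes are the ones in uma.h from the tarball, not necessarily these, and the usual kernel headers are assumed.

    /*
     * Illustrative sketch only -- see uma.h in the tarball for the
     * real prototypes.  A zone is created once for an object type;
     * allocations then come from that zone, and freed items are
     * cached so their initialization can be reused.
     */
    struct foo {
            int     f_magic;
            /* ... real fields ... */
    };
    #define FOO_MAGIC       0x464f4f

    static uma_zone_t foo_zone;

    static void
    foo_init(void *mem, int size)
    {
            struct foo *f = mem;

            /* Expensive setup happens here, not on every allocation. */
            bzero(f, size);
            f->f_magic = FOO_MAGIC;
    }

    void
    foo_setup(void)
    {
            foo_zone = uma_zcreate("foo", sizeof(struct foo),
                NULL, NULL, foo_init, NULL, 0, 0);
    }

    struct foo *
    foo_alloc(void)
    {
            return (uma_zalloc(foo_zone, M_WAITOK));
    }

    void
    foo_free(struct foo *f)
    {
            uma_zfree(foo_zone, f);
    }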
Once more object initializers are implemented, performance should only improve. On workloads that cause heavy paging I have noticed considerable improvements due to the release of pages that were previously permanent. I will get some numbers on this soon; I have old statistics, but too much has changed for me to post them.

There are a few things that need to be fixed right now. For one, the zone statistics don't reflect the items that are in the per-cpu queues. I'm thinking about clean ways to collect this without locking every zone and per-cpu queue when someone calls sysctl. The other problem is the per-cpu buckets: they are a fixed size right now. I need to define several zones for the buckets to come from and a way to manage growing/shrinking the buckets.

There are a few things that I would really like comments on:

1) Should I keep the uma_ prefixes on exported functions/types?

2) How much of the malloc_type stats should I keep? They either require atomic ops or a lock in their current state. Also, non-power-of-two malloc sizes break their usage tracking.

3) Should I rename the files to vm_zone.c, vm_zone.h, etc.?

Since you've read this far, I'll let you know where the patch is. ;-)

http://www.chesapeake.net/~jroberson/uma.tar

This includes a patch to the base system that converts several previous vm_zone users to uma users, and it also provides a vm_zone wrapper for those that haven't been converted. I did this to minimize the diffs so it would be easier to review. The tarball also has vm/uma*, which you need to extract into your sys/ directory.

Any feedback is appreciated. I'd like to know what people expect from this before it is committable.

Jeff

PS Sorry for the long-winded email. :-)
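P.P.S. To make the malloc wrapper idea a little more concrete, here is the rough shape of it. This is only an illustration with made-up size classes and names, not the code in the patch:

    /*
     * Rough illustration only -- not the code in the patch.  A malloc
     * front end with one zone per size class; classes need not be
     * powers of two, and each zone brings its own per-cpu caching and
     * fine-grained locking.
     */
    static const int kmem_sizes[] = {
            16, 32, 48, 64, 96, 128, 192, 256, 384, 512, 768, 1024,
            2048, 4096
    };
    #define KMEM_NSIZES     (sizeof(kmem_sizes) / sizeof(kmem_sizes[0]))
    static uma_zone_t kmem_zones[KMEM_NSIZES];      /* created at boot */

    void *
    malloc(unsigned long size, struct malloc_type *type, int flags)
    {
            u_int i;

            /* malloc_type statistics are omitted in this sketch. */
            for (i = 0; i < KMEM_NSIZES; i++)
                    if (size <= (unsigned long)kmem_sizes[i])
                            return (uma_zalloc(kmem_zones[i], flags));
            /*
             * Requests larger than the biggest zone would fall back to
             * a page-level allocation; that path is omitted here.
             */
            return (NULL);
    }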