Date: Mon, 12 Dec 2005 02:48:52 +0100
From: Johan Bucht <bucht@acc.umu.se>
To: Jason Evans
Cc: current@freebsd.org
Subject: Re: New libc malloc patch
Message-ID: <20051212014852.GA8775@shaka.acc.umu.se>

Ugh, I forgot that I used a different SMTP server from the one subscribed
to the list, so the mail shouldn't have made it through; if it did, I
apologize. I also remembered one more thing, about locking granularity.

Just a few comments. I'm not sure all of them are valid, since I don't
have a recent enough system to apply the patch and test it, so they are
based purely on reading the patch; I may have missed something hidden
among all the differing lines. Can you put up a patched malloc.c
somewhere? I wrote a proof-of-concept parallel allocator for my Master's
thesis this spring, and some of the comments below are about differences
and similarities between our implementations.

* Fork safety functions
Nice to have for all allocators, and something I missed having. I would
like to have these regardless of whether your malloc becomes the standard
one. (A sketch of the usual approach follows after these first points.)

* Magazines / per-arena caches
Unless I missed them in my quick reading, you don't seem to have these.
They can really help scalability, and they are implemented in Solaris's
libumem allocator with great success. I implemented them, and they
brought scalability up quite a bit on the systems I tested my allocator
on, as well as bringing allocation latency down. (Sketched below.)

* Size lookup at free
I saw that you use a tree to look up allocation sizes at free time; this
differs from how I and Solaris's libumem do it. I instead prepend an
8/16-byte tag containing the size of the allocation, giving lockless
size lookup. (Sketched below.)
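To illustrate the fork-safety point: the usual approach is
pthread_atfork(3) hooks that take every allocator lock before fork() and
release them in both parent and child, so the child never inherits a lock
held by a thread that didn't survive the fork. A minimal sketch;
lock_all_arenas() and unlock_all_arenas() are made-up placeholders for
walking whatever locks the allocator has:

#include <pthread.h>

void lock_all_arenas(void);     /* placeholder: acquire every lock */
void unlock_all_arenas(void);   /* placeholder: release every lock */

static void
malloc_prefork(void)
{
        /* Quiesce the allocator so fork() sees consistent state. */
        lock_all_arenas();
}

static void
malloc_postfork(void)
{
        /* Runs in both parent and child after fork(). */
        unlock_all_arenas();
}

static void
malloc_fork_init(void)
{
        pthread_atfork(malloc_prefork, malloc_postfork, malloc_postfork);
}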
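And a rough sketch of what I mean by a magazine layer, assuming
per-thread storage (__thread TLS) is usable from inside the allocator;
arena_alloc()/arena_free() are made-up stand-ins for the real, locked
slow path:

#define NUM_SIZE_CLASSES 32             /* illustrative */
#define MAG_SIZE         16             /* objects cached per class */

void *arena_alloc(int sclass);          /* placeholder slow path */
void  arena_free(int sclass, void *p);  /* placeholder slow path */

struct magazine {
        void    *objs[MAG_SIZE];        /* recently freed objects */
        int      nobjs;
};

static __thread struct magazine mags[NUM_SIZE_CLASSES];

void *
mag_alloc(int sclass)
{
        struct magazine *m = &mags[sclass];

        if (m->nobjs > 0)
                return (m->objs[--m->nobjs]);   /* lock-free fast path */
        return (arena_alloc(sclass));           /* refill from arena */
}

void
mag_free(int sclass, void *p)
{
        struct magazine *m = &mags[sclass];

        if (m->nobjs < MAG_SIZE)
                m->objs[m->nobjs++] = p;        /* lock-free fast path */
        else
                arena_free(sclass, p);          /* return to arena */
}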
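The tag scheme is basically the following; raw_alloc()/raw_free() are
made-up page-level primitives, and the 16-byte header keeps the caller's
alignment:

#include <stddef.h>

#define TAG_SIZE 16                     /* preserves 16-byte alignment */

void *raw_alloc(size_t size);           /* placeholder */
void  raw_free(void *p, size_t size);   /* placeholder */

struct tag {
        size_t   size;                  /* size of the user allocation */
};

void *
tagged_malloc(size_t size)
{
        struct tag *t = raw_alloc(size + TAG_SIZE);

        if (t == NULL)
                return (NULL);
        t->size = size;
        return ((char *)t + TAG_SIZE);  /* user memory after the tag */
}

void
tagged_free(void *p)
{
        struct tag *t = (struct tag *)((char *)p - TAG_SIZE);

        /* The size sits right in front of the pointer: no tree, no lock. */
        raw_free(t, t->size + TAG_SIZE);
}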
* Bucket sizes
Unless I missed something, you are using power-of-two sizes. It might be
better to use smaller size classes to reduce internal fragmentation, as
well as to improve locking granularity.

* Locking granularity
You use a single lock for the chunk allocation instead of a lock per
bucket size or something similar, which limits concurrency unless threads
allocate from their private arenas. Since you hash to get an arena, there
may be multiple threads accessing the same arena at the same time,
introducing contention. A lock per allocation size might improve the
situation somewhat.

* Locking primitive
The biggest issue, as David Xu pointed out, is probably the locking
primitive. The SPINLOCK use has a limit in the threading library and
makes it hard to have a lot of mutexes. I ended up using a wrapper around
the umtx_lock function to get recursive mutexes; it would probably be
better to extend the umtx functions to handle recursion, and other malloc
implementations would probably appreciate that too. It might also be
interesting to pick up some ideas from the Linux futex implementation to
help umtx. (A sketch of the wrapper follows after these points.)

* sbrk vs mmap
I see you added the ability to try brk first and fall back to mmap if brk
fails. This is similar to my implementation, as well as to ptmalloc.
Looks good. (Sketched below.)

* Big allocations
It might be preferable to use mmap/munmap directly for big allocations
(bigger than a few pages) to avoid the overhead of trying to be smart.
Such allocations should be few enough that they aren't hurt much by this,
and it also saves memory, since big allocations can be returned to the
kernel immediately at free. (Sketched below.)
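The recursive wrapper I mentioned under "Locking primitive" amounts to
something like this. low_lock()/low_unlock() are placeholders for the
underlying non-recursive primitive (umtx in my case; I don't have the
exact interface in front of me), and thr_self(2) supplies the thread id:

#include <sys/thr.h>

struct lowlock;                         /* placeholder primitive */
void low_lock(struct lowlock *);        /* placeholder */
void low_unlock(struct lowlock *);      /* placeholder */

struct rmutex {
        struct lowlock  *lock;          /* underlying lock */
        long             owner;         /* owning thread id, 0 if free */
        int              depth;         /* recursion count */
};

void
rmutex_lock(struct rmutex *m)
{
        long self;

        thr_self(&self);
        if (m->owner == self) {
                /* We already hold it; only the owner can see its own id. */
                m->depth++;
                return;
        }
        low_lock(m->lock);
        m->owner = self;
        m->depth = 1;
}

void
rmutex_unlock(struct rmutex *m)
{
        if (--m->depth == 0) {
                m->owner = 0;
                low_unlock(m->lock);
        }
}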
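The brk-then-mmap fallback is essentially this (chunk alignment and error
handling stripped for brevity):

#include <sys/mman.h>
#include <stddef.h>
#include <unistd.h>

void *
chunk_alloc(size_t size)
{
        void *p;

        p = sbrk(size);                 /* try the data segment first */
        if (p != (void *)-1)
                return (p);
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        return (p == MAP_FAILED ? NULL : p);
}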
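And for big allocations, the direct path I have in mind looks roughly
like this; the threshold is just an illustrative number, and the caller
is assumed to know the size at free time (a size tag works for that):

#include <sys/mman.h>
#include <stddef.h>

#define BIG_THRESHOLD   (8 * 4096)      /* "a few pages", illustrative */

void *
big_alloc(size_t size)
{
        void *p;

        if (size < BIG_THRESHOLD)
                return (NULL);          /* small sizes: normal path */
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
            MAP_ANON | MAP_PRIVATE, -1, 0);
        return (p == MAP_FAILED ? NULL : p);
}

void
big_free(void *p, size_t size)
{
        /* Pages go straight back to the kernel, a real memory saving. */
        munmap(p, size);
}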
* Ownership
I didn't look much for this, so I might have missed it. Basically: what
happens if you free memory that was allocated in another arena? Do you
return it to the arena it came from, or to the current thread's arena?
Not returning it to the original arena leads to cache-line sharing;
returning it adds some overhead, though it can be argued that
applications doing this aren't scalable anyway and should be fixed
instead.

* Standalone version
It would be nice to have a standalone version, to test the allocator on
Solaris and Linux and compare it against their allocators; nice for all
the people whose only access to SMP machines runs another OS. I tested my
allocator on both Solaris 9 and FreeBSD 5/6, and that was helpful both
for measuring scalability and impact on small machines and for finding
bugs in the implementation. I ran it against the Solaris libc allocator,
Hoard and libumem on a 4-CPU Solaris 9 machine (4x UltraSPARC IIIi,
1600 MHz, 16384 MB RAM) at the university, and against the FreeBSD libc
allocator on my 400 MHz laptop with too much beer and an almost
non-working CPU fan. Beer as in standing in 10 cm of beer in my
backpack =). I'm not sure how much of your implementation is
FreeBSD-dependent (the RB-trees?). Solaris 9 has the benefit of a local
malloc inside libc, which makes the pthread library functions accessible
from the malloc implementation. I avoided touching the libc malloc, since
I really didn't care about single-threaded apps, and simply used
LD_PRELOAD to override it. One consequence: I throw away memory whenever
an allocation is a multiple of the page size, since I prepend a tag to
the memory and allocate whole pages, so an 8192-byte allocation becomes
an 8200-byte one and wastes a page.

* Focus: single-threaded vs multithreaded
What is the focus of the allocator, and how much memory is it worth
wasting to get a good, scalable implementation? I suspect this is why
Solaris still keeps a serial allocator in libc and makes libumem
available through LD_PRELOAD and friends.

* Benchmarks
It would be really nice to see a few benchmarks against phkmalloc,
ptmalloc, Hoard and libumem: purely arena-local allocation loads, the
Larson benchmark, and some others. It might also be good to see how much
more memory it consumes compared to phkmalloc.

* Comment
It might be wise to fix the comment that says mmap is always used, since
brk can now be used as well.

* Conclusion
Despite not having tested it, the patch looks good, and it would be nice
to have it in FreeBSD. Some parts are independent of the allocator itself
and would be nice to have regardless of the general consensus.

-- 
/Johan Bucht
bucht@acc.umu.se