From owner-freebsd-arch Mon Feb 17 12:41:55 2003
Date: Mon, 17 Feb 2003 15:41:27 -0500
From: Bosko Milekic
To: Matthew Dillon
Cc: Andrew Gallatin, freebsd-arch@FreeBSD.ORG
Subject: Re: mb_alloc cache balancer / garbage collector
Message-ID: <20030217154127.A66206@unixdaemons.com>
References: <20030216213552.A63109@unixdaemons.com> <15952.62746.260872.18687@grasshopper.cs.duke.edu> <20030217095842.D64558@unixdaemons.com> <200302171742.h1HHgSOq097182@apollo.backplane.com>
In-Reply-To: <200302171742.h1HHgSOq097182@apollo.backplane.com>; from dillon@apollo.backplane.com on Mon, Feb 17, 2003 at 09:42:28AM -0800

On Mon, Feb 17, 2003 at 09:42:28AM -0800, Matthew Dillon wrote:
> The work looks great, but I have to say I have grave reservations
> over any sort of thread-based cleanup / balancing code for a memory
> allocation subsystem.  The only advantage that I can see is that you
> get good L1 cache effects, but that is counterbalanced by a number of
> severe disadvantages which have taken a long time to clear up in other
> subsystems which use separate threads (pageout daemon, buf daemon,
> syncer, pmap garbage collection from the pageout daemon, etc).  Most of
> these daemons have very good reasons for needing a thread, but I can't
> think of any reason why a straight memory allocator would *require*
> a thread.
>
> Wouldn't it be easier and more scaleable to implement the hysteresis on
> the fly?  It sounds like it ought to be simple... you have a sysctl
> to set the per-cpu free cache size and hysteresis (for example, 32[8],
> aka upon reaching 32 free 32 - 8 = 24 to the global cache, keeping 8).
> Overflow goes into a global pool.  Active systems do not usually
> bounce from 0 to the maximum number of mbufs and back again, over
> and over again.  Instead they tend to have smaller swings and 'drift'
> towards the edges, so per-cpu hysteresis should not have to exceed
> 10% of the total available buffer space in order to reap the maximum
> locality of reference and mutex benefit.  Even in a very heavily loaded
> system I would expect something like 128[64] to be sufficient.  This
> sort of hysteresis could be implemented trivially in the main mbuf
> freeing code without any need for a thread and would have the same
> performance / L1 cache characteristics.  Additionally, on-the-fly
> hysteresis would be able to handle extreme situations that a thread
> could not (such as extreme swings), and on-the-fly hysteresis can
> scale in severe or extreme situations while a thread cannot.

  The allocator does do some on-the-fly hysteresis for the per-CPU
caches.  It will move a bucket over to the global cache if the pcpu
cache has gone above the high watermark and we're freeing.  It's
pretty easy to teach it to move more than a single bucket, too, if we
find that it's worthwhile.  Perhaps tuning the watermark code to do
this more efficiently is worth looking at.
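  Just to make that watermark behavior concrete, here's a toy sketch
of the kind of on-the-fly hysteresis we're both describing (32[8] in
your notation).  None of these names are the real mb_alloc interfaces;
it's just a single-CPU model of "free until the high watermark, then
drain down to the low watermark into the global cache":

/*
 * Toy model of on-the-fly watermark hysteresis.  All names here
 * (pcpu, gen, mb_free_one, ...) are made up for illustration; they
 * are not the actual mb_alloc interfaces.
 */
#include <stdio.h>

#define HIWAT   32      /* when a free brings us up to HIWAT ...          */
#define LOWAT   8       /* ... drain down to LOWAT into the global cache  */

struct cache {
    int nfree;                  /* buffers currently sitting in this cache */
};

static struct cache pcpu = { 0 };   /* one per-CPU cache (single CPU here) */
static struct cache gen  = { 0 };   /* the global ("general") cache */

/*
 * Free one buffer back to the per-CPU cache, migrating the excess to
 * the global cache once the high watermark is hit.  In the real
 * allocator this would move whole buckets under the per-CPU and
 * global cache locks rather than just bumping counters.
 */
static void
mb_free_one(void)
{
    pcpu.nfree++;
    if (pcpu.nfree >= HIWAT) {
        int excess = pcpu.nfree - LOWAT;

        gen.nfree += excess;
        pcpu.nfree = LOWAT;
        printf("drained %d to global cache (pcpu=%d, global=%d)\n",
            excess, pcpu.nfree, gen.nfree);
    }
}

int
main(void)
{
    int i;

    /* Simulate a burst of frees and watch the hysteresis kick in. */
    for (i = 0; i < 100; i++)
        mb_free_one();
    printf("final: pcpu=%d, global=%d\n", pcpu.nfree, gen.nfree);
    return (0);
}

  The interesting part is everything that sketch leaves out: the
locking, deciding how many buckets to move, and deciding when to give
pages back to the VM.  That's where the daemon comes in.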
  What the daemon does is replenish the per-CPU caches (if necessary)
in one shot, without imposing the overhead on the allocation path.
That is, it'll move a bunch of buckets over to the per-CPU caches if
they are under-populated; doing that from the main allocation path is
theoretically possible, but tends to produce high spikes in latency.
So the daemon is basically a compromise between doing everything in
the allocation/free path on the fly and doing everything from a
parallel thread.  Additionally, the daemon will empty part of the
global cache back to the VM if it needs to - that process is
relatively expensive and also produces irregularities in performance,
particularly if you decide to do it in the main free path.

  One of the things I really wanted to focus on was significantly
minimizing VM interactions during network buffer allocations and
frees.  Once you start minimizing the number of times you go back to
the VM by using watermarks, you inevitably increase the number of
checks you have to do in the regular free case.  If you then minimize
the number of checks and computations used to decide when to flush to
the VM and when not to, you often end up flushing too often.  So it's
a tradeoff, really.  Perhaps you'll eventually converge on some
"reasonable" compromise, but doing it from a thread scheduled in
parallel is even easier, as long as (as you say) there are no
complicated issues arising from the fact that you suddenly have a
daemon running in parallel and modifying the behavior of your
allocations.  In this case, though, the allocator was designed with
the idea that freeing and balancing would be implemented from a kproc
scheduled in parallel anyway, so I hope that those complexities are a
non-issue.

  So, in summary, the daemon here is not the only thing doing the
balancing; it's a "compromise," if you will, of both models.

  As for "extreme swings," I would tend to think it's the contrary.
Say you have a huge spike in allocations: with the current model
you'll be able to handle it even in the extreme case, and you'll be
able to recover via the kproc.  If you have a series of huge spikes,
this model may in fact work out better for you because, due to
scheduling, you may defer all attempts to balance the caches until
the end of the spike; you may actually avoid ping-ponging buckets
from cache to cache because you won't be relying on the spike data
for long-term balancing.

  Anyway, all this is pretty theoretical talk.  My intention is to
tune this thing and further evaluate its performance based on the
requirements of real-life applications.

> The same argument could also be applied to UMA, btw.
>
>                                       -Matt
>                                       Matthew Dillon
>

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org

"If we open a quarrel between the past and the present, we shall find
 that we have lost the future."

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message