Date: Mon, 17 Feb 2003 10:05:24 -0800
From: Terry Lambert <tlambert2@mindspring.com>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: Bosko Milekic <bmilekic@unixdaemons.com>, Andrew Gallatin <gallatin@cs.duke.edu>, freebsd-arch@FreeBSD.ORG
Subject: Re: mb_alloc cache balancer / garbage collector
Message-ID: <3E512464.2D37555B@mindspring.com>
References: <20030216213552.A63109@unixdaemons.com> <15952.62746.260872.18687@grasshopper.cs.duke.edu> <20030217095842.D64558@unixdaemons.com> <200302171742.h1HHgSOq097182@apollo.backplane.com>
Matthew Dillon wrote:
> The work looks great, but I have to say I have grave reservations
> over any sort of thread-based cleanup / balancing code for a memory
> allocation subsystem.  The only advantage that I can see is that you
> get good L1 cache effects, but that is counterbalanced by a number of
> severe disadvantages which have taken a long time to clear up in other
> subsystems which use separate threads (pageout daemon, buf daemon,
> syncer, pmap garbage collection from the pageout daemon, etc).  Most of
> these daemons have very good reasons for needing a thread, but I can't
> think of any reason why a straight memory allocator would *require*
> a thread.

The classic Sequent paper on this uses a two-level garbage collector,
but avoids the Mach "mark and sweep" style approach with a separate
GC process by maintaining GC information dynamically and coalescing
when it becomes possible: when blocks are released to the
coalesce-to-page layer, the accounting is performed at that time
(see the first sketch at the end of this message).

See the paper (1993): http://citeseer.nj.nec.com/484408.html

> Wouldn't it be easier and more scalable to implement the hysteresis on
> the fly?  It sounds like it ought to be simple... you have a sysctl
> to set the per-cpu free cache size and hysteresis (for example, 32[8],
> aka upon reaching 32, free 32 - 8 = 24 to the global cache, keeping 8).
> Overflow goes into a global pool.  Active systems do not usually
> bounce from 0 to the maximum number of mbufs and back again, over
> and over again.  Instead they tend to have smaller swings and 'drift'
> towards the edges, so per-cpu hysteresis should not have to exceed
> 10% of the total available buffer space in order to reap the maximum
> locality of reference and mutex benefit.  Even in a very heavily loaded
> system I would expect something like 128[64] to be sufficient.  This
> sort of hysteresis could be implemented trivially in the main mbuf
> freeing code without any need for a thread and would have the same
> performance / L1 cache characteristics.  Additionally, on-the-fly
> hysteresis would be able to handle extreme situations that a thread
> could not (such as extreme swings), and on-the-fly hysteresis can
> scale in severe or extreme situations while a thread cannot.
>
> The same argument could also be applied to UMA, btw.

The one drawback in your approach is that, in the worst case, you
can end up hitting the global pool on every allocation.  It's better
to bound the transfer sizes and add a third layer, as Sequent did in
the Dynix allocator.  By bounding transfers to some fixed multiple
of the allocation unit, you get to amortize the cost of taking the
global lock (see the second sketch below).  With a simple hysteresis,
you effectively end up implementing a sliding, rather than a fixed,
window size, and that makes the reclaimer vastly more complicated
than it needs to be (FWIW).

If you want to use that paper as a starting point, there is plenty
of more recent work (the McKenney/Slingwine paper is ten years old
now), though most of it is in the context of NUMA/iNUMA systems,
which seem to be bad words around here these days...
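First, a minimal sketch of the "do the accounting at release time"
idea from the Sequent paper.  None of these names (page_hdr,
page_release, page_return_to_vm, BUFS_PER_PAGE) are from the paper
or from mb_alloc; they are made up for illustration:

/*
 * Sketch: a coalesce-to-page layer that does its GC accounting at
 * release time, so no separate mark-and-sweep pass (and no GC
 * thread walking pages) is ever needed.  All names hypothetical.
 */
#include <stddef.h>
#include <stdio.h>

#define BUFS_PER_PAGE	16	/* allocation units carved per page */

struct buf {
	struct buf *next;
};

struct page_hdr {
	struct buf *free_list;	/* buffers released back to this page */
	int free_count;		/* maintained on every release */
};

/* Stand-in for handing a fully-free page back to the VM system. */
static void
page_return_to_vm(struct page_hdr *pg)
{
	printf("page %p fully free, returned to VM\n", (void *)pg);
}

/*
 * Called when the middle layer pushes a buffer down to the page
 * layer.  The accounting happens here, at release time, and the
 * page is coalesced the instant it becomes possible.
 */
void
page_release(struct page_hdr *pg, struct buf *b)
{
	b->next = pg->free_list;
	pg->free_list = b;
	if (++pg->free_count == BUFS_PER_PAGE)
		page_return_to_vm(pg);
}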
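Second, a rough sketch of what I mean by bounded, batched transfers
between the per-CPU layer and the global layer.  Again, the names
and structure (pcpu_cache, gpool, HIGH_WATER, BATCH) are invented,
and I'm using a pthread mutex as a stand-in for the kernel mutex;
the watermarks follow Matt's 32[8] example:

/*
 * Sketch: per-CPU free lists with watermarked, fixed-size batch
 * transfers to/from a global pool.  Not the mb_alloc code.  Per
 * the 32[8] example: on reaching 32 free, move 32 - 8 = 24 to the
 * global pool in one locked operation, keeping 8.
 */
#include <pthread.h>
#include <stddef.h>

#define HIGH_WATER	32	/* per-CPU free list high watermark */
#define BATCH		24	/* fixed transfer size; amortizes the lock */

struct buf {
	struct buf *next;
};

struct pcpu_cache {		/* one per CPU; no locking needed */
	struct buf *head;
	int count;
};

static struct {
	pthread_mutex_t lock;
	struct buf *head;
	int count;
} gpool = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };

/*
 * Free path: push locally; on hitting the high watermark, hand
 * exactly BATCH buffers to the global pool under one lock
 * acquisition instead of trickling them back one at a time.
 */
void
buf_free(struct pcpu_cache *pc, struct buf *b)
{
	b->next = pc->head;
	pc->head = b;
	if (++pc->count < HIGH_WATER)
		return;

	struct buf *batch = pc->head, *tail = batch;
	for (int i = 1; i < BATCH; i++)	/* detach the first BATCH bufs */
		tail = tail->next;
	pc->head = tail->next;
	pc->count -= BATCH;

	pthread_mutex_lock(&gpool.lock);
	tail->next = gpool.head;
	gpool.head = batch;
	gpool.count += BATCH;
	pthread_mutex_unlock(&gpool.lock);
}

/*
 * Allocation path: refill in BATCH-sized units, so the worst case
 * is one global lock acquisition per BATCH allocations, not one
 * lock acquisition per allocation.
 */
struct buf *
buf_alloc(struct pcpu_cache *pc)
{
	if (pc->head == NULL) {
		pthread_mutex_lock(&gpool.lock);
		for (int i = 0; i < BATCH && gpool.head != NULL; i++) {
			struct buf *b = gpool.head;
			gpool.head = b->next;
			gpool.count--;
			b->next = pc->head;
			pc->head = b;
			pc->count++;
		}
		pthread_mutex_unlock(&gpool.lock);
		if (pc->head == NULL)
			return (NULL);	/* fall through to the page layer */
	}
	struct buf *b = pc->head;
	pc->head = b->next;
	pc->count--;
	return (b);
}

The point is just that the global lock is taken at most once per
BATCH operations on either path; with an unbounded (sliding-window)
transfer size you lose that bound, and the reclaimer has to get
smarter to compensate.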
-- Terry