From: Bosko Milekic <bmilekic@unixdaemons.com>
To: Matthew Dillon <dillon@apollo.backplane.com>
Cc: freebsd-arch@FreeBSD.ORG
Date: Mon, 17 Feb 2003 19:24:18 -0500
Subject: Re: mb_alloc cache balancer / garbage collector

On Mon, Feb 17, 2003 at 04:00:37PM -0800, Matthew Dillon wrote:

> : What the daemon does is replenish the per-CPU caches (if necessary) in
> : one shot without imposing the overhead on the allocation path.  That
> : is, it'll move a bunch of buckets over to the per-CPU caches if they
> : are under-populated; doing that from the main allocation path is
> : theoretically possible but tends to produce high spiking in latency.
> : So what the daemon basically is, is a compromise between doing it in
> : the allocation/free path on-the-fly, and doing it from a parallel
> : thread.  Additionally, the daemon will empty part of the global cache
> :...
>
> Hmm.  Well, you can also replenish the per-CPU caches in bulk on the
> fly.  You simply pull in more than one buffer and you will reap the
> same overhead benefits in the allocation path.  If you depend on a
> thread to do this then you can create a situation where a chronic
> buffer shortage in the per-cpu cache can occur if the thread doesn't
> get cpu quickly enough, resulting in non-optimal operation.  In other
> words, while it may seem you are saving latency in the critical path
> (the network trying to allocate a buffer), I think you might actually
> be creating a situation where instead of latency you wind up with a
> critical shortage.

  Hmm, not quite.  You'd need to look at the code; there is no shortage
situation created here.  As I said, the model I employ is not a purely
balance-everything-from-the-daemon model.  It is a compromise.  In
other words, if you can't get an object from the per-CPU cache, you'll
try to get an object from the global cache.  If you can get an object
from the global cache, you'll take it and move a bucket of objects
from the global cache to the per-CPU cache for future use.  If you
can't get an object from the global cache either, it's OK, you'll
allocate from VM.
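
  In rough pseudo-C, that allocation path amounts to something like the
sketch below.  Keep in mind this is only to illustrate the point: the
identifiers are made up and are not the actual mb_alloc names.

    #include <stddef.h>             /* just for NULL in this sketch */

    struct bucket;                  /* roughly PAGE_SIZE worth of objects */
    struct pcpu_cache;              /* this CPU's cache of buckets */
    struct gen_cache;               /* general (global) cache */

    void           *pcpu_cache_get(struct pcpu_cache *);
    struct bucket  *gen_cache_get_bucket(struct gen_cache *);
    void           *bucket_take_one(struct bucket *);
    void            pcpu_cache_add_bucket(struct pcpu_cache *, struct bucket *);
    void           *vm_alloc_bucket_and_take_one(struct pcpu_cache *);

    void *
    mb_alloc_sketch(struct pcpu_cache *pcpu, struct gen_cache *gen)
    {
            struct bucket *bkt;
            void *obj;

            /* 1. Common case: an object is sitting in the per-CPU cache. */
            if ((obj = pcpu_cache_get(pcpu)) != NULL)
                    return (obj);

            /*
             * 2. Per-CPU cache empty: take one bucket from the general
             *    cache, keep one object from it, and leave the rest of
             *    the bucket in the per-CPU cache for future allocations.
             */
            if ((bkt = gen_cache_get_bucket(gen)) != NULL) {
                    obj = bucket_take_one(bkt);
                    pcpu_cache_add_bucket(pcpu, bkt);
                    return (obj);
            }

            /*
             * 3. General cache empty too: fall back to the VM for a
             *    single bucket (ultimately kmem_malloc()); any bulk
             *    refilling is left to the daemon, off the critical path.
             */
            return (vm_alloc_bucket_and_take_one(pcpu));
    }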
  The difference comes in the free case, where you'll free the object
to the bucket, wherever the bucket is sitting (usually this will be
your per-CPU cache, but in the non-common case it may be the global
cache).  You'll never flush any of the caches back to the VM, or move
anything more than a bucket of objects between caches, in the
allocation/deallocation paths.  The daemon takes care of that when it
can.  So, you don't have a resource shortage situation no matter what.

> I don't think VM interaction is that big a deal.  The VM system has a
> notion of a 'shortage' and a 'severe shortage'.  When you are
> allocating mbufs from the global VM system into the per-cpu cache you
> simply allocate up to some hysteresis point into the cache, or until
> the VM system gets low (but not severely low) on memory.  The
> hysteresis does not have to be much to reap the benefits and mitigate
> the overhead of the global mutex(es)... just 5 or 10 mbufs would
> mitigate global mutex overhead to the point where it becomes
> irrelevant.

  I already pretty much do this.  If I really need to, I *will*
_allocate_ up to a bucket of mbufs or clusters from VM.  A "bucket"
right now is PAGE_SIZE worth of objects, but that's modifiable.

> By creating a thread you are introducing more moving parts, and like
> a physical system these moving parts are going to interact with each
> other.  Remember, the VM system is *already* trying to ensure that
> enough free pages exist in the system.  If you have a second thread
> eating memory in large globs it is far more likely that you will
> destabilize the pageout daemon and create an oscillation between the
> two threads (pageout daemon and your balancer).  This might not turn
> up in benchmarks (which tend to focus on just one subsystem), but it
> could lead to some pretty nasty degenerate cases under heavy general
> loads.  I think it is far better to let the VM system do its job and
> pull the mbufs in on-the-fly in smaller chunks which are less likely
> to destabilize the pageout daemon.

  This will not happen in the common case.  The one exception is if
your caches are not balanced or are too low.  Assuming that the
watermarks are tuned properly, you should always have about the
average of the watermarks in your caches; if you don't, all the daemon
will do is replenish them to that value.  Once that's done, it won't
do any more replenishing unless you go low again.  Further, if you
spike and then return to normal, the free code will end up moving
buckets of objects back to the general cache, and the daemon will only
free back to the VM from the global cache; again, it won't free
everything, just enough to bring the general cache's object count back
to the average of the watermarks.  So, you can still allocate from the
VM in your allocation paths if you need to, but instead of wasting
time allocating a bunch of buckets, setting up your free object lists,
etc., etc., you'll only allocate one bucket and let the daemon do the
rest.  Also, keep in mind that the maps for mbufs and clusters are
finite, so no matter what you do, you're not going to be able to go
beyond the size of those maps.  The corner cases you're probably
thinking of are those where the rest of the system is strapped for
memory and your mbuf daemon may be holding on to too much.
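
  To make the watermark behaviour concrete, one pass of the daemon
amounts to roughly the following.  Again, the names are made up for
illustration; the "target" here is the average of the low and high
watermarks mentioned above.

    struct bucket;

    struct pcpu_cache {
            int     obj_count;      /* objects cached on this CPU */
            /* ... */
    };

    struct gen_cache {
            int     obj_count;      /* objects in the general cache */
            int     low_wmark;      /* tunable low watermark */
            int     high_wmark;     /* tunable high watermark */
            /* ... */
    };

    int             gen_cache_empty(struct gen_cache *);
    void            move_bucket_to_pcpu(struct gen_cache *, struct pcpu_cache *);
    struct bucket  *gen_cache_remove_bucket(struct gen_cache *);
    void            free_bucket_to_vm(struct bucket *);  /* ends in kmem_free() */

    void
    mb_balance_pass(struct pcpu_cache *pcpu, struct gen_cache *gen)
    {
            int target = (gen->low_wmark + gen->high_wmark) / 2;

            /*
             * Replenish an under-populated per-CPU cache from the
             * general cache, one bucket at a time, up to the target.
             * This is the bulk work the allocation path never has to
             * do itself.
             */
            while (pcpu->obj_count < target && !gen_cache_empty(gen))
                    move_bucket_to_pcpu(gen, pcpu);

            /*
             * After a spike, the free path will have pushed extra
             * buckets back into the general cache; release the excess
             * to the VM, but only down to the target, never below it.
             */
            while (gen->obj_count > target)
                    free_bucket_to_vm(gen_cache_remove_bucket(gen));
    }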
  The thing is that the daemon should not be over-allocating large
chunks unless the caches are really low anyway (you can set the low
watermark, keep that in mind).  Further, in the extreme case, you
could even have the VM system wake up the daemon to drain ALL the
caches in seriously extreme situations (but those are really corner
cases, in which case you're probably screwed anyway).

> This can be exacerbated (made even worse) if your balancing thread is
> given a high priority.  So you have the potential to starve the mbuf
> system if the balancing thread is too LOW a priority, and the
> potential to destabilize the VM system if the balancing thread is too
> HIGH a priority.
>
> Also, it seems to me that VM overheads are better addressed in the
> UMA subsystem, not in a leaf allocation subsystem.

  Again, this is not a leaf-allocation subsystem any more than the UMA
allocator is.  Both interface directly with kmem_malloc/kmem_free.

> -Matt

-- 
Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org

"If we open a quarrel between the past and the present, we shall find
 that we have lost the future."