Date: Mon, 17 Feb 2003 20:58:26 -0800 (PST)
From: Matthew Dillon <dillon@apollo.backplane.com>
To: Bosko Milekic <bmilekic@unixdaemons.com>
Cc: freebsd-arch@FreeBSD.ORG
Subject: Re: mb_alloc cache balancer / garbage collector
Message-ID: <200302180458.h1I4wQiA048763@apollo.backplane.com>
References: <20030216213552.A63109@unixdaemons.com>
 <15952.62746.260872.18687@grasshopper.cs.duke.edu>
 <20030217095842.D64558@unixdaemons.com>
 <200302171742.h1HHgSOq097182@apollo.backplane.com>
 <20030217154127.A66206@unixdaemons.com>
 <200302180000.h1I00bvl000432@apollo.backplane.com>
 <20030217192418.A67144@unixdaemons.com>
 <20030217192952.A67225@unixdaemons.com>
 <200302180101.h1I11AWr001132@apollo.backplane.com>
 <20030217203306.A67720@unixdaemons.com>
:> I guess I still don't understand the point of the daemon.  The per-cpu
:> caches are limited (in your patch) to 512 mbufs / 128 clusters.  This
:> represents very little memory even if you multiply by ncpus.  We shouldn't
:> have to 'balance' anything.  Who cares if there are 511 mbufs sitting
:> on cpu 0's cache that aren't being used?  These numbers are going to be
:> tuned for the machine (for example, based on the amount of main memory),
:> and are far smaller than the total possible.
:
:   I never said that those (totally arbitrary, by the way) numbers are
:   ideal.  In fact, I think they should be changed.

    I can see adjusting them dynamically in an attempt to avoid hitting
    the hysteresis points too often, up to a point, but changing the
    numbers doesn't change the associated issues.  I expect the defaults
    you have chosen to work fairly well across a broad range.

    You wouldn't want to make the numbers arbitrarily large just to
    avoid hysteresis; it would unbalance the rest of the system.  Nor is
    it a good idea to just assume that your garbage collection thread
    can magically solve all the degenerate cases that pop up under
    varying load conditions.  The per-cpu maximums have to be fairly low
    relative to availability in the global queue or you will have our
    memory subsystem going in circles from thread to thread trying to
    shove memory around.

:> The only case that matters is if a per-cpu cache gets blown up by an
:> inordinate number of frees being done to it.  That is, when the mbuf
:> or cluster count exceeds mbuf_limit or clust_limit.
:>
:> Why is the daemon preferable for handling this case versus freeing
:> a bunch (like 8 or 16) mbufs/clusters on the fly at the time of the
:> free when the per-cpu cache exceeds the limit?  I don't see any advantage
:> to having the daemon at all, and I see several disadvantages.
:
:   You can't just 'free' a bunch of mbufs back to the VM.  You free them
:   wherever you got them from (usually your pcpu cache).  If you exceed
:   mbuf_limit on your pcpu cache you'll migrate a bucket over to the
:   global cache, which is what you want.  However, if your global cache
:   becomes too 'blown up', as you say, then you may want to recover the
:   unused physical pages.  Doing that directly from the free has several
:   disadvantages:
:   It can be expensive in more ways than one; for one, the VM call
:   itself is extra overhead.  Secondly, sometimes freeing a page means
:   traversing the cache until you hit a page worth of free mbufs to free,
:   so even though you may really need to free a page you'll never
:   actually get to freeing it unless you start traversing the list of
:   buckets in the cache; and that's expensive for a simple free - common
:   case or not.

    Remember you are talking about two memory subsystems here.  There
    was a suggestion a little while back in the thread that a better
    solution might be to integrate the mbuf allocator with UMA.  That's
    really my main point:  use UMA and solve the global cache -> global
    VM issue in UMA.

    I have to disagree with your idea of 'expense'.  At the point where
    freeing things on-the-fly becomes 'too expensive' your kernel thread
    will *already* be overloaded and messing up the system in other
    ways.

    Here's an example.  Let's say we have an extreme mbuf load.  Not so
    much in allocations, but in the *rate* of allocation and the *rate*
    of freeing.  Now let's say you hit a hysteresis point.  With the
    thread idea you wake up your thread and continue on your merry way.
    You are assuming that your thread will be able to handle it.  But
    this may not be true.

    Now let's say you are doing things on the fly and hit the hysteresis
    point.  What will happen now is rather simple:  once you go over the
    upper bound you need to free mbufs until you hit the lower bound.
    You want to free more than one at a time for efficiency, but you
    *don't* need to free all the mbufs at once.  What you do is simply
    free, say, 5 mbufs at a time for every call to free an mbuf until
    the levels drop to the lower bound.  In other words, latency can be
    fully controlled with an on-the-fly solution because it is fully
    self-pacing.
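    Purely as an illustration - this is not code from either patch, and
    every name and constant in it is an assumption made up for the
    example - the self-pacing, on-the-fly free described above might
    look roughly like this:

#define CACHE_HIWAT     512     /* upper hysteresis bound (assumed) */
#define CACHE_LOWAT     256     /* lower hysteresis bound (assumed) */
#define FREE_BATCH      5       /* extra objects released per free call */

struct obj {
        struct obj *next;
};

struct obj_cache {
        struct obj *head;       /* cached objects, singly linked */
        int count;              /* number of cached objects */
        int draining;           /* nonzero while paging down to LOWAT */
};

/* stand-in for handing memory back to the VM / underlying allocator */
static void
release_to_vm(struct obj *o)
{
        (void)o;
}

static void
cache_free(struct obj_cache *c, struct obj *o)
{
        /* common case: just push the object onto the cache */
        o->next = c->head;
        c->head = o;
        c->count++;

        /* hysteresis: start draining once we cross the upper bound */
        if (c->count > CACHE_HIWAT)
                c->draining = 1;

        /*
         * Self-pacing drain: while draining, each free releases at most
         * FREE_BATCH extra objects, so the cost is spread across the
         * very callers creating the pressure instead of being dumped on
         * a separate thread all at once.
         */
        if (c->draining) {
                int i;

                for (i = 0; i < FREE_BATCH && c->count > CACHE_LOWAT; i++) {
                        struct obj *victim = c->head;

                        c->head = victim->next;
                        c->count--;
                        release_to_vm(victim);
                }
                if (c->count <= CACHE_LOWAT)
                        c->draining = 0;
        }
}

    The point of the sketch is that the drain work is amortized over the
    same free() calls that created the pressure, so the pacing falls out
    naturally instead of having to be scheduled.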
    Now let's go back and look at the thread.  Let's say something gets
    unbalanced and you hit your upper bound again, and start the thread
    going.  How many mbufs is the thread going to free at once?  Is it
    going to free the entire wad required to get back to the lower
    bound?  How will this affect the latency of other processes?  Of the
    pageout daemon, for example, or even of user processes which until
    your thread started running were doing a fair job draining the TCP
    and UDP buffers they've been processing.  Unlike the on-the-fly
    method you can't really 'pace' the thread, because of the huge
    overhead in going to sleep every few milliseconds versus the
    overhead of freeing the mbufs.

    In other words, the question becomes:  how do you intend to control
    the latency your thread is now causing in the system?  I can pace
    the on-the-fly method trivially... in like four lines of code.  How
    do you solve the same problem with your thread?  It isn't as simple
    as giving it a fixed priority that is less than X and greater than
    Y.
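    For concreteness, a watermark-driven balancer pass of the sort being
    debated might look schematically like the sketch below.  This is not
    the actual mb_alloc code; the names, structures, and constants are
    all assumptions made up for illustration.  The open questions above
    are about how often this runs and how much it moves or frees per
    pass.

#define NCPU            4       /* assumed number of cpus */
#define PCPU_LOWAT      64      /* per-cpu lower watermark (made up) */
#define PCPU_HIWAT      512     /* per-cpu upper watermark (made up) */
#define GLOBAL_HIWAT    4096    /* global cache upper watermark (made up) */

struct cache {
        int count;              /* objects currently cached */
};

static struct cache pcpu_cache[NCPU];
static struct cache global_cache;

/* stand-ins for moving buckets between caches and back to the VM */
static void
move_objects(struct cache *from, struct cache *to, int n)
{
        from->count -= n;
        to->count += n;
}

static void
release_pages_to_vm(struct cache *c, int n)
{
        c->count -= n;          /* real code would free whole pages/buckets */
}

/* one balancer pass; the kproc would run this each time it wakes up */
static void
balancer_pass(void)
{
        int cpu;

        for (cpu = 0; cpu < NCPU; cpu++) {
                struct cache *c = &pcpu_cache[cpu];

                if (c->count > PCPU_HIWAT) {
                        /* drain an overfull per-cpu cache to the global one */
                        move_objects(c, &global_cache, c->count - PCPU_HIWAT);
                } else if (c->count < PCPU_LOWAT && global_cache.count > 0) {
                        /* pre-fill a low per-cpu cache from the global one */
                        int want = PCPU_LOWAT - c->count;

                        if (want > global_cache.count)
                                want = global_cache.count;
                        move_objects(&global_cache, c, want);
                }
        }

        /* if the global cache is blown up, hand pages back to the VM */
        if (global_cache.count > GLOBAL_HIWAT)
                release_pages_to_vm(&global_cache,
                    global_cache.count - GLOBAL_HIWAT);
}

    Note that all of the work in that loop is done in one burst when the
    kproc finally runs, whereas the on-the-fly version does the same
    work a few objects at a time from the contexts that are already
    allocating and freeing.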
:   By doing the freeing from the kproc context you're not interfering
:   with parallel allocations but you're also not taking longer than it
:   takes to just cache the data being freed for the free case.  That's a
:   big advantage.  By having the kproc also fill the pcpu caches

    I disagree with this.  I don't see how the thread can possibly make
    a difference vis-a-vis parallel allocations.  They work
    approximately the same either way.  In making this statement you are
    assuming that your thread is getting cpu cycles that magically don't
    interfere with anything else going on in the system.  I don't think
    you can make this statement without some more analysis.

    If you agree that dynamically adjusting the hysteresis points
    results in fewer thread wakeups, those same adjustments will also
    result in fewer 'extra' on-the-fly actions.

:   according to the configurable watermarks you're ensuring to have a
:   certain number of objects cached and ready for immediate allocations,
:   again without taking longer than it takes to just retrieve the object
:   being allocated from the cache for the allocation case.

    This is far from certain.  You are again assuming that your thread
    is able to operate in a fixed period of time, without interfering
    with other things going on (like user processes which are draining
    TCP buffers and freeing mbufs back to the caches), to provide this
    assurance.

:   Perhaps I can address your concerns if you give me a specific example
:   where you think the daemon is doing a bad thing, then I can work on
:   fixing that.  I think for corner cases it would even make sense to
:   explicitly lower the watermarks (thus forcing the daemon to drain the
:   caches) directly from the VM, if that's really determined to be an
:   issue.
:...
:--
:Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org

    Well, Julian's example seemed pretty good, but it's not actually
    what I am worried about the most.

    What I am worried about the most is an effect I saw on BEST
    Internet's heavily loaded machines quite often, especially the old
    Challenge L's.  The effect I am worried about is when system disk
    and/or network and/or cpu load becomes high enough to create
    artificial slowdowns in apparently unrelated processes.  These
    slowdowns then lead to an increase in buffered data (like TCP data)
    and to processes completing their work less quickly, which leads to
    more processes as new connections come into the machine, and the
    whole thing spiraling out of control.

    The advantage of doing things on the fly is that you can 'smooth the
    curve'; that is, you approach the point of unusability rather than
    fall over a cliff and suddenly find the machine dead.  It took an
    insane amount of effort to make the pageout daemon work that way,
    and I'm afraid that your little process will require at least as
    much work to achieve the same result.

                                        -Matt
                                        Matthew Dillon
                                        <dillon@backplane.com>

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-arch" in the body of the message