Date:      Mon, 17 Feb 2003 20:58:26 -0800 (PST)
From:      Matthew Dillon <dillon@apollo.backplane.com>
To:        Bosko Milekic <bmilekic@unixdaemons.com>
Cc:        freebsd-arch@FreeBSD.ORG
Subject:   Re: mb_alloc cache balancer / garbage collector
Message-ID:  <200302180458.h1I4wQiA048763@apollo.backplane.com>
References:  <20030216213552.A63109@unixdaemons.com> <15952.62746.260872.18687@grasshopper.cs.duke.edu> <20030217095842.D64558@unixdaemons.com> <200302171742.h1HHgSOq097182@apollo.backplane.com> <20030217154127.A66206@unixdaemons.com> <200302180000.h1I00bvl000432@apollo.backplane.com> <20030217192418.A67144@unixdaemons.com> <20030217192952.A67225@unixdaemons.com> <200302180101.h1I11AWr001132@apollo.backplane.com> <20030217203306.A67720@unixdaemons.com>

:>     I guess I still don't understand the point of the daemon.  The per-cpu
:>     caches are limited (in your patch) to 512 mbufs / 128 clusters.  This
:>     represents very little memory even if you multiply by ncpus.  We shouldn't
:>     have to 'balance' anything.  Who cares if there are 511 mbufs sitting
:>     on cpu 0's cache that aren't being used?  These numbers are going to be
:>     tuned for the machine (for example, based on the amount of main memory),
:>     and are far smaller than the total possible.
:
:  I never said that those (totally arbitrary, by the way) numbers are
:  ideal.  In fact, I think they should be changed.

    I can see adjusting them dynamically in an attempt to avoid hitting
    the hysteresis points too often, up to a point, but changing the
    numbers doesn't change the associated issues.  I expect the defaults
    you have chosen to work fairly well across a broad range.  You wouldn't
    want to make the numbers arbitrarily large just to avoid hysteresis,
    it would unbalance the rest of the system.  Nor is it a good idea to
    just assume that your garbage collection thread can magically solve
    all the degenerate cases that pop up under varying load conditions.
    The per-cpu maximums have to be fairly low relative to availability
    in the global queue or you will have our memory subsystem going in
    circles from thread to thread trying to shove memory around.

:>     The only case that matters is if a per-cpu cache gets blown up by an
:>     inordinate number of frees being done to it.  That is, when the mbuf
:>     or cluster count exceeds mbuf_limit or clust_limit.
:>
:>     Why is the daemon more preferable for handling this case verses freeing
:>     a bunch (like 8 or 16) mbufs/clusters on the fly at the time of the
:>     free when the per-cpu cache exceeds the limit?  I don't see any advantage
:>     to having the daemon at all, and I see several disadvantages.
:
:  You can't just 'free' a bunch of mbufs back to the VM.  You free them
:  wherever you got them from (usually your pcpu cache).  If you exceed
:  mbuf_limit on your pcpu cache you'll migrate a bucket over to the
:  global cache, which is what you want.  However if your global cache
:  becomes too 'blown up' as you say, then you may want to recover the
:  unused physical pages.  Doing that directly from the free has several
:  disadvantages.
:  It can be expensive in more ways than one; for one, the VM call
:  itself is extra overhead.  Secondly, sometimes freeing a page means
:  traversing the cache until you hit a page worth of free mbufs to free,
:  so even though you may really need to free a page you'll never
:  actually get to freeing it unless you start traversing the list of
:  buckets in the cache; and that's expensive for a simple free - common
:  case or not.

    Remember you are talking about two memory subsystems here.  There
    was a suggestion a little while back in the thread that a better
    solution might be to integrate the mbuf allocator with UMA.  That's
    really my main point.  Use UMA and solve the global cache -> global
    VM issue in UMA.
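
    To make that concrete, the allocation path would end up looking
    roughly like this (just a sketch from memory -- mb_ctor_mbuf,
    mb_dtor_mbuf, and the *_sketch function names are invented, and the
    exact uma_zcreate() argument list should be checked against
    vm/uma.h):

	/*
	 * Sketch only.  UMA already maintains per-cpu buckets and a
	 * global zone, and it already knows how to hand pages back to
	 * the VM, so the mbuf code would not have to reinvent any of it.
	 */
	static uma_zone_t zone_mbuf;

	static void
	mbuf_zone_init(void)
	{
		zone_mbuf = uma_zcreate("mbuf", MSIZE, mb_ctor_mbuf,
		    mb_dtor_mbuf, NULL, NULL, UMA_ALIGN_PTR, 0);
	}

	struct mbuf *
	m_get_sketch(void)
	{
		/* Allocation normally comes out of the per-cpu bucket. */
		return (uma_zalloc(zone_mbuf, M_NOWAIT));
	}

	void
	m_free_sketch(struct mbuf *m)
	{
		/* Free goes back to the per-cpu bucket. */
		uma_zfree(zone_mbuf, m);
	}

    Then the global cache -> VM problem is solved once, in UMA, instead
    of being solved separately inside the mbuf allocator.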

    I have to disagree with your idea of 'expense'.  At the point where
    freeing things on-the-fly becomes 'too expensive' your kernel thread
    will *already* be overloaded and messing up the system in other ways.

    Here's an example:  Let's say we have an extreme mbuf load.  Not so
    much in allocations, but in the *rate* of allocation and the *rate*
    of freeing.  Now let's say you hit a hysteresis point.  With the
    thread idea you wakeup your thread and continue on your merry way.
    You are assuming that your thread will be able to handle it.  But
    this may not be true.

    Now let's say you are doing things on the fly and hit the hysteresis
    point.  What will happen now is rather simple:  Once you go over the
    upper bound you need to free mbufs until you hit the lower bound.
    You want to free more than one at a time for efficiency, but you *don't*
    need to free all the mbufs at once.  What you do is simply
    free, say, 5 mbufs at a time for every call to free an mbuf until
    the levels drop to the lower bound.  In other words, latency can be
    fully controlled with an on-the-fly solution because it is fully
    self-pacing.
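
    In code form it is something like this (a sketch only -- every name
    in it is made up for illustration, this is not the actual mb_alloc
    code):

	/*
	 * Self-pacing free.  All names here are invented for the sake
	 * of the example.
	 */
	#define MB_FREE_BATCH	5		/* extra frees per call */

	void
	mb_free(struct mb_pcpu_cache *cache, struct mbuf *m)
	{
		int i;

		/* Common case: just put the mbuf back in the cache. */
		mb_cache_put(cache, m);

		/*
		 * Over the upper bound?  Push a small batch back to the
		 * global pool.  Each caller does a little of the work,
		 * so the cost is spread over the very frees that created
		 * the surplus, and it stops on its own once we drop back
		 * under the lower bound.
		 */
		if (cache->mb_count > cache->mb_hiwat) {
			for (i = 0; i < MB_FREE_BATCH &&
			    cache->mb_count > cache->mb_lowat; i++)
				mb_push_to_global(cache);
		}
	}

    The per-call latency is bounded by MB_FREE_BATCH, which is why I say
    it is self-pacing: the faster mbufs are being freed, the faster the
    excess drains, and nobody ever eats the whole cost at once.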

    Now let's go back and look at the thread.  Let's say something gets
    unbalanced and you hit your upper bound again, and start the thread
    going.  How many mbufs is the thread going to free at once?  Is it
    going to free the entire wad required to get back to the lower bound?
    How will this affect the latency of other processes?  Of the pageout
    daemon, for example, or even of user processes which until your thread
    started running were doing a fair job draining the TCP and UDP
    buffers they've been processing.  Unlike the on-the-fly method you
    can't really 'pace' the thread, because of the huge overhead in
    going to sleep every few milliseconds versus the overhead of freeing
    the mbufs.

    In other words, the question becomes:  How do you intend to control
    the latency your thread is now causing in the system?  I can pace
    the on-the-fly method trivially... in like four lines of code.  How
    do you solve the same problem with your thread?  It isn't as simple
    as giving it a fixed priority that is less than X and greater than Y.

:  By doing the freeing from the kproc context you're not interfering
:  with parallel allocations but you're also not taking longer than it
:  takes to just cache the data being freed for the free case.  That's a
:  big advantage.  By having the kproc also fill the pcpu caches

    I disagree with this.  I don't see how the thread can possibly 
    make a difference vis-a-vis parallel allocations.  They work
    approximately the same either way.  In making this statement you 
    are assuming that your thread is getting cpu cycles that magically
    don't interfere with anything else going on in the system.  I
    don't think you can make this statement without some more analysis.

    If you agree that dynamically adjusting the hysteresis points 
    results in fewer thread wakeups, those same adjustments will also
    result in fewer 'extra' on-the-fly actions.
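
    (Something along these lines, again with invented names, is all I
    mean by 'dynamic' adjustment:)

	/*
	 * Sketch, invented names: run once a second per cache.  Widen
	 * the hysteresis band if we crossed the upper bound too often
	 * during the last interval, decay it back when things are calm.
	 */
	static void
	mb_adjust_watermarks(struct mb_pcpu_cache *cache)
	{
		if (cache->mb_crossings > MB_CROSS_THRESH) {
			if (cache->mb_hiwat < MB_HIWAT_MAX)
				cache->mb_hiwat += MB_HIWAT_STEP;
		} else if (cache->mb_hiwat > MB_HIWAT_DEFAULT) {
			cache->mb_hiwat -= MB_HIWAT_STEP;
		}
		cache->mb_crossings = 0;
	}

    Either scheme benefits from that equally; it isn't an argument for
    the thread.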

:  according to the configurable watermarks you're ensuring you have a
:  certain number of objects cached and ready for immediate allocations,
:  again without taking longer than it takes to just retrieve the object
:  being allocated from the cache for the allocation case.

    This is far from certain.  You are again assuming that your thread
    is able to operate in a fixed period of time, without interfering
    with other things going on (like user processes which are draining
    TCP buffers and freeing mbufs back to the caches) to provide
    this assurance.

:  Perhaps I can address your concerns if you give me a specific example
:  where you think the daemon is doing a bad thing, then I can work on
:  fixing that.  I think for corner cases it would even make sense to
:  explicitly lower the watermarks (thus forcing the daemon to drain the
:  caches) directly from the VM, if that's really determined to be an
:  issue.
:...
:-- 
:Bosko Milekic * bmilekic@unixdaemons.com * bmilekic@FreeBSD.org


    Well, Julian's example seemed pretty good, but it's not actually what
    I am worried about the most.  What I am worried about the most is an
    effect I saw on BEST Internet's heavily loaded machines quite often,
    especially the old Challenge L's.  The effect I am worried about is
    when system disk and/or network and/or cpu load becomes high enough
    to create artificial slowdowns in apparently unrelated processes.
    These slowdowns then lead to an increase in buffered data (like TCP
    data) and processes completing their work less quickly, leading to
    more processes as new connections come into the machine, and the
    whole thing spiraling out of control.

    The advantage of doing things on the fly is that you can 'smooth the
    curve'.  That is, you approach the point of unusability rather
    than fall over a cliff and suddenly the machine is dead.  It took
    an insane amount of effort to make the pageout daemon work that
    way and I'm afraid that your little process will require at least
    as much work to achieve the same result.

					-Matt
					Matthew Dillon 
					<dillon@backplane.com>
