Date:      Fri, 20 Jun 2003 01:31:57 -0700
From:      Terry Lambert <tlambert2@mindspring.com>
To:        David Schultz <das@freebsd.org>
Cc:        arch@freebsd.org
Subject:   Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c
Message-ID:  <3EF2C67D.65F8A635@mindspring.com>
References:  <20030618112226.GA42606@fling-wing.demos.su> <39081.1055937209@critter.freebsd.dk> <20030619113457.GA80739@HAL9000.homeunix.com> <3EF2969F.4EE7D6D4@mindspring.com> <20030620061010.GA85747@HAL9000.homeunix.com>

David Schultz wrote:
> Yes, and my point was that it's important to maintain the
> separation, at least implicitly, in any new design.  I think this
> point was obvious to the people concerned before I even mentioned
> it, so there's no need to rehash it, but the designers of certain
> other operating systems seem to have missed it.

Well, Solaris "reinvented" the separate VM and buffer cache
in Solaris 8.  8-(.  I wasn't sure what you were recommending
from what you said.

> > This basically says that you need to stall dependency memory
> > allocation at a high watermark, and force the update clock to
> > tick until the problem is eliminated.  The acceleration of the
> > update clock that takes place today is insufficient for this:
> > you need to force the tick, wait for the completion, and force
> > the next tick, etc., until you get back to your low water mark.
> > If you just accelerate the clock, the hysteresis will keep you
> > in a constant state of thrashing.
> 
> Last year I was saying something similar to what you just said,
> before Kirk convinced me that I was wrong.  ;-)

8-) 8-).

> The main problem isn't metastability or the lack of deadlock
> detection, it's that some workloads reasonably require more
> dependency tracking than the buffer cache can accommodate.  At
> present, we can't track more than about 50 directories in the
> buffer cache.

I don't know if I buy this directly.  It's probably possible
to commit an incomplete tree at any subtree point, as long as
it's complete from the root.  Doing this, though, you would
have to switch from isochronous to synchronous processing on
that subtree for the remainder of its duration.  This works
because you use the associative property of the tree above it
to replace it with a single edge segment; other orphan subtrees
of the same tree all have to fall into the same mode.
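
A minimal sketch of what I mean, with all of the names made up
(this is not the real softdep code): once a subtree has been
committed ahead of its parents, every orphaned node under it is
forced into synchronous mode for the rest of its lifetime:

    #include <stdbool.h>
    #include <stddef.h>

    struct dep_node {
        struct dep_node *child;    /* first child in the dependency tree */
        struct dep_node *sibling;  /* next sibling under the same parent */
        bool sync_mode;            /* true: bypass softdep, write through */
    };

    /*
     * Mark an orphaned subtree synchronous.  The tree above it has
     * been collapsed to a single edge segment by the early commit,
     * so ordering inside the subtree must now be enforced by the
     * writes themselves.
     */
    static void
    mark_subtree_sync(struct dep_node *n)
    {
        struct dep_node *c;

        if (n == NULL)
            return;
        n->sync_mode = true;
        for (c = n->child; c != NULL; c = c->sibling)
            mark_subtree_sync(c);
    }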

This seems smart, but it's incredibly nasty unless you insert
a stall barrier and permit the existing elements to flush out
synchronously before adding more dependencies.  Otherwise, you
end up getting a load spike and (effectively) switching from
soft dependencies to synchronous writes, at least for the
busiest part of your dependency graph.
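
Roughly, the stall barrier I have in mind, as a sketch with
hypothetical names and water marks (nothing here is the real
softdep implementation):

    #include <stddef.h>

    #define DEP_HIWAT 1000     /* hypothetical high water mark */
    #define DEP_LOWAT  200     /* hypothetical low water mark */

    static size_t ndeps;       /* outstanding tracked dependencies */

    /*
     * Stand-in for forcing one update clock tick and waiting for
     * the resulting I/O to complete; returns how many dependencies
     * it retired.
     */
    static size_t
    flush_one_tick(void)
    {
        return (1);
    }

    /*
     * Called before a new dependency is recorded.  Above the high
     * water mark, stall the caller and drain synchronously down to
     * the low water mark, so the hysteresis can't park you at the
     * edge and keep you thrashing.
     */
    static void
    dep_admission(void)
    {
        if (ndeps < DEP_HIWAT)
            return;
        while (ndeps > DEP_LOWAT)
            ndeps -= flush_one_tick();
    }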

Hmmph.  I don't see a way around this, short of making the
update clock wheel bigger, and I don't see an easy way of doing
that while it has entries in it at all.  I think no matter what,
such a workload is going to end up in a degenerate case and
get you thrashing.

What was Kirk's answer?


> Still, the opposite problem of allowing the accumulation of many
> dependencies that have to be written anyway concerns me.  I guess
> that's where a clever flushing algorithm comes in.  [1] points out
> that Solaris 2.6 and 7 had a clever balancing algorithm between
> the FS and VM caches, too, but that wound up being tossed out in
> favor of a separate FS metadata cache in Solaris 8.  But Solaris
> doesn't do softupdates, so it doesn't have a tradeoff between
> memory pressure and effective dependency tracking.  So I don't know
> what the right answer is for FreeBSD.

Solaris 8 and up has its own bogons because of their
re-separation of the caches (as previously noted).  I understand
from a complexity perspective why they made the choice, but I'm
not sure it was right, even if they did have to face the problem
FreeBSD faces in this case.

Maybe the answer is to not let the relationship graph ever get
that big in the first place; effectively, the graph could only
be as many edges deep as it takes to circle the entire soft
updates clock wheel.

One thing that occurs to me is to not tick the wheel over until
you have data on it, and run a third hand on the two-handed
clock to make the decision on advancing the insertion pointer
vs. advancing the flushing pointer.  This would keep time-sparse,
locality-dense operations (e.g. put operation A in slot X, and
operation B in slot X+n) from getting too separated on the
wheel.  This wouldn't solve the problem, of course, but it would
greatly reduce the sparseness for consecutive dependent
operations that didn't happen back-to-back temporally.  It would
probably save you a factor of exp2(n-1), on average, for a
forced insertion separation of 'n' between dependent operations.
Your wheel could handle that many times more depth in the graph.
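
As a sketch (made-up names and a toy size; this is nothing like
the real syncer code), the third hand would run on each clock
interrupt and decide which of the other two hands to move:

    #include <stddef.h>

    #define WHEEL_SIZE 64              /* hypothetical wheel size */

    struct wheel_slot {
        size_t nitems;                 /* dependencies queued here */
    };

    static struct wheel_slot wheel[WHEEL_SIZE];
    static size_t insert_hand;         /* slot new dependencies land in */
    static size_t flush_hand;          /* next slot to be written out */

    /*
     * The "third hand": the insertion hand only advances when its
     * slot actually has data, so operations that are sparse in time
     * but dense in locality stay packed into adjacent slots instead
     * of being smeared around the wheel by empty ticks.
     */
    static void
    syncer_third_hand(void)
    {
        if (wheel[insert_hand].nitems != 0) {
            /* The slot has data: seal it, move insertion forward. */
            insert_hand = (insert_hand + 1) % WHEEL_SIZE;
        } else if (flush_hand != insert_hand) {
            /* Nothing new landed; spend the tick flushing instead. */
            /* flush_slot(&wheel[flush_hand]);  ...write the slot out */
            wheel[flush_hand].nitems = 0;
            flush_hand = (flush_hand + 1) % WHEEL_SIZE;
        }
    }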

But the quoted "50" is the ideal, when all dependent operations
occur in the same tick, given the current wheel size; all this
strategy does is up the number (the real number isn't 50, it's
unfortunately 'size - max_n - 1') by making them occur virtually
in the same tick, even if they are spread out temporally otherwise.
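
To put made-up numbers on it: with a 64-slot wheel and a worst
case forced insertion separation of n = 8 ticks between
dependent operations, the real depth is 64 - 8 - 1 = 55 rather
than the ideal for that wheel size, and making the operations
land virtually in the same tick buys back roughly the
exp2(8-1) = 128-fold average factor estimated above.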

I think the only answer is to come up with something other than
the wheel, or bite the bullet and stall all operations that want
to write to the wheel and flush it completely, when you hit some
high water mark.  The interactive response in this case could be
a pretty long dramatic pause... the same thing we had when the
lock on the buffers was owned by the writers, once queued,
rather than by the queue (so they could be second-chanced and
re-queued... Matt Dillon's work, if I remember correctly).


> > > The original buffer cache design is untenable largely because
> > > Dyson wanted to maintain compatibility with existing FS
> > > interfaces.
> >
> > At the time, the problem was that the vmobject_t's were not
> > reference counted, and allowed to be aliased.  [...]
> 
> You're describing a separate problem from the one I'm thinking of,
> but probably also a valid one.  My knowledge of BSD doesn't extend
> back that far.

Not really; the issue arose in the first place because the VM
implementation was, as Poul put it, "incomplete".  That was an
apt insight by Poul, and an important one.

The point I wanted to make is that FreeBSD should not throw the
baby out with the bathwater: the VM and buffer cache unification
was right, IMO, for a lot of reasons, even if it meant
disallowing the intentional aliases after the unintentional ones
were fixed, and making every vnode require a separate vmobject_t.

Yes, FreeBSD has historical baggage that needs someone to clear
it away, but undoing the VM and buffer cache unification is not
part of that; the baggage just accumulated concurrently with the
unification.

The unification *process* is probably the root cause of some of
the current evil, but the unification *per se* is not.

I just wanted to make it very clear.

I would desperately hate to lose the page-flipping trick that
FreeBSD plays that makes its pipes and UNIX domain sockets so
blazingly fast, compared to everyone else (and that's just one
example of many).

-- Terry


