From: Terry Lambert <tlambert2@mindspring.com>
Date: Fri, 20 Jun 2003 01:31:57 -0700
To: David Schultz
Cc: Dmitry Sivachenko, Poul-Henning Kamp, arch@freebsd.org
Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c

David Schultz wrote:
> Yes, and my point was that it's important to maintain the
> separation, at least implicitly, in any new design.  I think this
> point was obvious to the people concerned before I even mentioned
> it, so there's no need to rehash it, but the designers of certain
> other operating systems seem to have missed it.

Well, Solaris "reinvented" the separate VM and buffer cache in
Solaris 2.8.  8-(.  I wasn't sure what you were recommending from
what you said.

> > This basically says that you need to stall dependency memory
> > allocation at a high watermark, and force the update clock to
> > tick until the problem is eliminated.  The acceleration of the
> > update clock that takes place today is insufficient for this:
> > you need to force the tick, wait for the completion, and force
> > the next tick, etc., until you get back to your low water mark.
> > If you just accelerate the clock, the hysteresis will keep you
> > in a constant state of thrashing.
>
> Last year I was saying something similar to what you just said,
> before Kirk convinced me that I was wrong.  ;-)

8-) 8-).

> The main problem isn't metastability or the lack of deadlock
> detection, it's that some workloads reasonably require more
> dependency tracking than the buffer cache can accommodate.  At
> present, we can't track more than about 50 directories in the
> buffer cache.

I don't know if I buy this directly.  It's probably possible to
commit an incomplete tree, as long as it's complete from the root,
at any subtree point.  Doing this, though, you would have to switch
from isosynchronous to synchronous processing on the subtree for
the remainder of its duration.

This works because you use the associative property of the tree
above to replace it with a single edge segment; other orphan
subtrees of the same tree all have to fall into the same mode.

This seems smart, but it's incredibly nasty if you don't insert a
stall barrier and permit the existing elements to flush out
synchronously before adding more dependencies.  Otherwise, you end
up getting a load spike, and (effectively) switching from soft
dependencies to synchronous writes, at least for the busiest part
of your dependency graph.
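To pin down what I mean by stalling at a high watermark and
force-ticking down to a low one, here's a rough C sketch.  None of
these names are real softdep interfaces; softdep_deps, softdep_hiwat,
softdep_lowat, force_update_tick() and wait_for_flush_done() are all
invented for illustration:

    /*
     * Sketch only: stall new dependency allocation at a high
     * watermark, and force-tick the update clock, waiting for
     * each tick's writes to complete, until we drain back down
     * to a low watermark.  The flush completion path (not shown)
     * is what decrements softdep_deps.
     */
    static void force_update_tick(void);    /* invented: advance the wheel */
    static void wait_for_flush_done(void);  /* invented: wait for slot I/O */

    static int softdep_deps;            /* dependencies currently tracked */
    static int softdep_hiwat = 4096;    /* stall new allocations here */
    static int softdep_lowat = 1024;    /* resume allocations here */

    static void
    softdep_drain(void)
    {
            /*
             * Merely accelerating the clock is not enough: the
             * hysteresis keeps you thrashing around the high
             * watermark.  Force the tick, wait, repeat.
             */
            while (softdep_deps > softdep_lowat) {
                    force_update_tick();
                    wait_for_flush_done();
            }
    }

    static void
    softdep_new_dependency(void)
    {
            if (softdep_deps >= softdep_hiwat)
                    softdep_drain();
            softdep_deps++;
            /* ...allocate and link the new dependency... */
    }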
Hmmph.  I don't see a way around this short of making the update
clock wheel bigger, and I don't see an easy way of doing that while
it has entries in it at all.  I think no matter what, such a
workload is going to end up in a degenerate case and get you
thrashing.  What was Kirk's answer?

> Still, the opposite problem of allowing the accumulation of many
> dependencies that have to be written anyway concerns me.  I guess
> that's where a clever flushing algorithm comes in.  [1] points out
> that Solaris 2.6 and 7 had a clever balancing algorithm between
> the FS and VM caches, too, but that wound up being tossed out in
> favor of a separate FS metadata cache in Solaris 8.  But Solaris
> doesn't do softupdates, so it doesn't have a tradeoff between
> memory pressure and effective dependency tracking.  So I don't
> know what the right answer is for FreeBSD.

Solaris 8 and up has its own bogons because of their re-separation
of the cache (as previously noted).  I understand from a complexity
perspective why they made the choice, but I'm not sure it was
right, even if they did have to face the problem FreeBSD faces in
this case.

Maybe the answer is to not let the relationship graph ever get that
big in the first place; effectively, you would have to be however
many edges deep as it took to circle the entire soft updates clock
wheel.

One thing that occurs to me is to not tick the wheel over until you
have data on it, and to run a third hand on the two-handed clock to
make the decision on advancing the insertion pointer vs. advancing
the flushing pointer (a toy sketch appears below).  This would keep
time-sparse, locality-dense operations (e.g. put operation A in
slot X, and operation B in slot X+n) from getting too separated on
the wheel.

This wouldn't solve the problem, of course, but it would greatly
reduce the sparseness for consecutive dependent operations that
didn't happen back-to-back temporally.  It would probably save you
a factor of exp2(n-1), on average, for a forced insertion
separation of 'n' between dependent operations.  Your wheel could
then handle that many times more depth in the graph.

But the quoted "50" is the ideal, when all dependent operations
occur in the same tick, given the current wheel size (the real
number isn't 50, it's unfortunately 'size - max_n - 1'); all this
strategy does is raise the number, by making dependent operations
occur virtually in the same tick even if they are spread out
temporally otherwise.
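Here's a toy model of the three-handed clock, under my own reading
of the idea; wheel[], wheel_tick(), wheel_insert() and flush_slot()
are invented for illustration and bear no resemblance to the real
softdep worklist code:

    /*
     * Toy model: the flushing hand always advances, but the
     * insertion hand only advances past a slot that actually
     * received work.  The "third hand" is the policy check that
     * makes that decision, so that dependent operations issued
     * several ticks apart still land in adjacent slots.
     */
    #define WHEEL_SIZE 128

    static void flush_slot(int slot);   /* invented: drain one slot's work */

    static int wheel[WHEEL_SIZE];       /* work items queued per slot */
    static int insert_hand;             /* new work is placed from here */
    static int flush_hand;              /* this slot is flushed each tick */

    static void
    wheel_tick(void)
    {
            /* The flushing hand always advances, draining its slot. */
            flush_slot(flush_hand);
            flush_hand = (flush_hand + 1) % WHEEL_SIZE;

            /*
             * Third-hand decision: hold the insertion hand on an
             * empty slot, instead of letting real time sweep it
             * forward and scatter later dependent operations
             * around the wheel.
             */
            if (wheel[insert_hand] > 0)
                    insert_hand = (insert_hand + 1) % WHEEL_SIZE;
    }

    static void
    wheel_insert(int delay)
    {
            /* Queue work 'delay' (< WHEEL_SIZE) slots ahead. */
            wheel[(insert_hand + delay) % WHEEL_SIZE]++;
    }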
I think the only answer is to come up with something other than the
wheel, or to bite the bullet and stall all operations that want to
write to the wheel and flush it completely when you hit some high
water mark.  The interactive response in this case could be a
pretty long dramatic pause... the same thing we had when the lock
on the buffers was owned by the writers, once queued, instead of by
the queue (so they could be second-chanced and re-queued... Matt
Dillon's work, if I remember correctly).

> > > The original buffer cache design is untenable largely because
> > > Dyson wanted to maintain compatibility with existing FS
> > > interfaces.
> >
> > At the time, the problem was that the vmobject_t's were not
> > reference counted, and allowed to be aliased.  [...]
>
> You're describing a separate problem from the one I'm thinking
> of, but probably also a valid one.  My knowledge of BSD doesn't
> extend back that far.

Not really; the issue arose in the first place because the VM
implementation was, as Poul put it, "incomplete".  That was an apt
insight by Poul, and an important one.

The point I wanted to make is that FreeBSD should not throw the
baby out with the bathwater: the VM and buffer cache unification
was right, IMO, for a lot of reasons, even if it meant disallowing
the intentional aliases after the unintentional ones were fixed,
and making every vnode require a separate vmobject_t.

Yes, FreeBSD has historical baggage that needs someone to clear it
away, but undoing the VM and buffer cache unification is not part
of that; it's just something that happened concurrently.  The
unification *process* is probably the root cause of some of the
current evil, but the unification *per se* is not.  I just wanted
to make that very clear.

I would desperately hate to lose the page-flipping trick that
FreeBSD plays that makes its pipes and UNIX domain sockets so
blazingly fast compared to everyone else (and that's just one
example of many).

-- Terry