From owner-freebsd-fs Thu Nov 18 15:20:50 1999
Delivered-To: freebsd-fs@freebsd.org
Received: from cs.columbia.edu (cs.columbia.edu [128.59.16.20])
	by hub.freebsd.org (Postfix) with ESMTP id 8BCD81508E;
	Thu, 18 Nov 1999 15:20:45 -0800 (PST)
	(envelope-from ezk@shekel.mcl.cs.columbia.edu)
Received: from shekel.mcl.cs.columbia.edu (shekel.mcl.cs.columbia.edu [128.59.18.15])
	by cs.columbia.edu (8.9.1/8.9.1) with ESMTP id SAA29976;
	Thu, 18 Nov 1999 18:20:44 -0500 (EST)
Received: (from ezk@localhost)
	by shekel.mcl.cs.columbia.edu (8.9.1/8.9.1) id SAA15756;
	Thu, 18 Nov 1999 18:20:43 -0500 (EST)
Date: Thu, 18 Nov 1999 18:20:43 -0500 (EST)
Message-Id: <199911182320.SAA15756@shekel.mcl.cs.columbia.edu>
X-Authentication-Warning: shekel.mcl.cs.columbia.edu: ezk set sender to ezk@shekel.mcl.cs.columbia.edu using -f
From: Erez Zadok
To: Eivind Eklund
Cc: Erez Zadok, fs@FreeBSD.ORG
Subject: Re: namei() and freeing componentnames
In-reply-to: Your message of "Thu, 18 Nov 1999 15:32:20 +0100." <19991118153220.E45524@bitbox.follo.net>
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

In message <19991118153220.E45524@bitbox.follo.net>, Eivind Eklund writes:

> [Note to impatient readers - a forward view is included at the bottom
> of this mail]
>
> On Mon, Nov 15, 1999 at 06:12:09PM -0500, Erez Zadok wrote:
> > In message <19991112000359.A256@bitbox.follo.net>, Eivind Eklund writes:
[...]
> The problem I'm finding with VOP_RELEASEND() is that namei() can
> return two different vps - the dvp (directory vp) and the actual vp
> (inside the directory dvp points at), and that neither of these is
> always available.
>
> As I am writing the code right now, I am using either of these, with a
> preference for the dvp.  I am considering splitting VOP_RELEASEND()
> into VOP_RELEASEND() and VOP_DRELEASEND(), which take the different
> VPs as parameters - this will at least give something that is easy to
> search for if we need to change the behaviour somehow.
I found similarly "annoying" functionality in Solaris's open() routine.
Sometimes it can return a new dvp, sometimes NULL, and sometimes a copy
of or reference to another vnode (I think due to dup() stuff).

From my POV, after having ported stackable templates to several OSs, I
have found that vnode/vfs functions that try to do too much make the
life of a stackable f/s developer harder.  Functions that behave
differently under different (input) conditions are also hard to work
with.  The reason is that stackable file systems have to be
layer-independent: they have to treat the file system on which they are
stacked as if they were the VFS calling that layer, and at the same
time they must appear to the VFS as a low-level f/s.  IOW, a stackable
f/s is both a VFS and a lower-level f/s, and thus has to simulate and
act as both.  So whatever behavior your VFS has before it calls a VOP_*
must be simulated accurately inside the stackable f/s before it calls
the lower one.  That is easier to achieve when vnode/vfs functions are
small, simple, and always behave the same.

So I would say that if you think splitting VOP_RELEASEND in two would
make things simpler, go for it here and everywhere else.  The lesson
learned from the Linux vfs's (rapid :-) evolution is a good one: after
adding more and more inode/file/dentry/super_block functions, and
making them relatively small and simple, they found ways to push some
of that functionality up to the VFS.

[...]
> Actually, I am reasonably confident that we can do the fixes without
> impacting performance noticeably.

That's great!

[...]
> Forward view: I'm undecided on the next step.  Possibilities:
> (1) Change the way locking is specified to make it feasible to test
>     locking patches properly, and change the assertion generation to
>     generate better assertions.  This will probably require changing

I'm not sure I understand what you mean by assertion generation.
>     VOP_ISLOCKED() to be able to take a process parameter, and return
>     different values based on whether an exclusive lock is held by
>     that process or by another process.  The present behaviour will be
>     available by passing NULL for this parameter.
>
>     Presently, running multiple processes does not work properly, as
>     the assertions do not really assert the right things.
>
>     These changes are necessary to properly debug the use of locks,
>     which I again believe is necessary for stacking layers (which I
>     would like to work in 4.0, but I don't know if I will be able to
>     have ready).

Locks are probably one of the most frustrating things I've had to deal
with, because you're rarely told whether the objects passed to you are
already locked or allocated, whether their reference counts have been
updated, and what, if anything, you are expected to do about any of
these.  FreeBSD is very good about documenting most of these
conventions in the vnode_if.src file, but Solaris and Linux are not.
I've had to implement a strict un/locking order in my wrapfs templates
to avoid deadlocks.  Some of that code is so hairy that I dread each
time the (Linux) vfs changes and I've got to touch my locking code;
that's a sure way to waste several days debugging.

Deciding on proper locking is difficult.  In Linux, for example, they
had most locking done in the VFS; that sounds great at first, because
f/s code doesn't have to worry about locking objects.  But they found
out that to get better SMP performance, each f/s would have to do its
own locking, and so they pushed some of the locking down to be the
f/s's responsibility.  Locking seems to be stuff that happens all over:
part in the VFS, part in the VM/buffer cache, and part inside file
systems.  Is there a way to make locking an explicit part of the vnode
interface?  Is there a way to keep locking in the VFS by default (for
simplicity), but allow those f/s that want to manage their own locks to
do so?  How messy and maintainable would such code be?
I guess what I'm arguing for is interface flexibility, so we don't have
to revise the interface again any time soon.

Eivind, if you haven't recently, I suggest you look at some of the
stacking papers (Rosenthal's UI paper, Heidemann, Popek, Skinner/Wong,
etc.).  Rosenthal's "requirements" paper succinctly described several
important issues, including atomicity of multi-vnode operations.
Rosenthal suggested that kernels should have a full transaction engine,
which I think is eventually necessary, but it's very complex to
implement.  The next best thing is to do some form of safe locking.

Normally each vnode/inode has its own lock.  Imagine a replicated
stackable f/s (replicfs) with a fan-out of 3, so a vnode (V0) at the
replicfs level would have access to three lower vnodes (V1, V2, V3).
If you want to make a change (say, create a file) in V0, you have to
lock V0-V3 at once.  Without vfs support for this, replicfs would have
to enforce ordered locking (such as I've done in wrapfs) and hope for
the best.  If the vfs is smarter, it can help replicfs lock all 4
vnodes at once; or the vfs can allow replicfs to control the locks
below it, and all the vfs has to do is ensure that no one else can lock
V1-V3.

I don't have a good answer to this locking issue.  The papers I've
cited describe changes to the vnode interface that simplify locking.
One way they do that is to have only one lock per chain (or stack, or
DAG) of stacked file systems.  So, for example, a DAG of stackable f/s
is represented by one data structure that contains locks and other
things that are true about the whole DAG, and then smaller data
structures for each node/leaf of the DAG, containing stuff that's true
about that vnode (e.g., its operations vector).

> (2) Change the behaviour of VOP_LOOKUP() to "eat as much as you can,
>     and return how much that was" rather than "Eat a single path
>     component; we have already decided what this is."
>     This allows different types of namespaces, and it allows
>     optimizations in VOP_LOOKUP() when several steps in the traversal
>     are inside a single filesystem (and hey - who mounts a
>     new filesystem on every directory they see, anyway?)
>
>     This change is rather small, and it would be nice to have in 4.0
>     (I want the VFS differences from 4.0 to 5.0 to be as small as
>     possible).
>     It is pretty orthogonal to stacking layers; stacking layers gain
>     the same capabilities as other file systems from it.

Multi-component lookup has always been desirable.  There's a paper by
Duchamp (USENIX '94) on multi-component lookup in NFS.  I think we
should allow for multi-component lookup as well as the old-style "one
component at a time" lookup; I would argue that the default should
still be the old style.  Someone might want to write a stackable f/s
that does special things as it traverses each component of the
pathname.  For example, a general-purpose unionfs (one which uses
fan-out, unlike the single-stack design in 4.4BSD) might follow into
different underlying directories as it looks up single components;
unionfs has all kinds of interesting semantic issues that would require
more flexibility at lookup time.

Lookup is fairly complex as it is.  If you're going to add
multi-component lookup, then maybe it should be a new vop?  If not a
new vop, then make sure it's added to the current vop_lookup such that
a f/s has enough flexibility to control the type of lookup it wants.
Also, it would be nice if the type of lookup used could be controlled
dynamically by the f/s itself (as opposed to, say, a mount() flag that
sets the lookup type for the duration of the mount).

> Eivind.

Cheers,
Erez.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message