Date: Thu, 8 Aug 1996 14:48:28 -0700 (MST)
From: Terry Lambert <terry@lambert.org>
To: dfr@render.com (Doug Rabson)
Cc: terry@lambert.org, michaelh@cet.co.jp, jkh@time.cdrom.com, tony@fit.qut.edu.au, freebsd-fs@freebsd.org
Subject: Re: NFS Diskless Dispare...
Message-ID: <199608082148.OAA17616@phaeton.artisoft.com>
In-Reply-To: <Pine.BSI.3.95.960808173810.10082U-100000@minnow.render.com> from "Doug Rabson" at Aug 8, 96 06:47:44 pm
> > Conceptually, it's pretty trivial to support; it's not supported
> > because the stacking is not correctly implemented for these FS's.
> > Look at the /sys/miscfs/nullfs use of VOP_VGET.
>
> VFS_VGET is not implemented in NFS because the concept just doesn't
> apply.  VFS_VGET is only relevant for local filesystems.  NFS does
> have a flat namespace in terms of filehandles, but not one which you
> could squeeze into the VFS_VGET interface.

The flat name space is the nfsnodes, not the file handles.

In the NFS case, you would simply *not* implement recovery without
reallocation.  The allocation time is small compared to the wire time,
and the actions could be interleaved by assuming a success response,
with an additional dealloc overhead for the failure case.

> > > You can't do this for NFS.  If you use exclusive locks in NFS and
> > > a server dies, you can easily end up holding onto a lock for the
> > > root vnode until the server reboots.  To make it work for NFS, you
> > > would have to make the lock interruptible, which forces you to fix
> > > code which does not check the error return from VOP_LOCK all over
> > > the place.
> >
> > This is one of the "flags" fields, and it only applies to the NFS
> > client code.  Actually, since the nfsnode is not transiently
> > destroyed as a result of a server reboot (statelessness *is* a win,
> > no matter what the RFS advocates would have you believe), there
> > isn't a problem with holding the reference.
>
> So the NFS code would degrade the exclusive lock back to a shared
> lock?  Hmm.  I don't think that would work, since you can't get the
> exclusive lock until all the shared lockers release their locks.

You would unhold the lock and set it to be reasserted pending
availability, i.e. pending the rpc.mount negotiation succeeding.  Do
this by setting up a fake sleep address.  The trade-off is between
blocking a process (which you will have to do anyway) and hanging the
kernel.

The locks are local.  The only possible race condition is local
stacking on top of the NFS on the client side.  You can either not
allow it, or you can accept the fact that someone might win the
thundering-herd race (in which case you just get delayed a bit), or you
can FIFO the request list with an array and a request entrancy limit
when the array is full, where you degrade to a thundering herd to get
into the FIFO list.

It's unlikely that someone will be running hundreds of processes from
an NFS server that crashes, and care who gets their page requests
satisfied first.  The delay from misordering is going to be *nothing*
compared with the delay for a network resource which is unavailable
long enough to have the request list fill up.

I think it's a non-problem to unwind the state, and the collision
avoidance is well worth the worst case being slightly degraded.
Currently in BSD and SunOS, if the server can't satisfy a page request
from one local process, it blocks and the whole system goes to hell.
This way, only the processes which are relying on the unreliable
resource go to hell.

Even so, I still vote for flagging the NFS mount to force a copy to
swap of any file being used as swap store from an unreliable server.
It's a better long-term solution anyway.
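As a rough illustration of that bounded FIFO idea, here is a userland
sketch, with pthreads standing in for kernel sleep/wakeup; every name
in it is invented for the example and initialization is omitted.  The
first REQ_LIMIT waiters queue in arrival order, each on its own sleep
address; once the array is full, latecomers degrade to a thundering
herd and race for slots as they free up.

/*
 * Sketch only: bounded FIFO of waiters with thundering-herd overflow.
 */
#include <pthread.h>
#include <stdbool.h>

#define REQ_LIMIT 64                     /* request entrancy limit */

struct waitfifo {
	pthread_mutex_t lock;
	pthread_cond_t  slot[REQ_LIMIT]; /* one sleep address per slot */
	pthread_cond_t  herd;            /* overflow waiters sleep here */
	int             head, tail, count;
	bool            available;       /* e.g. server reachable again */
};

/* Block until the resource comes back; FIFO for the first REQ_LIMIT. */
void
waitfifo_wait(struct waitfifo *wf)
{
	int mine;

	pthread_mutex_lock(&wf->lock);
	while (!wf->available && wf->count == REQ_LIMIT) {
		/* Array full: thundering herd to get into the FIFO. */
		pthread_cond_wait(&wf->herd, &wf->lock);
	}
	if (!wf->available) {
		mine = wf->tail;
		wf->tail = (wf->tail + 1) % REQ_LIMIT;
		wf->count++;
		while (!wf->available || wf->head != mine)
			pthread_cond_wait(&wf->slot[mine], &wf->lock);
		wf->head = (wf->head + 1) % REQ_LIMIT;
		wf->count--;
		pthread_cond_signal(&wf->slot[wf->head]); /* next in line */
		pthread_cond_broadcast(&wf->herd);        /* a slot freed */
	}
	pthread_mutex_unlock(&wf->lock);
}

/* Called when, say, the remount negotiation finally succeeds. */
void
waitfifo_release(struct waitfifo *wf)
{
	pthread_mutex_lock(&wf->lock);
	wf->available = true;
	pthread_cond_signal(&wf->slot[wf->head]); /* wake the FIFO head */
	pthread_cond_broadcast(&wf->herd);        /* and the overflow */
	pthread_mutex_unlock(&wf->lock);
}

The worst case is exactly the degradation described above: only the
overflow waiters race, and only while the list is full.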
> > One of the things Sun recommends is not making the mounts on mount
> > points in the root directory, to avoid exactly this scenario (it
> > really doesn't matter in the diskless/dataless case, since you will
> > hang on swap or page-in from image-file-as-swap-store anyway).
>
> It doesn't matter if they are on mount points in root.  If a lock is
> stuck in a sub-filesystem, then the 'sticking' can propagate across
> the mount point.

Well, yes, I suppose.  There are better ways to fix that; specifically,
lock the node that is covered before you lock the covering node in a
mount point traversal.  The issue is resolved locally after the second
process waiting for the node, without propagating up past the mount
point.

I'm more concerned with interaction between multiple mounts of CDROMs
on a changer device.  It's more likely, if you ask me.

Nevertheless, if you are running something off an NFS server, and it
can't run, then it can't run.  FreeBSD is no less graceful about that
than any commercial OS.

> > The root does not need to be locked for the node lookup for the root
> > for a covering node in any case; this is an error in the "node x
> > covers node y" case in the lookup code.  You can see that the lookup
> > code documents a race where it frees and relocks the parent node to
> > avoid exactly this scenario, actually.  A lock does not need to be
> > held in the lookup for the parent in the NFS lookup case for the
> > mount point traversal.  I believe this is an error in the current
> > code.
>
> Have to think about this some more.  Are you saying that when lookup
> is crossing a mountpoint, it does not need any locks in the parent
> filesystem?

It needs locks on the covered node, but it does not need to propagate
the collision to root.  The only case where this fails is when / is NFS
mounted and the server goes down.  You have worse problems at that
point, and hanging until the server comes back up is most likely the
right thing to do in that case anyway.

> > The easiest way to think of this is in terms of provider interfaces
> > and consumer interfaces.  There are many FS provider interfaces.  The
> > FS consumer interfaces are the syscall layer (the vfs_subr.c) and the
> > NFS client.  This goes hand in hand with the discussion we had about
>       ^^^^^^
> Do you mean NFS server here?

Yes, thanks; sorry about that.

> > the VOP_READDIR interface needing to be split into "get
> > buffer/reference buffer element" (remember the conversation about
> > killing off the cookie interface about a year ago?!?!).
>
> I remember that.  I think I ended up agreeing with you about it.  The
> details are a bit vague..

I saved them; I can forward them if need be.

The details were vague because I wanted an interface that let me tell
it what I wanted back, but a struct direct-only return would be
acceptable for an interim implementation.  That's one interface that
could be broken up without too much trouble.
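To make the split concrete, here is a declarations-only sketch.  All of
the names and signatures are invented for illustration; this is not the
existing VOP_READDIR interface or any actual patch.  The point is that
the FS hands back its entries in a neutral form and the consumer (the
syscall layer or the NFS server) does its own encoding, so the cookie
workaround disappears.

/*
 * Sketch only: "get buffer / reference buffer element" directory read.
 */
#include <stddef.h>

struct vnode;                   /* opaque here */
struct dirbuf;                  /* FS-private block of raw entries */

struct dirent_view {            /* one entry, in a neutral form */
	const char     *name;   /* component in the FS's native namespace */
	size_t          namelen;
	unsigned long   fileid;
	unsigned long   offset; /* where the entry lives, for resume */
};

/* Fill an FS-private buffer with entries starting at *offp, and
 * advance *offp to the resume point for the next call. */
int	vop_dirbuf_get(struct vnode *dvp, unsigned long *offp,
	    struct dirbuf **dbpp);

/* Reference the n-th entry of the buffer without copying it out. */
int	vop_dirbuf_entry(struct dirbuf *dbp, int n, struct dirent_view *dv);

/* Release the buffer when the consumer is done encoding. */
void	vop_dirbuf_rele(struct dirbuf *dbp);

A consumer that wants struct direct output, NFS XDR output, or a
Unicode name representation just encodes from the neutral form itself.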
[ ... ]

> > Currently collapse is not implemented.  Part of the support for
> > collapse without full kernel recompilation on VOP addition was the
> > 0->1 FS instance count changes to the vfs_init.c code and the
> > addition of the structure sizing field in the vnode_if.c generation
> > in my big patch set (where the generated vnode_if.c had the
> > vfs_op_descs structure size computed in the vnode_if.c file).  The
> > change did not simply allow the transition from 0->N loadable FS's
> > (part of the necessary work for discardable fallback drivers for the
> > FS, assuming kernel paging at some point in the future), and it did
> > not just allow you to add VFS OPS to the vnode_if without having to
> > recompile all FS modules and LKM's (its stated intent).  The change
> > also allows (with the inclusion of a structure sort, since the init
> > causes a structure copy anyway to get it into a stack instantiation)
> > the simplification of the vnode_if call to eliminate the
> > intermediate function call stub: a necessary step towards call graph
> > collapse.  You want this so that if you have 10 FS layers in a
> > stack, you only have to call one or two veto functions out of the
> > 10... and if they are all NULL, the one is synthetic anyway.
>
> This is interesting.  It is similar to the internal driver
> architecture we use in our 3D graphics system (was Reality Lab, now
> Microsoft's Direct3D).  The driver is split up into different modules
> depending on functionality.  The consumer (Direct3D) has a stack which
> it pushes driver modules onto for all the required functionality.
> This used to be useful for reconfiguring the stack at runtime to
> select different rendering algorithms etc.  Direct3D broke that
> unfortunately, but that is another story.

I cheated for this; the two "competing" vnode stacking architectures
are Heidemann's (the one we are using) and Rosenthal's (which lost
out).  Rosenthal alludes to stack collapse in his Usenix paper on "A
file system stacking architecture".

The general problem with Rosenthal's approach is the same problem
Novell was having in their "Advanced File System Design": personal
views.  A personal view allows the FS to have a canonical form, and
each user can choose his view on the FS.  The problem with this is the
same problem Windows95 has now with desktop themes: support is
impossible.  Imagine the user who is told to "drag that icon to the
wastebasket to fix the problem"... he may have a beaker of acid, or a
black hole, or a trash compactor, or whatever... there are no Schelling
points in common that the user and the technical support person agree
on so that they can communicate effectively.

You can steal the personal-view idea of a canonical form for a
directory structure by specifying a canonicalization name space for
files regardless of their name in the real name space.  This was the
basis of some of my internationalization work about two years ago (the
numeric name space suggestion that allowed you to rename system
critical files like /etc/passwd to Japanese equivalents and have NIS
and login keep working).

Rosenthal needed stack collapse to reduce the memory requirements per
view instance so he could have views at all.

> It communicates with the drivers by sending service calls to the top
> driver in the stack.  Each service call has a well defined number.  If
> that module understands the service, it implements it and returns a
> result.  Otherwise, it passes the service call down to the next driver
> in the stack.  Some modules override service calls in lower layers and
> they typically do their own work and then pass the service onto the
> next layer in the stack.

Yes.  This is exactly how the Heidemann thesis wants the VFS stacking
to work.  It fails because of the way the integration into the Lite
code occurred in a rush as a result of the USL lawsuit and settlement.
Specifically, there's no concept of adding a new VFS OP without
rebuilding FFS (which is used to get the maximum number of VFS OPs
allowed, in the current FreeBSD/NetBSD code).
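A toy rendering of that pass-down model, with invented names rather
than the real 4.4BSD vnode_if declarations (the actual code dispatches
through generated stubs and a bypass routine): each layer either claims
an operation or hands it to the layer below.

/*
 * Sketch only: per-layer op vectors with "pass it down" defaults.
 */
#include <stddef.h>
#include <errno.h>

#define NOPS 64                         /* size of the op vector */

typedef int (*layer_op_t)(void *layer_data, void *op_args);

struct layer {
	const layer_op_t *ops;          /* this layer's vector */
	void             *data;         /* its private per-vnode data */
	struct layer     *lower;        /* next layer down the stack */
};

int
layer_call(struct layer *lp, int opno, void *op_args)
{
	if (opno < 0 || opno >= NOPS)
		return (EINVAL);

	/* Walk down the stack until some layer claims the operation. */
	for (; lp != NULL; lp = lp->lower) {
		if (lp->ops[opno] != NULL)
			return ((*lp->ops[opno])(lp->data, op_args));
	}
	return (EOPNOTSUPP);            /* nothing below implements it */
}

With ten layers and mostly-NULL vectors, every call still walks the
whole chain; collapsing that walk into a single table per stack is what
the structure sizing work above is aimed at.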
> To optimise the system, we added a service call table in the stack
> head.  When a module is pushed onto the stack, it is called to 'bid'
> some of its services into the service call table.  Each module in
> turn going up the stack puts a function pointer into the table for
> each of the services it wants to implement.  If it is overriding a
> lower module, it just overwrites the pointer.

This is not quite the same inheritance model.  Basically, you still
need to be able to call each inferior layer.  Consider the unionfs that
unions two NFS mounts.  Any fan-in/fan-out layer must be non-null.

> If you add service calls, nothing needs to recompile (as long as the
> service call table is large enough) because the new services just go
> after the existing ones.

Yes.  The vnode_if.c structure sizing is how I ensured the table was
large enough: that's where the table is defined, so it should be where
the size is defined.  The use of the FFS table to do the sizing in the
init was what broke the ability to add service calls dynamically in the
Heidemann code as integrated into 4.4Lite.
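A toy version of that bidding scheme, to make the collapse concrete.
The types and names are invented, not the vnode_if code; the point is
that the dispatch table lives in, and is sized by, the stack head
rather than any one filesystem's op vector.

/*
 * Sketch only: layers bid services into a collapsed stack-head table.
 */
#include <stddef.h>

typedef int (*svc_fn)(void *layer_data, void *op_args);

struct layer_desc {
	void          *data;            /* the layer's private data */
	const svc_fn  *ops;             /* its vector; NULL = no bid */
	size_t         nops;            /* how many ops it knows about */
};

struct stack_head {
	svc_fn   *table;                /* collapsed dispatch table */
	void    **owner;                /* whose private data to pass */
	size_t    nops;                 /* sized from the op descriptor
					   list, not from any one FS */
};

/*
 * Push a layer: it bids its non-NULL services into the head table,
 * overriding whatever a lower layer installed in the same slot.  A
 * fan-in/fan-out layer (a unionfs, say) has to bid for every slot so
 * that no call can bypass it.
 */
void
stack_push(struct stack_head *sh, const struct layer_desc *ld)
{
	size_t i;

	for (i = 0; i < ld->nops && i < sh->nops; i++) {
		if (ld->ops[i] != NULL) {
			sh->table[i] = ld->ops[i];
			sh->owner[i] = ld->data;
		}
	}
}

/* A collapsed call goes straight to the highest bidder for the op. */
int
stack_call(const struct stack_head *sh, size_t opno, void *op_args)
{
	if (opno >= sh->nops || sh->table[opno] == NULL)
		return (-1);            /* no layer implements this op */
	return ((*sh->table[opno])(sh->owner[opno], op_args));
}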
> > This is a big win in reducing the current code duplication, which
> > you want to do not only to reduce code size, but to make FS's more
> > robust.  The common behaviours of FS's *should* be implemented in
> > common code.
>
> Agreed.  The worst candidates at the moment are VOP_LOOKUP, VOP_RENAME
> and the general locking/releasing protocol, IMHO.

The lookup is more difficult because of the way directory management is
dependent on file management.  It's not possible to remove the FFS
directory management code and replace it with an override with the
current code arrangement.  Specifically, I can't replace the FFS
directory structure code with a btree (for instance).

The lookup path buffer deallocation patches, pushing the deallocation
up into the consumer interface where the allocation took place, were a
move toward severability of the directory interface.  They had a side
effect of moving toward the ability to support multiple name spaces in
FS's that require it (VFAT/NTFS/UMSDOS/NETWARE/HFS), and of abstracting
the component representation type (for Unicode support and more
internationalization).  This doesn't resolve the separability problem
of the directory code, but it goes a long way toward freeing up the
dependencies to allow incremental changes.

I seriously dislike the relookup for the rename code, and think that it
needs to be rethought.  But separability was a necessary first step.

> > The Lite2 code recognizes this at the VOP_LOCK level in a primitive
> > fashion by introducing the lockmgr() call, but the model is not
> > uniformly applied, and deadly-embrace or two-caller starvation
> > deadlocks can still occur in the Lite2 model.  Going to the next
> > step, a veto model, both increases the code robustness considerably
> > and resolves the state wind/unwind problems inherent in fan out.
> > The fan out problem is *the* problem with the unionfs, at this
> > point.
>
> Well at the moment, I think we have to just grit our teeth and merge
> in the lite2 code as it stands.  We have to at least try to converge
> with the other strains of 4.4, if only to try and share the load of
> maintaining the filesystem code.  I strongly believe that there should
> be a consensus between the different 4.4 groups over FS development or
> we just end up with chaos.

The Lite2 code merge *needs* to take place.  I need to spend more time
on it now that I'm good for work plus an hour or so a day of sitting.
I'll subscribe to that list pretty soon now.

As to maintenance and design... well, I think we have a problem no
matter what we do.  The Heidemann thesis and the other FICUS documents
are *the* design documents, IMO.  The problem is that the current code
in the 4.4 camps does not conform to the design documents.  I think
that, no matter what, that needs to be corrected.

Then there are issues of kludges for the interface design, or for
missing technology pieces that simply have not been considered in the
4.4 code.  The biggest kludge is that there is no documented bottom-end
interface; we already have an unresolvable discrepancy because of the
VM differences.  The second biggest kludge is the workaround for the
directory structure size differences... the origin of the "cookie"
crap in the VOP_READDIR interface.  NetBSD and FreeBSD solved this
problem in a bad way, and are in fact not interoperable at this point
because of that.  Finally, there's the fact that 4.4 as shipped didn't
support kernel module loading of any kind, and so there was no effort
to limit the recompilation necessary for adding VOP's in the default
vfs_init, the vfs_fs_init, or in the vnode_if method generation code.

Short of starting up CSRG again, I don't see a common source for
solving the Lite/Lite2 VFS problems.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
