Date: Sat, 23 Jun 2001 13:27:50 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: j mckitrick <jcm@freebsd-uk.eu.org>
Cc: Terry Lambert <tlambert@primenet.com>, freebsd-chat@freebsd.org
Subject: Re: most complex code in BSD?
Message-ID: <3B34FBC6.535C799B@mindspring.com>
References: <20010622221453.B64495@dogma.freebsd-uk.eu.org> <200106222143.OAA28673@usr06.primenet.com> <20010623133638.A84446@dogma.freebsd-uk.eu.org>

j mckitrick wrote:
> This sounds pretty much to me like a vector table, or even
> thunking. How is it different? Also, if fsfunc49 (Joe
> Walsh reference ;-) exists but the fs doesn't know about
> it, how will it still work? Do you simply leave an empty
> stub that returns without doing anything, or do you direct
> it to a corresponding set of older calls that accomplish
> the same thing?

I didn't tell you the entire story, since this was about
descriptor-based call interfaces, not about the VFS code in
particular, except as an aside about how to understand what
people view as the most complicated part of it.

Note: PHK broke this somewhat when he added "default VOPs";
the intent is that you inherit upward to failure, not to
some default for which there exists a real implementation.
It will still work, but realize that there are some hard
stops missing, where the system will claim an implementation
exists, when it should actually fail.

---

The easiest place to see this is the "ufs" entry points in
the ffs_vnodeop_entries[] table.

What I didn't tell you is that the vnode stacking works like
this:

    [ vnode ] -> ,---------.
                 | private |
                 |  data   | -> [ vnode ] -> ,---------.
                 `---------'                 | private |
                                             |  data   |
                                             `---------'

There is a vnodeop_entries vector pointed to by each vnode;
there is a vnode per stacking layer (or per stacking layer
with implementation semantics, if the vectors are properly
collapsed to ensure maximum efficiency: collapse would be
through upward inheritance; more on this later).

This entry vector is created at mount time, so if you have an
FS stacked on top of an FS, you have (simplified) for each FS
an entry for:

    read
    write
    stat
    readdir
    unlink
    rename
    etc.
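
As a rough sketch of the above (not the actual FreeBSD structures):
each layer carries an entry vector with one slot per operation,
every operation takes a single opaque descriptor pointer, and
collapsing picks the topmost implementation for each slot. The names
enum vop, struct layer, vop_fail(), collapse(), build_example() and
the fs1/fs2/fs3 stubs are all invented for illustration, and ENOENT
is assumed as the failure value for a slot that nobody implements.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical operation slots; a real VFS has many more. */
enum vop {
    VOP_READ, VOP_WRITE, VOP_STAT, VOP_READDIR, VOP_UNLINK, VOP_RENAME,
    VOP_COUNT
};

/* Every operation takes one opaque descriptor pointer. */
typedef int (*vop_fn)(void *desc);

/* One entry vector per layer; NULL means "not implemented here". */
struct layer {
    const char *name;
    vop_fn      ops[VOP_COUNT];
};

/* Assumed failure value for a slot that no layer implements. */
static int vop_fail(void *desc) { (void)desc; return ENOENT; }

/* Stub implementations, for illustration only. */
static int fs1_read(void *desc)    { (void)desc; return 0; }
static int fs1_write(void *desc)   { (void)desc; return 0; }
static int fs2_stat(void *desc)    { (void)desc; return 0; }
static int fs3_readdir(void *desc) { (void)desc; return 0; }
static int fs3_unlink(void *desc)  { (void)desc; return 0; }

/*
 * Fill out[] with the topmost implementation of each slot.
 * layers[0] is the top of the stack.
 */
void collapse(const struct layer *const *layers, size_t n,
              vop_fn out[VOP_COUNT])
{
    assert(layers != NULL && out != NULL);
    for (int op = 0; op < VOP_COUNT; op++) {
        out[op] = vop_fail;                 /* default: fail */
        for (size_t i = 0; i < n; i++) {
            if (layers[i]->ops[op] != NULL) {
                out[op] = layers[i]->ops[op];
                break;                      /* topmost one wins */
            }
        }
    }
}

/* Build a made-up three-layer stack and collapse it into out[]. */
void build_example(vop_fn out[VOP_COUNT])
{
    static const struct layer fs1 = {
        .name = "FS1",
        .ops  = { [VOP_READ] = fs1_read, [VOP_WRITE] = fs1_write },
    };
    static const struct layer fs2 = {
        .name = "FS2",
        .ops  = { [VOP_STAT] = fs2_stat },
    };
    static const struct layer fs3 = {
        .name = "FS3",
        .ops  = { [VOP_READDIR] = fs3_readdir, [VOP_UNLINK] = fs3_unlink },
    };
    const struct layer *const stack[] = { &fs1, &fs2, &fs3 };

    collapse(stack, sizeof(stack) / sizeof(stack[0]), out);
}
```

Calling build_example(out) leaves out[VOP_READ] pointing at fs1_read
and out[VOP_RENAME] at the failing stub, since no layer in this
made-up stack implements rename.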

When you stack, you get an implied "ENOENT", so:

    Default  ENOENT  ENOENT  ENOENT  ENOENT   ENOENT  ENOENT
    FS1      read    write   -       -        -       -
    FS2      -       -       stat    -        -       -
    FS3      -       -       -       readdir  unlink  -
    Result   read    write   stat    readdir  unlink  ENOENT

The "Result" FS is created by creating a collapsed vector made
up of the topmost implementation you come to.

Now say you had:

    Default  ENOENT  ENOENT  ENOENT  ENOENT   ENOENT  ENOENT
    FS1      read    write   -       -        -       -
    FS2      -       -       stat    -        -       -
    FS3      read    write   stat    readdir  unlink  -
    Result   read    write   stat    readdir  unlink  ENOENT

How does this work?

The answer is, it's still the topmost implementations: the
collapsed vector contains exactly the same things -- read and
write from FS1, stat from FS2, readdir and unlink from FS3.

The difference is that the FS1 read and write must be
implemented by calling a read and write in an underlying layer:
they are not permitted to do the disk I/O directly. Similarly,
the FS2 stat must be implemented by calling a lower layer stat.

In this case, FS1 and FS2 are _stacking VFS layers_, and FS3 is
a _local media VFS layer_. The difference is that FS3 can't
stack on top of another FS.

The FreeBSD code pre-stacks FFS on top of UFS, and that's the
default FFS vnodeopv_entry_desc named ffs_vnodeop_entries[].

The layer assembly happens when a VFS stack is instanced, which
happens at mount time. The layers are set up via a call to
vfs_register() (permit me an oversimplification).

The main magic lies in the fact that if a layer doesn't know
about an entry point, it doesn't touch it, it just passes the
descriptor through unmodified to a lower layer, until someone
does know about it, or until it hits the ENOENT layer, which
doesn't pass anything through.

This is why, in the second example, I can call the FS1 read,
and it can call the lower layer read, and the FS2 read will
just pass it through to the FS3 read, without causing an error.

Note: It's possible to optimize this by instancing an FS2+FS3
collapsed vector.
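
The pass-through and the collapse mentioned in the note can be
sketched as follows. This is a toy model under stated assumptions,
not the kernel's implementation: struct vop_desc, struct layer,
vop_call(), stack_on(), skip_null_layers() and demo() are
hypothetical names, the hops counter exists only to make the cost
visible, and the per-slot next pointer is one assumed way to let a
call skip layers that do not implement the operation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum { VOP_READ, VOP_STAT, VOP_RENAME, VOP_COUNT };

/* The single argument every operation receives. */
struct vop_desc {
    int    op;    /* which operation this is */
    size_t size;  /* descriptor size, so it could be proxied */
    int    hops;  /* layers visited; illustration only */
};

struct layer;
typedef int (*vop_fn)(struct layer *self, struct vop_desc *d);

struct layer {
    const char   *name;
    vop_fn        ops[VOP_COUNT];   /* NULL: layer doesn't know this op */
    struct layer *next[VOP_COUNT];  /* where each op goes next; NULL: hard stop */
};

/* Dispatch by descriptor; unknown ops are passed down untouched. */
static int vop_call(struct layer *l, struct vop_desc *d)
{
    if (l == NULL)
        return ENOENT;                       /* fell off the bottom */
    assert(d->op >= 0 && d->op < VOP_COUNT);
    d->hops++;
    if (l->ops[d->op] == NULL)
        return vop_call(l->next[d->op], d);  /* pass-through */
    return l->ops[d->op](l, d);
}

/* A stacking layer does its work by calling the layer below. */
static int stacking_op(struct layer *self, struct vop_desc *d)
{
    return vop_call(self->next[d->op], d);
}

/* A local media layer ends the chain. */
static int media_op(struct layer *self, struct vop_desc *d)
{
    (void)self; (void)d;
    return 0;
}

/* Uncollapsed stacking: every op goes to the layer directly below. */
static void stack_on(struct layer *upper, struct layer *lower)
{
    for (int op = 0; op < VOP_COUNT; op++)
        upper->next[op] = lower;
}

/* Collapse: point each op past lower layers that don't implement it. */
static void skip_null_layers(struct layer *l)
{
    for (int op = 0; op < VOP_COUNT; op++) {
        struct layer *n = l->next[op];
        while (n != NULL && n->ops[op] == NULL)
            n = n->next[op];
        l->next[op] = n;
    }
}

/* Build FS1 -> FS2 -> FS3, issue one op, report layers visited. */
int demo(int collapse, int op, int *hops)
{
    struct layer fs3 = { .name = "FS3",
        .ops = { [VOP_READ] = media_op, [VOP_STAT] = media_op } };
    struct layer fs2 = { .name = "FS2",
        .ops = { [VOP_STAT] = stacking_op } };
    struct layer fs1 = { .name = "FS1",
        .ops = { [VOP_READ] = stacking_op } };
    struct vop_desc d = { .op = op, .size = sizeof(struct vop_desc), .hops = 0 };
    int error;

    stack_on(&fs2, &fs3);
    stack_on(&fs1, &fs2);
    if (collapse) {
        skip_null_layers(&fs2);   /* bottom up, so FS1 sees final links */
        skip_null_layers(&fs1);
    }
    error = vop_call(&fs1, &d);
    *hops = d.hops;
    return error;
}
```

With the stack built in demo(), a read visits three layers
uncollapsed (FS1, the FS2 pass-through, FS3) and two after
skip_null_layers(); an operation no layer knows falls off the
bottom and gets ENOENT either way.
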
Such a collapsed vector would make the FS1 call go directly to
FS3, skipping a "null" VOP passdown. FreeBSD doesn't do this,
currently, so a "nullfs" will actually end up adding overhead
and vnode pointer (vp) traversals, and allocating vnodes at the
null layer for encapsulation of the requests to push them down
to the next layer. Very inefficient. 8-(.

So to recap:

1)  There is a soft version of a coalesced vector stack, which
    lets the consumer see only the top level entry point for
    each function.

2)  Things are passed by descriptor so that they are a single
    (usually "void") pointer, so that even though an
    intermediate layer doesn't know about the operation, it can
    still pass it on; without this, it would have to know about
    the arguments for the unknown function so that it could pop
    them off its call stack and push them back on the stack to
    call the underlying layer. This can't work.

    2a) A size and type are included, so that descriptors can
        be proxied. This means they can be passed over a single
        network pipe, or passed to user space, to let people
        develop VFS stacking layers there, or proxied
        elsewhere.

3)  There is a "hard stop" layer at the bottom.

4)  There is a difference between "local media" FS's and "VFS
    stacking layer" FS's: a "local media" FS is very much the
    same as a "hard stop" layer, in that the calls can not be
    pushed down further.

5)  The top level represents a consumer (in FreeBSD, there are
    two consumers: the system call interface, and the NFS
    server code).

6)  There are a lot of optimizations that could be done that
    FreeBSD isn't doing.

    6a) Interior vector collapse, to get rid of null VOPs in
        intermediate layers, reduce call overhead, and reduce
        resource wastage (e.g. vnodes per layer).

    6b) List vector sorting of all known VOPs. This would
        permit instantiated stacks to save one lookup, one
        dereference, and one function call overhead, per call.

    6c) Optimized direct vector handoff.
        This would let a stacking or implementation layer make
        a VOP call itself, without the descriptor unpack and
        repack that's currently necessary.

It's all a pretty trivial assembly of data structures, if you
understand the linkages. So it's not hard to understand, it's
just not often explained properly for students.

Mostly, students are expected to read the FICUS papers, and in
particular, John Heidemann's Master's Thesis on VFS stacking
architecture in FICUS: this is the same code that he and UCLA
donated to CSRG, and which became the VFS stacking in 4.4BSD,
and thus FreeBSD.

-- Terry

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message