From: Terry Lambert
Subject: Re: questions/problems with vm_fault() in Stable
To: dillon@apollo.backplane.com (Matthew Dillon)
Date: Fri, 8 Jan 1999 02:48:50 +0000 (GMT)
Cc: tlambert@primenet.com, dyson@iquest.net, pfgiffun@bachue.usc.unal.edu.co, freebsd-hackers@FreeBSD.ORG
In-Reply-To: <199901072306.PAA35328@apollo.backplane.com> from "Matthew Dillon" at Jan 7, 99 03:06:21 pm

Now we deal with collapsing of layers:

> :would be another example, where multiple NULLFS instances collapsed
> :to *no* local vnode definitions, and one call boundary.  Instead,
> :you are suggesting that we instance vnodes in each NULLFS layer,
>
> You are assuming that these things are collapsable, but very *few*
> VFS layers are actually collapseable.  For example, there is no way
> you could possibly collapse RAID or encryption layer or a mirroring
> mid-layer.
> You can't collapse an MFS layer that is file-backed.
> You can't collapse a mirror.  You can't collapse a VN device due
> to partition translations.  In all cases the intermediate layers
> can be independantly accessed and, in fact, it is *desireable* to
> have the ability to independantly access them.

The intent of using non-vnode originating layers is twofold:

(1) It gets rid of the coherency issues we've discussed so far.

(2) It allows for layer collapse, so that the virtual code path
    ends up being much smaller than the real code path.

The first of these has been nearly discussed to death.  Suffice it to
say that coherency problems come from complexity, and not all
complexity has value in and of itself.

The second is a more interesting posit.

Consider the case where I stack 500 NULL stacking layers on top of a
mount point.  If each layer transition required a vnode translation,
this would take a very long time.

Well, what's a NULLFS?

The NULLFS is primarily a kludge to allow relocation of directories in
the filesystem hierarchy.

This may at first seem to be a useful and necessary function.  But in
fact it's a function whose utility grows out of the implementation of
directory mapping into the hierarchy in the first place in the per VFS
mount routines.

Because the mapping of a FS into the directory hierarchy is done at FS
mount time, instead of in common upper level code, there are a number
of consequences.  Among these are:

o   You have to treat the root FS mount as a special case; this is
    necessitated by the need to remount root as read/write using a
    device that may not be the same as the device provided in the
    boot procedure (it may, instead, be an alias -- of a different
    sort than the VM aliases we have previously discussed, in this
    case a device alias -- which owes more to the implementation of
    special devices as files in a "SPECFS" than it does to
    necessity).

o   There is an artificial distinction between a root mount and an
    inferior mount point mount.
    If FS's were not distinguished in this way, but instead kept in
    a global table, then a general (and therefore more reliable) set
    of routines could map from the table into the hierarchy.  This
    also means that some FS's can be mounted as inferior FS's within
    the hierarchy, but *can't* be used as the root FS.

o   In order to map anything into the directory hierarchy, it has to
    be the root of a VFS instance.  This is because to access the
    mapping mechanism, you must invoke it by way of some VFS-specific
    mount code into which it has been embedded.

So if we resolve this, where is the utility of the NULLFS?

It lies in its ability to act as a sample implementation of a minimal
semantic VFS layer.

Increasing this by requiring a vnode factory in the NULLFS, and VM
alias objects for the underlying VM objects, greatly complicates the
minimal implementation.  It also precludes layer collapse, unless it's
predicated on the idea of the "default" VOPS being, in effect, a
NULLFS themselves.

How does collapsing work?

Collapsing does *not* work, as implied in the discussion by Matt, as a
run-time short circuit.  Collapsing is intended to occur at FS mount
time.

When an FS is mounted, for every VOP in the descriptor array defined
in the structure in (incorrectly compile-time generated) vnode_if.c, a
VOP descriptor reference is instanced.

[Note: if these descriptors are sorted as well, then we can get rid of
two pointer dereferences and a lot of reformatting glue code, and
reference by array offset instead of descriptor pointer reverse
lookup].

For VOP's defined by a VFS, the descriptor is taken from the per VFS
array of descriptors.

For VOP's that *aren't* defined by a VFS, the descriptor is taken from
the underlying VFS upon which it is stacked, and so on, until it gets
to the bottom, where the VOP's that are substituted return a "not
implemented" error.

What does this mean for a stack of 500 NULLFS instances?
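
The mount-time collapse described above can be pictured with a small
C sketch.  This is only an illustration under stated assumptions, not
the FreeBSD code: `struct layer`, `layer_mount()`, `build_stack()`,
the three-entry op table and the `NOT_IMPLEMENTED` value are all
hypothetical names invented for the example.  Each layer resolves its
VOPS array once, when it is mounted, taking its own entry where it
defines one and the already-resolved entry of the layer below
otherwise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model: none of these names are real kernel API. */
enum { OP_LOOKUP, OP_READ, OP_WRITE, OP_COUNT };
#define NOT_IMPLEMENTED (-1)

typedef int (*vop_fn)(int arg);

struct layer {
    vop_fn resolved[OP_COUNT];  /* VOPS array instanced at mount time */
};

/* Substituted at the bottom of the stack for VOPs nobody defines. */
static int vop_notimpl(int arg) { (void)arg; return NOT_IMPLEMENTED; }

/* Stand-ins for the bottom-most FS's real operations. */
static int bottom_lookup(int arg) { return arg + 1; }
static int bottom_read(int arg)   { return arg * 2; }

/* The bottom FS defines lookup and read, but not write. */
const vop_fn bottom_ops[OP_COUNT] = {
    [OP_LOOKUP] = bottom_lookup,
    [OP_READ]   = bottom_read,
};

/* A null layer defines no VOPs of its own. */
const vop_fn null_ops[OP_COUNT] = { NULL };

/*
 * Resolve every VOP once, at mount time: use the layer's own entry if
 * it defines one, else inherit the lower layer's already-resolved
 * entry, else (at the bottom) substitute "not implemented".
 */
void layer_mount(struct layer *l, const vop_fn *defined,
                 const struct layer *lower)
{
    assert(l != NULL && defined != NULL);
    for (int i = 0; i < OP_COUNT; i++) {
        if (defined[i] != NULL)
            l->resolved[i] = defined[i];
        else if (lower != NULL)
            l->resolved[i] = lower->resolved[i];
        else
            l->resolved[i] = vop_notimpl;
    }
}

/* Mount one bottom FS plus n_null null layers; return the top layer.
 * The caller supplies an array of at least n_null + 1 layers. */
const struct layer *build_stack(struct layer *stack, int n_null)
{
    layer_mount(&stack[0], bottom_ops, NULL);
    for (int i = 1; i <= n_null; i++)
        layer_mount(&stack[i], null_ops, &stack[i - 1]);
    return &stack[n_null];
}
```

In this model, calling `build_stack(stack, 500)` gives a top layer
whose `resolved[OP_READ]` is the bottom FS's read routine itself, so a
call through the top of the stack is a single indexed dereference no
matter how many null layers sit in between.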
What it means is that for most VOPs (all VOPs, if the VFS architecture
weren't currently screwed up by null_bypass and some ill-considered
direct references to NULLVPTOLOWERVP), the VOP's inherit from the
bottom-most VFS!

It *also* means that the overhead in figuring this out occurs at the
time the VFS is instanced, *not* at runtime.

So what's the overhead?  499 descriptor dereferences plus 1 descriptor
dereference?

No.  1 descriptor dereference through the instanced VOPS array.

How do we address the objection:

> Introducing vnodes to the null stacking layer does not change the
> coherency problems associated with the current VFS layering one
> iota.  You are, again, assuming that the coherency issue will be
> magically solved by collapsing VFS layers and ignoring the fact
> that most VFS layers (A) can't be collapsed, and (B) that your coherency
> solution fails utterly the moment you take a network hop.

We address it by noting that most VFS layers (A) *can* be collapsed,
and (B) that the coherency issues for those that *can't* be collapsed,
like the NFS client VFS, or the OTPFS, *don't* have real aliases, only
virtual aliases.

When the collapse occurs, what happens is that the *intermediate*
no-op VOP's are collapsed out, even if they have intervening VOP's
that *can't* be collapsed out.

It is this inherent call-graph reduction which makes it worthwhile to
stack a large number of semantic layers in the first place, and which
makes it an error to introduce vnodes to layers which don't gain any
benefit from having direct VM object references and/or don't need to
support semantics for the underlying layers on a per-object basis
(even then, layers that need this, such as an ACLFS layer, can "cheat"
by file-based tunneling to get away from the requirement; this is, in
fact, what UFS does when it puts its quota file references in the in
core superblock on a per FS basis instead of in a hidden file in each
directory on a per-file basis).

> :Works on SunOS.  Works on Solaris.
> :If you have a source license,
> :or sign non-disclosure, John Heidemann will show you the code.
>
> Explain to me how it works rather then point me at three hours worth of
> research that I have to 'understand' to understand your point.

No VOP_BYPASS is needed.  Because this is introduced by BSD, BSD has
these problems.  You can see the reasoning (which is no longer valid)
for the VOP_BYPASS in /sys/miscfs/nullfs/null_vnops.c in front of
null_bypass().

> There are already a number of situations where coherency
> tracking is desireable.  Extending the model across a network
> tops the list.  Being able to use a coherent mmap() on a common
> NFS-served partition from N different machines, for example.

The MNFS code for FreeBSD from the David Sarnoff Center already
addresses the issue of distributed cache coherency, and does it
elegantly, without introducing a whole raft of complexity.

                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.