From owner-freebsd-chat  Sat Apr  3 13:16:43 1999
From: Terry Lambert
To: toor@dyson.iquest.net (John S. Dyson)
Cc: tlambert@primenet.com, dyson@iquest.net, hamellr@dsinw.com,
    unknown@riverstyx.net, freebsd-chat@FreeBSD.ORG
Date: Sat, 3 Apr 1999 21:14:32 +0000 (GMT)
Subject: Re: Linux vs. FreeBSD: The Storage Wars
In-Reply-To: <199904010647.BAA19102@dyson.iquest.net> from "John S. Dyson"
    at Apr 1, 99 01:47:20 am

> > Any putative efficiency penalties (granting their existence for the
> > sake of discussion) would be paid only by the stacking layers
> > themselves, and as it currently doesn't work, you aren't going to be
> > paying an efficiency penalty for anything you currently use.
> >
> > So efficiency is a NULL argument.
>
> It cannot be a NULL argument, because continually polishing the t*rd
> isn't really solving the problem.

And Occam's Razor implies that "anything that works is better than
anything that doesn't".

As long as people cling to an "evolutionary, not revolutionary" mentality
to keep themselves within their comfort zone, you aren't going to be able
to address architectural modifications until after you have a working
architecture to modify.

What you may view as "polishing the turd" is in fact the minimal set of
work necessary to get to a point where evolution is possible.

I personally prefer revolution, since it moves things ahead a hell of a
lot faster, but I'm not in charge.

> > IF VM alias objects are to be introduced (and that's a big mother "if",
> > in my opinion), it should only be done *after* it is proven, using
> > formal analysis methods, that unintentional aliases have been rendered
> > impossible.
>
> The current VM backing scheme is correct and needs only minor extension.
> In fact, the VM backing is natural (e.g. copy on write), whilst the
> current VFS layering doesn't handle the needed semantics for coherency
> without lots of traversal of the layers.

I don't buy this traversal argument, and I don't buy the coherency
argument.

VFS stacking layers are translational, functional, or semantic.

The semantic and functional layers don't need local cache, and can refer
to the underlying layers instead.  These two classes make up the *vast*
majority of what you would ever want to do with VFS stacking.

The translational layers need local cache.  For things like a
cryptographic or compressing FS, the local cache only *correlates* to,
not replicates the contents of, the underlying layer.  This means that
the coherency issue is one of synchronization, and cannot be eliminated
by hand waving or by fiat.
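(A minimal sketch of what I mean, in throwaway user-space C rather than
anything from the tree; the names and the XOR "translation" are purely
hypothetical stand-ins for real crypto or compression.  The point is
that the upper layer's cached bytes are a *function* of the lower
layer's bytes, so a write that reaches the lower layer forces an
explicit synchronization step; the pages can never simply be shared.)

/*
 * Hypothetical illustration (not kernel code): why a translational
 * layer's cache only *correlates* to the layer below it.  The "lower
 * layer" holds ciphertext; the upper layer caches plaintext.
 */
#include <stdio.h>
#include <string.h>

#define PAGE_SZ 16                              /* toy page size */

static unsigned char lower_page[PAGE_SZ];       /* "ciphertext" backing store */
static unsigned char upper_page[PAGE_SZ];       /* translational layer's cache */
static int upper_valid = 0;                     /* coherency state */

/* The translation: a stand-in for real crypto or compression. */
static void translate(const unsigned char *in, unsigned char *out)
{
	for (int i = 0; i < PAGE_SZ; i++)
		out[i] = in[i] ^ 0x5a;
}

/* Fill the upper cache from the lower layer (a toy "getpages"). */
static void upper_getpage(void)
{
	translate(lower_page, upper_page);
	upper_valid = 1;
}

/* A write that reaches the lower layer must invalidate the upper
 * cache; sharing one set of pages cannot express this, because the
 * bytes differ between the layers. */
static void lower_write(const unsigned char *data)
{
	memcpy(lower_page, data, PAGE_SZ);
	upper_valid = 0;                        /* explicit synchronization */
}

int main(void)
{
	lower_write((const unsigned char *)"ciphertext here!");
	upper_getpage();
	printf("upper cache valid: %d\n", upper_valid);

	lower_write((const unsigned char *)"new lower bytes!");
	printf("upper cache valid after lower write: %d\n", upper_valid);
	return 0;
}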
This leaves the small subset of translational layers in which the
translation is (A) linear, (B) scoped to page-sized objects, and (C)
scoped to page aligned boundaries (e.g., it leaves out useful linear
translations such as an FS that represents the tracks on audio CD's as
files).

As far as I can tell, this set contains the single element "MFS backed
by a vnode object as swap store".

> Bottom line, the VM backing already DOES work, or nothing in the system
> would work.

Bottom line: FreeBSD machines in the field are experiencing a
statistically rare type of file corruption that has common
characteristics, and FreeBSD versions including and prior to FreeBSD
2.2.6-RELEASE (*not* -STABLE) did not experience this corruption.

Below the bottom line: It ain't gremlins, and it ain't pilot error on
factory sealed units that don't offer shell access of any kind.

> > The only way I see clear for this to happen is if they don't both
> > exist in the code at the same time.
>
> Yep, get rid of the unintentional VFS layering bugs, by taking advantage
> of the already needed VM layering for any kind of reasonable VM behavior.

I'm referring to the existing VM bugs needing resolution before new VM
bugs are (potentially) introduced by design changes.

> That VM stuff is there anyway, so why muck it up with a parallel, and
> semantically incorrect (or inefficient) structure?  The VM layering
> already has the needed mechanisms for handling shared (and modified)
> memory "repositories."

Files do not exist merely to serve the VM system.  Files have abstract
existence as modelled objects in user programs and ANSI C.  Relatively
speaking, the VM has exposure as "the thing that makes mmap and sbrk
work".

Maybe there should be an entirely new, VM centric programming paradigm,
but until one is ratified by an IEEE committee and given a POSIX ID
number, I won't hold my breath.

> By constraining oneself to the current VFS layering, it simply
> complicates the system with two different kinds of layering schemes.
> Don't forget that sometimes generalization of a problem simplifies it
> -- and the VFS scheme is TOO conventionally-file oriented, and not
> very oriented towards data.

You mean like simultaneously having pty's, sockets, pipes, and FIFO's
complicates things?  Or like having CAM layering complicates things?

I don't forget about problem generalization, but you should not forget
that the major thorns in the VFS side are the *lack* of generalization
in the 4.4BSD kernel with regard to file descriptor objects mapping to
vnodes.

> The "file" abstraction is too specific.  I admit that the VM schemes
> need to be better documented for those who haven't read the Mach (and
> the new daemon book) information, but once the underlying principles
> are understood, it is clear that files are a paradigm that is too
> focused towards one kind of thinking.

I understand that everyone's baby is, to them, the central and most
important point of having a kernel in the first place.  For some people,
that's a realtime scheduler, and everything should be deterministic.

For me, I'd at least like to move forward on the VFS front without
having to reimplement everything in such a way as is most convenient
for the VM system.  I'm personally more concerned with dealing with the
VM issues that exist, and with being able to leverage VFS work taking
place on other platforms.
The second most important thing to me in this context is portability of
VFS code between operating system platforms, with the first being that
the VFS code actually *work* on one or more platforms in the first
place, so that VFS stacking actually takes place somewhere (you can't
leverage something without a fulcrum and a lever).

> Such new documentation would mostly be a repeat of already available
> materials anyway.

But it would be specific, and free of all the irrelevant and extraneous
information that would get in the way of someone attempting to move
forward while taking the constraints of the past into account.

> As soon as a "file" is abstracted to "memory objects", then things
> become easier.

For people who like to treat everything as a memory object.  Too bad
user space code has to deal with files.

> A memory object can reside anywhere, and have all kinds of inheritance
> attributes, and interrelations.  (A file can also, but the scheme as
> presented in 4.4BSD VFS doesn't do so -- and to expand the notion of
> file to what I call "memory objects" changes the current layering code
> so severely as to make it better to almost start over.)

I think inheritance of attributes is inherently evil.  I prefer
inheritance of semantics.  If you inherit attributes, there are no
accessor or mutator functions capable of being hooked to provide other
interesting interpositions and notifications.

It is more useful to me, writing a graphical file manager, for example,
to be able to ask for notification when a directory changes, than it is
for my MFS to be ten times instead of six times faster than the current
MFS.

> The Heidemann framework is a good document on the needed semantics
> from a file standpoint, but addresses weakly the issues of the memory
> objects (be they in memory, on disk, or across a network.)

That's because a memory object is a cached copy of data from somewhere
else.

> With correct protocols, the "memory object" scheme actually does what
> the programmer expects.  The current VFS layering framework only very
> weakly handles the issues of the "data containers" or "memory objects"
> themselves.  The non-bidirectional nature of the current layering also
> forgets the forward movement of OS design.  (Of course, if every I/O
> call or access to memory traverses the entire chain, then the current
> framework might work.)

That's ridiculous.  One of the design tenets, which Rosenthal didn't
address in his version of vnode stacking, is that null layers may be
collapsed out.

The current implementation is suffering in this regard *only* because
there are default VOPs that are actually expected to do something other
than return "not implemented".  This is an error, because in the case
where you add a new VOP, there is no longer a default VOP for every VOP:
it assumes you recompile, then reinstance, everything.  This flies in
the face of the documented, intended architecture.

> The memory oriented approaches eliminate (or at least handle) the
> aliasing and local caching issues correctly.

This is really, really irrelevant.  It assumes that all but a tiny
fraction of all possible file systems will be using local media.

If you don't solve the whole caching issue, you haven't solved the
caching issue.  The correct place for a coherency protocol is in a
separate module that can, if need be, be extended to encompass coherency
between disparately located cache instances and/or clusters of such
instances.  This is obvious from the Sarnoff work in that regard.
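(Another toy sketch, and none of these names exist anywhere; it is only
meant to show the shape of "coherency as a separate module": cache
instances -- local layers or remote peers -- register with the module
and are told when somebody else dirties an object, so the same mechanism
could later be stretched across a network or a cluster.)

/*
 * Hypothetical sketch of a coherency module kept apart from both the
 * VFS and the VM: whoever modifies an object tells the module, and the
 * module notifies every other registered cache instance.
 */
#include <stdio.h>

#define MAX_CACHES 8

typedef void (*invalidate_fn)(int object_id);

static invalidate_fn caches[MAX_CACHES];
static int ncaches;

/* A cache instance joins the coherency domain. */
static int coherency_register(invalidate_fn cb)
{
	if (ncaches >= MAX_CACHES)
		return -1;
	caches[ncaches] = cb;
	return ncaches++;
}

/* A cache instance announces that it modified an object; every other
 * instance is told to drop (or refetch) its copy. */
static void coherency_modified(int self, int object_id)
{
	for (int i = 0; i < ncaches; i++)
		if (i != self)
			caches[i](object_id);
}

static void local_invalidate(int object_id)
{
	printf("local layer: drop cached object %d\n", object_id);
}

static void remote_invalidate(int object_id)
{
	printf("remote peer: drop cached object %d\n", object_id);
}

int main(void)
{
	int local = coherency_register(local_invalidate);
	(void)coherency_register(remote_invalidate);

	/* The local layer writes object 42; only the remote peer is told. */
	coherency_modified(local, 42);
	return 0;
}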
Most of your cache problems are stemming from an assumption that copies
of the pages are hung off of multiple vnodes in a stack, and that this
occurs because vnodes are vm_object_t containers.

This is wrong from three perspectives:

1)	You should *ask* the FS what the correct vm_object_t is, instead
	of dereferencing it out of a vnode directly.

2)	Not all vnodes represent storage objects; many of them are
	containers for semantics, not data.

3)	Those which do contain data *still* don't necessarily represent
	storage objects, but instead, translations of storage objects.

It's *wrong* to turn a vnode into an alias for a vm_object_t.

> The original 4.4/2 framework was so bad that even local mmaped objects
> are only weakly coherent (actually not even that), let alone any other
> caching in the pipeline.  With the memory schemes, the problem solves
> itself (with only minor consideration for the additional expected file
> semantics.)  It is only by the proper implementation of VM coherency
> that the current code works local to a given vnode.  It is only a
> small VM extension, and definition for use, to make an entire layered
> scheme work.

I'll believe this when msync(2) becomes #define msync(x,y,z) /* nothing */.

> By reworking the entire VFS layering scheme (still looking somewhat
> like the current implementation, but properly abstracted) the entire
> solution (instead of a hack solution) can be made available.

I agree that the current code needs changes.  I just don't think it's a
good idea to trade one experiment that doesn't work, because the effort
hasn't been expended to put in the fixes, for another experiment that is
still not guaranteed to work, and still doesn't have the effort behind
it.

I believe the correct approach is to stabilize the existing code in line
with its original design document.

> Remember, both FILE and MEMORY data needs to be presented to the user,
> and FILE data is a narrow picture of memory.  MEMORY can easily be made
> more specific by presenting it as a file -- however expanding the
> semantics of a file to memory is more complex (especially with
> sharing.)  When a conversion to MEMORY from FILE and back again has to
> be done at every layer, then a scheme is going to be very inefficient
> or complex.  If the abstraction is kept as memory at each layer, then
> complexities are lessened.

First off, file data is a hell of a lot more prevalent than memory data.
Everything on the entire net can be represented as file data, and even
if you made a heroic effort, you simply don't have the bits to represent
it as memory.  Memory data must be viewed as cached copies of file data,
with the potential that there is no file data backing a particular
object (in other words, it's a cached copy of very volatile file data,
and only the cached copy is useful).

Second, as pointed out above, you only have to do the conversion at
layers where you imply a relationship between a vnode (file data) and a
vm_object_t (memory data).  This relationship need not be implied
*everywhere*; it need only be implied for:

1)	The terminal backing vnode itself

2)	Translational layers, where the translational layer vnode itself
	contains a cached version of the translated data from the
	terminal backing object

(Semantic layers, obviously, don't need vm_object_t's at all, since they
don't have persistent data.)

In the first case, there's no coherency issue, because there are no
"ghosts".
In the second case, there's a coherency issue, but it's an issue that
*must* be handled by the layer itself, since it contains the code that
understands the translation process (and its reverse, if any).  This
means that if you want to implement a cryptographic or a compressing FS,
*you have no choice* but to implement the code for VOP_{GET|PUT}PAGES,
and let the VFS layer be responsible for the semantics.  The same goes
for VFS layers like the NFS client.

The problems *only* occur when you *insist* that a non-terminal vnode
have a vm_object_t.  It is this insistence which is flawed, not the
architecture that can't serve this insistence.

> Since each layer might have to present a memory image (either as
> caching or mmap), then with a file representation, each layer has to do
> the "hard" conversion (given the anachronistic file-only abstraction.)
> There is NO cost in keeping the abstraction as memory as long as
> possible in the chain.  If a conversion is needed at machine
> boundaries, it might be possible to avoid the file abstraction
> entirely, and create a (MEMORY <--> SOCKET) protocol directly.  (It
> might not be needed to create and use a more complex
> (MEMORY <--> NFS <--> SOCKET) thing.)

This rests on an assumption that I believe has very little basis: that
most intermediate VFS stacking layers will have exposure in the
filesystem namespace.

You would *not* want this for cryptographic, ACL, file forking,
auditing, monitoring, activating, event generating, quota enforcing,
versioning, etc., etc. layers for the VFS's underneath them.  It would
totally defeat their utility.

The only cases where I can see this being useful are union, transparent,
overlay, whiteout, and similar situations.  These situations are
relatively unique for several reasons:

o	They generally apply to multiple exposure of *the same layer* in
	different places in the namespace, and thus there are no cache
	coherency issues.

o	Where they *do* expose both the VFS and the VFS it is stacked
	upon to the namespace, the stacked VFS is a hybrid semantic and
	translational VFS layer.

The interesting thing to note about hybrids is that they store their
data using either structural or namespace escapes.  In either case,
their data objects are effectively "exclusive use".  It would be a
trivial extension to allow an upper layer to tell a lower layer to
"guard these objects from change by anyone but me".

Do I want to rewrite the VFS layering so that the bottom end is not an
interface to the VM system at all, but instead a VFS representation of a
variable granularity block store, which itself interfaces to the VM on
behalf of all other VFS layers?  Yes.

Do I *need* to do this?  For some applications, it'd be damn convenient
to have the VFS architecture be symmetric for *all* VFS's, but it's not
a requirement to make the current architecture useful.

> > > Why do you put words in my mouth about doubling inode size?
> > > Straw man...
> >
> > You are mentioning ACL's.  The most current FS ACL work is being done
> > in NetBSD (not FreeBSD).  I thought you were referencing a modern
> > research project when you referenced ACL's.  My mistake.
>
> Yep... By assuming what I have been thinking about, it shows that
> arguments about such might be misguided.

Sorry.  I was recently approached about the work.  I should have looked
for zebras when I heard the ACL hoofbeats.  ;-)

> > Fie.  You are the one who originally posted about seeing years of
> > work frittered away.  I am not prepared to repeat that journey; it is
> > a fool's quest.
> Fallacious argument -- you aren't the author of the original code or
> those changes, are you?  The author of the code apparently accepted the
> changes.  (In fact, the changes were also compatible with other users
> and developers on the codebase.)

I'm the author of at least two instances of the fixes that resolve the
VFS stacking issues, both of which were "frittered away".

[ ... on storing 64 bits of seconds for ctime, atime, and mtime ... ]

> I suggest coming up with a solution then.  Of course, I suggest that
> UFS/ODS2 needs to be thought through.  Taking micro pot-shots doesn't
> really solve the problem (or the other problems that needed to be
> solved in the shorter term.)

Here's a solution:

1)	Back out the use of those fields for nanoseconds.

2)	Zero the co-opted fields.  This is the painful part, but you can
	use a bit in fs_unused_1 if you have to, or you could be truly
	sneaky and use inode 1.  My preference would be adding an
	fs_flag value.  However it's done, penance for the sin must be
	paid.

3)	Take a spare field for nanoseconds for *just* mtime.

4)	Modify make to look there instead of at the mtime element of the
	co-opted fields.

5)	Modify the quota code to put a header containing a magic number
	and a version number in the frigging file, so that we can tell
	the difference between old and new files at mount time.

6)	Make time_t an int64_t instead of an int32_t in the new file
	format, and eat the performance hit with the old file format.

7)	While you are at it, remove the 32 bit limitation on the block
	count.

8)	Add an option to quota(1) to upgrade the file.

9)	Know in your heart of hearts that quotas should be implemented
	as a VFS stacking layer using a namespace override, so that
	quotas can work on any filesystem, not just UFS.

10)	Find someone willing to commit the code before you write it.


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-chat" in the body of the message