From owner-freebsd-fs Fri Dec 18 13:42:28 1998
Return-Path:
Received: (from majordom@localhost) by hub.freebsd.org (8.8.8/8.8.8)
	id NAA04953 for freebsd-fs-outgoing; Fri, 18 Dec 1998 13:42:28 -0800 (PST)
	(envelope-from owner-freebsd-fs@FreeBSD.ORG)
Received: from smtp04.primenet.com (smtp04.primenet.com [206.165.6.134])
	by hub.freebsd.org (8.8.8/8.8.8) with ESMTP id NAA04948
	for ; Fri, 18 Dec 1998 13:42:27 -0800 (PST)
	(envelope-from tlambert@usr09.primenet.com)
Received: (from daemon@localhost) by smtp04.primenet.com (8.8.8/8.8.8)
	id OAA27859; Fri, 18 Dec 1998 14:42:14 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209) via SMTP by
	smtp04.primenet.com, id smtpd027680; Fri Dec 18 14:42:06 1998
Received: (from tlambert@localhost) by usr09.primenet.com (8.8.5/8.8.5)
	id OAA11441; Fri, 18 Dec 1998 14:41:55 -0700 (MST)
From: Terry Lambert
Message-Id: <199812182141.OAA11441@usr09.primenet.com>
Subject: Re: nullfs bugs
To: ezk@cs.columbia.edu (Erez Zadok)
Date: Fri, 18 Dec 1998 21:41:55 +0000 (GMT)
Cc: freebsd-fs@FreeBSD.ORG
In-Reply-To: <199812181753.MAA05461@shekel.mcl.cs.columbia.edu> from
	"Erez Zadok" at Dec 18, 98 12:53:27 pm
X-Mailer: ELM [version 2.4 PL25]
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> * nullfs for FreeBSD 3.0
>
> When I started with nullfs on FreeBSD 3.0 (the May 98 snapshot), I
> found out that it was not a complete file system.  Some VFS operations
> were left unimplemented, most notably the MMAP ones.  I could mount
> nullfs, but trying to do any MMAP operation (such as executing a
> binary) would panic the kernel.

Right.  Here's the scoop.

Right now in FreeBSD, a vnode is treated as a backing object, and a
backing object is a mapping.  This is a consequence of the unified VM
and buffer cache.
When you have a vnode stacked on another vnode, you have an aliasing
problem to resolve: which vnode has the correct page information hung
off of it?

> ** Bugs in Nullfs
>
> [ ... in reverse order ... ]
>
> (2) Getpages/Putpages:
>
> The second bug is even stranger.  Initially, I had the implementation
> of getpages and putpages call the same VOP on lowervp, with newly
> allocated pages.  But then under heavy loads I got obscure problems
> that seemed to come from deep inside UFS.  It would sometimes return
> from ffs_getpages() (in ufs_readwrite.c) with an invalid page, or one
> marked as deadc0de.  I tried to make sense of that ufs/ffs code, and
> I think that somewhere either nullfs or the higher-level VFS isn't
> locking or synchronizing something it should.

Right.  This is confusion about the backing object, per the above.

> I "fixed" the problem with getpages by implementing it using read(),
> so now it works reliably, but with a suboptimal data access interface.
>
> Having implemented getpages() using read() forced me to implement
> putpages() using write(), because otherwise getpages and putpages
> didn't seem to work well together (possibly because of interaction
> between the [buffer] caches, the MMU, etc.).  But recall that in
> order to solve bug #1, I made write() synchronous.  So now all
> putpages() calls have become synchronous as well.
>
> Like I said before, these fixes of mine are but workarounds.  Some
> might consider them hacks.  But they do make nullfs fully functional,
> at least.  If anyone has any idea how to fix this MMAP-related bug,
> please let me know.

These fixes will actually only work for a stack that is exactly one
layer deep.  This is because the lower_vp is the object off of which
the pages are actually hung.  If you were to use this on a nullfs
stacked on top of another nullfs, you would probably see some errors
(unless you implemented read in terms of VOP_GETPAGES).
The reason for this is that your read is creating a copy of the data
that is hung off the lower_vp, and then returning it to a user buffer.
The problem here is that the top layer is going to issue a similar read
to the middle layer, and it's going to fail because there is no backing
object in the middle layer (only in the bottom layer).  This can be
brute-forced to work (I believe Tor Egge is the one who did this at one
time?) by instancing a backing object in the intermediate layers.

The reason this works with read/write and not with getpages and
putpages is that you establish a copy instead of an alias.  Using
copies like this introduces cache coherency problems similar to those
in a non-unified VM and buffer cache, and given the unification in
FreeBSD, FreeBSD is pretty much totally unprepared to deal with
maintaining coherency at this level, especially if a namespace is
exposed to the user both above and below a stacking layer (e.g., with
an ACL or cryptographic FS).

The general solution to this, which has been discussed by John
Heidemann, John Dyson, Michael Hancock, Eivind Eklund, Kirk McKusick,
and myself at various times in the past, is to get rid of the aliases.
The only way to effectively do that is to provide a mechanism for an
upper layer to ask for the vp of the backing object that's actually
backing the VM, instead of the top-level object.  The main one that has
been discussed is called VOP_GETFINALVP, or, more correctly,
VOP_GETBACKINGVP.

This can actually be implemented at low cost, since the only layer that
really cares about doing the call is a layer with a VFS interface on
both the top and the bottom.  So it doesn't affect the NFS client code
(a VFS provider), the FFS code (a VFS provider, like all local media
file systems), the NFS server code (a VFS consumer), or the system
call layer (another VFS consumer).
So basically, only the stacking layers take this hit, and then only in
the case that they are doing data translation (crypto/compression) or
object proxying.

This is probably the best way to resolve this problem, since it hides
the details of the VM implementation from the stacking layers.  Even if
you were to use a non-unified VM and buffer cache (e.g., SVR4), you
would want to isolate the dependency on VM and buffer cache interaction
so as to reduce the amount of system dependency in the code.  So this
is a win either way.

> (1) Asynchronous writes:
>
> The vanilla nullfs has a serious bug where if you write a large file
> (3MB or more) through it, several pages of the file are written as
> zeros to the lower f/s.  I tried various machines running FreeBSD 3.0,
> and different disks and CPU speeds.  In all cases I got the same data
> corruption.

Yes.  This is an alias problem, where the coherency between the upper-
and lower-level objects is not being maintained.  This happens because
there is no read-before-write, as a normal FS would do for a write that
does not fall on FS blocksize boundaries.  To confirm this, verify the
size and offset of the corrupted extents (this should be a pretty
trivial exercise).

> The best "fix" I could find was to force the underlying write to
> happen synchronously:
>
>     error = VOP_WRITE(lower_vp, &temp_uio, (ioflag | IO_SYNC), cr);
>
> That solved the problem, but obviously it hurts write performance,
> since now all writes through nullfs have to be done synchronously,
> even for writing one byte.

Yeah.  This is an explicit synchronization, which happens to ensure
cache coherency between the two backing objects, when there should only
be one backing object.

> My best guess for the reason for this bug is that there might be a
> race condition between the file system and the buffer cache, or even
> the MMU, and that some sort of locking/synchronization is needed to
> avoid the race.
Again, the answer is to avoid all of this via explicit coherency, and
the way to do that is to eliminate the aliases and, in this particular
case, the cached copies of partial data.

> I'm familiar with the f/s code in FreeBSD, and have become very
> familiar with the vfs/fs code in Linux and Solaris --- enough to know
> that this FreeBSD bug is likely not the fault of my code.  Alas,
> there are vast areas of the rest of the kernel I'm not familiar with.
> I want to fix the bug correctly if possible, and allow nullfs to
> write asynchronously, but I'm not sure where to look.

Well, then you have to know that the FreeBSD code is a hell of a lot
more flexible and useful, if done right.  8-).

These issues are pretty well understood, but there needs to be an
architectural pass over the code with a view toward stacking.  This has
actually been my own pet hobby horse for at least three years now.
It's gotten to the point that enough people understand the issues and
the problems that fixing them is becoming a political possibility.

> Frankly, I have a feeling that the two bugs I'm reporting here may be
> related, and that fixing bug #1 would be easier and might impact the
> solution to bug #2.

Actually, #2 would be easiest, and would result in #1 being fixed as
well, by eliminating the potential coherency race that comes from using
the fault handler instead of an explicit copy (read).

I'm going to be intentionally incommunicado for a while, as I'm going
on vacation, but I'll probably break down and read my email once or
twice, so if you have something needing immediate clarification, you
can send me email, but I may not respond before the first of the year.

Other people to contact who appear to be actively interested in solving
these issues are Eivind Eklund and Michael Hancock, so they may be good
bets as well.

					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.
To Unsubscribe: send mail to majordomo@FreeBSD.org with "unsubscribe freebsd-fs" in the body of the message