Message-ID: <3EF172EF.1248AD97@mindspring.com>
Date: Thu, 19 Jun 2003 01:23:11 -0700
From: Terry Lambert
To: The Hermit Hacker
cc: Dmitry Sivachenko
cc: Poul-Henning Kamp
cc: "Tim J. Robbins"
cc: arch@FreeBSD.org
Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c
References: <20030618112226.GA42606@fling-wing.demos.su>
 <20030618121620.GG835@starjuice.net> <20030618202302.W51411@hub.org>
Reply-To: fs@freebsd.org

The Hermit Hacker wrote:
> 'K, this kinda hurts ... there are a growing # of us that are actually
> using unionfs and nullfs on production systems ... not small servers, but
> several thousand processes with over 100 union mounts ...
> other then the vnode leak stuff that David has been investigating,
> I've yet to see anything that I would considering warranting the
> 'DO NOT USE / CAVEAT EMPTOR' that is in the man pages ... :(

Use mmap on a bunch of files on a nullfs, and don't do msync() to
perform an explicit coherency cycle.  Modify the original underlying
files.  Do this for different areas of partial pages on both the
nullfs and the FS the nullfs is covering.

1)	There is no explicit coherency notification to the covering FS
	when the covered FS's vnode data is modified.

2)	There is no explicit coherency cycle for mapped pages when a
	write occurs, if the page being written is in core.

Basically, in order to support this, you will have to unmap the pages
for write, take the fault, and then restart the write with the
knowledge that you need to trigger a write-through (or a write-back)
as a result of having triggered the fault: in other words, an explicit
coherency cycle.

The current nullfs code avoids this by having a 1:1 page mapping and
using a trick I came up with, which is to get the underlying
vm_object_t from the underlying vnode, instead of the nullfs vnode.
But it pays a rather large performance penalty.

The other problem is that it gives the wrong impression about FS
stacking in FreeBSD: it gives the impression that it works in other
than the specialized, contrived case of nullfs.

This does not (and cannot) work with transformative stacking layers,
such as a crypto stacking layer, a character set translation stacking
layer (e.g. a Koi-8 FS NFS mounted on an ISO-8859-1 Locale system,
which needs the Koi-8 data UTF-8 encoded before it can be displayed in
a file browser), and a number of other layers.

The page trick suggested above also fails in some cases; for example,
consider the case where you have a very fast disk for the first 2K of
each file, and a slower disk for the remainder of each file (if any).
The data break spans a page boundary, and therefore you can't deal
with it.

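The split-file case can be made concrete with a little arithmetic.
This is only a sketch of one reading of the example: the 4096-byte
page size, the device labels and the function name backing_devices
are assumptions for illustration, not anything taken from the nullfs
code.

```python
PAGE_SIZE = 4096    # assumed page size; the message does not state one
FAST_BYTES = 2048   # "the first 2K of each file" lives on the fast disk


def backing_devices(page_index, file_size,
                    page_size=PAGE_SIZE, split=FAST_BYTES):
    """Return the set of (hypothetical) devices holding bytes of one file page."""
    start = page_index * page_size
    end = min(start + page_size, file_size)  # exclusive end of valid data
    devices = set()
    if start >= end:
        return devices                       # page lies past end of file
    if start < split:
        devices.add("fast")                  # some bytes come before the 2K break
    if end > split:
        devices.add("slow")                  # some bytes come after the 2K break
    return devices


for page in range(3):
    print(page, sorted(backing_devices(page, file_size=10000)))
```

Under these assumptions, page 0 of any file longer than 2K needs data
from both devices, so there is no single lower object whose pages map
1:1 onto the upper file's pages.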
In a similar vein, if you proxy your VOP descriptors to another
address space, you are screwed, because vnodes are assumed to contain
vm_object_t's, and these are assumed to be locally accessible to the
address space in question (how do you implement a VOP_GETVOBJECT()
when the vnode you are referencing is in user space?  Is on another
node?  Etc.?).  Paging VOPs almost need an internal payload of a page
or page set, coupled with an address space descriptor, in order to let
them know if the called party can access them directly, rather than
needing to call a rendezvous data copy operation.

If you read John Heidemann's Master's thesis (ftp.cs.ucla.edu), or the
Ficus documentation (same FTP server), which are the basis of the
stacking vnode framework in BSD4.4-Lite2, and thus in FreeBSD, you'll
see that these problems already have answers; they just aren't being
implemented in FreeBSD, and as FreeBSD moves further from the original
intended design, it's only going to get harder to recover the
functionality.

Really, the stacking in FreeBSD today is pretty much a toy.  The
reason FFS can stack on UFS is that the VOPs that are being exported
are not really stacked, because they represent two non-intersecting
sets of VOPs: one is for a flat numeric namespace (inode numbers) FS,
called UFS (or UFS2, or also... formerly... MFS), and the upper layer
FFS implements a hierarchical namespace in the context of the
underlying flat numeric namespace.

There are a couple of interesting things you can do without really
stacking (causing the VOP namespaces to intersect, thus introducing
the coherency issue); one of these would be to separate out the disk
quota interface.  With the exception of the quota VOP that's needed,
everything else is non-intersecting in the same way that the nullfs is
non-intersecting: there's no upper layer vm_object_t reference needed
to implement it.
Combine that with the VOP for the quota control operations being
non-intersecting in the VOP namespace (like the VOP for directory
operations not being in the UFS namespace), and you have sufficient
separation to implement quotas in the context of a decoherent stacked
cache, because you never need to reference both the upper and lower
vnode's vm_object_t for a given particular vnode.

But the FreeBSD implementation is probably far from useful, without
the coherency notification mechanisms for "upper dirty/write through
to lower" and "lower dirty/invalidate upper cached copy".  Those just
aren't there, and the framework totally lacks the necessary semantics
for the second one, at the present time.

There are a number of deadlock issues in the unionfs case; most people
don't use that, and use the union mount option, which is not the same
thing at all.  Most of these problems are centered around things like
relookup, etc., which have to drop and then reacquire a lock to avoid
an internal deadlock (e.g. "rename"); by doing this, they open a small
race window, in which it's possible, with the right call-path
pressure, to create a deadlock between concurrently executing threads
of control.  The window is much more pronounced on SMP systems, which
are statistically much more likely to hit it.

Followups set to Freebsd-FS.

-- Terry