From owner-freebsd-arch@FreeBSD.ORG  Thu Jun 19 01:25:37 2003
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id DBCD637B401; Thu, 19 Jun 2003 01:25:37 -0700 (PDT)
Received: from stork.mail.pas.earthlink.net (stork.mail.pas.earthlink.net
	[207.217.120.188])	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 0DC9343FD7; Thu, 19 Jun 2003 01:25:36 -0700 (PDT)
	(envelope-from tlambert2@mindspring.com)
Received: from user-2ivfk2f.dialup.mindspring.com ([165.247.208.79]
	helo=mindspring.com)
	by stork.mail.pas.earthlink.net with asmtp (SSLv3:RC4-MD5:128)
	(Exim 3.33 #1)	id 19Suix-0007Eg-00; Thu, 19 Jun 2003 01:24:52 -0700
Message-ID: <3EF172EF.1248AD97@mindspring.com>
Date: Thu, 19 Jun 2003 01:23:11 -0700
From: Terry Lambert <tlambert2@mindspring.com>
X-Mailer: Mozilla 4.79 [en] (Win98; U)
X-Accept-Language: en
MIME-Version: 1.0
To: The Hermit Hacker <scrappy@hub.org>
References: <20030618112226.GA42606@fling-wing.demos.su>
	<20030618121620.GG835@starjuice.net> <20030618202302.W51411@hub.org>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-ELNK-Trace: b1a02af9316fbb217a47c185c03b154d40683398e744b8a4480afc112eafc866ef1b3e2e8a97c640a2d4e88014a4647c350badd9bab72f9c350badd9bab72f9c
cc: Dmitry Sivachenko <demon@FreeBSD.org>
cc: Poul-Henning Kamp <phk@phk.freebsd.dk>
cc: "Tim J. Robbins" <tjr@FreeBSD.org>
cc: arch@FreeBSD.org
Subject: Re: cvs commit: src/sys/fs/nullfs null.h null_subr.c null_vnops.c
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: fs@freebsd.org
List-Id: Discussion related to FreeBSD architecture
	<freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 19 Jun 2003 08:25:38 -0000

The Hermit Hacker wrote:
> 'K, this kinda hurts ... there are a growing # of us that are actually
> using unionfs and nullfs on production systems ... not small servers, but
> several thousand processes with over 100 union mounts ... other than the
> vnode leak stuff that David has been investigating, I've yet to see
> anything that I would consider warranting the 'DO NOT USE / CAVEAT
> EMPTOR' that is in the man pages ... :(

Use mmap on a bunch of files on a nullfs, and don't do msync()
to perform an explicit coherency cycle.  Modify the original
underlying files.  Do this for different areas of partial pages
on both the nullfs and the FS the nullfs is covering.

1)	There is no explicit coherency notification to the
	covering FS when the covered FS's vnode data is
	modified.

2)	There is no explicit coherency cycle for mapped pages
	when a write occurs, if the page being written is in
	core.
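
A minimal sketch of the test described above, for anyone who wants
to try it.  The function names, the temp-file helper and the paths
you feed it are my assumptions for illustration; only the mmap(2),
pwrite(2) and mkstemp(3) calls are real interfaces.  It probes a
single byte, so it is a starting point rather than the full partial
page test.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a one-byte file ("a") from a mkstemp(3)
 * template.  The mapping needs at least one byte of backing store.
 * Returns 0 on success, -1 on error.
 */
int
make_probe_file(char *tmpl)
{
	int fd = mkstemp(tmpl);

	if (fd < 0)
		return (-1);
	if (write(fd, "a", 1) != 1) {
		close(fd);
		unlink(tmpl);
		return (-1);
	}
	close(fd);
	return (0);
}

/*
 * Map the first byte of map_path, then overwrite that byte through
 * write_path with pwrite(2), with no msync(2) in between.  Pick a
 * newval that differs from the byte already in the file.
 * Returns 1 if the mapping shows the new byte, 0 if it still shows
 * the old one, and -1 on any error.
 */
int
mapping_sees_write(const char *map_path, const char *write_path, char newval)
{
	int mfd, wfd, result = -1;
	char *p;

	mfd = open(map_path, O_RDONLY);
	if (mfd < 0)
		return (-1);
	wfd = open(write_path, O_WRONLY);
	if (wfd < 0) {
		close(mfd);
		return (-1);
	}
	p = mmap(NULL, 1, PROT_READ, MAP_SHARED, mfd, 0);
	if (p != MAP_FAILED) {
		volatile char *vp = p;

		(void)*vp;	/* touch the page so it is in core first */
		if (pwrite(wfd, &newval, 1, 0) == 1)
			result = (*vp == newval);
		munmap(p, 1);
	}
	close(wfd);
	close(mfd);
	return (result);
}
```

Call it once with both paths naming the file on the covered FS, and
once with map_path under the nullfs mount and write_path on the
covered FS (same file, two names), and compare the results.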

Basically, in order to support this, you will have to unmap the
pages for write, take the fault, and then restart the write with
the knowledge that you need to trigger a write-through (or a
write-back) as a result of having triggered the fault: in other
words, an explicit coherency cycle.
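
As a toy userland model of that cycle (not FreeBSD VM code; the
struct, the function names and the 16 byte "page" are all made up
for illustration), the sequence looks roughly like this:

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16

/* Hypothetical toy page; this is not a FreeBSD vm_page. */
struct toy_page {
	int  writable;			/* 0 = mapped read-only */
	int  faults;			/* write faults taken so far */
	char upper[TOY_PAGE_SIZE];	/* copy seen through the mapping */
	char lower[TOY_PAGE_SIZE];	/* copy held by the covered FS */
};

/* Fault handler: count the fault and grant write access. */
static void
toy_write_fault(struct toy_page *pg)
{
	pg->faults++;
	pg->writable = 1;
}

/*
 * Store one byte through the mapping.  The page is normally
 * write-protected, so the store faults first; after the restarted
 * store completes, the page is written through to the lower copy and
 * protected again, so the next store faults as well.
 * Returns 0 on success, -1 if off is out of range.
 */
int
toy_store(struct toy_page *pg, size_t off, char c)
{
	if (off >= TOY_PAGE_SIZE)
		return (-1);
	if (!pg->writable)
		toy_write_fault(pg);	/* take the fault, restart the write */
	pg->upper[off] = c;
	memcpy(pg->lower, pg->upper, TOY_PAGE_SIZE);	/* write-through */
	pg->writable = 0;		/* unmap for write again */
	return (0);
}
```

The point of re-protecting the page is that every store pays for a
fault, which is where the cost of an explicit cycle comes from.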

The current nullfs code avoids this by having a 1:1 page mapping
and using a trick I came up with, which is to get the underlying
vm_object_t from the underlying vnode, instead of the nullfs
vnode.  But it pays a rather large performance penalty.
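
The shape of the trick, as a toy sketch rather than the actual
nullfs source (the types and the function name here are invented
stand-ins for the vnode and vm_object_t, not kernel definitions):

```c
#include <stddef.h>

/* Hypothetical stand-in for a VM object. */
struct toy_object {
	int id;
};

/* Hypothetical stand-in for a vnode in a stack of layers. */
struct toy_vnode {
	struct toy_vnode  *lower;	/* covered vnode, NULL at the bottom */
	struct toy_object *obj;		/* backing object, bottom layer only */
};

/*
 * Return the backing object for vp by walking down to the bottom
 * vnode, so every layer in the stack shares one set of pages.
 */
struct toy_object *
toy_getvobject(struct toy_vnode *vp)
{
	while (vp != NULL && vp->lower != NULL)
		vp = vp->lower;
	return (vp != NULL ? vp->obj : NULL);
}
```

Because both layers hand back the same object, there is only one
copy of each page and nothing to keep coherent, which is exactly
why it only works when the mapping between layers is 1:1.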


The other problem is that it gives the wrong impression about
FS stacking in FreeBSD: it gives the impression that it works
in other than the specialized contrived case of nullfs.

This does not (and can not) work with transformative stacking
layers, such as a crypto stacking layer, a character set
translation stacking layer (e.g. a Koi-8 FS NFS mounted on an
ISO-8859-1 Locale system, which needs the Koi-8 data UTF-8
encoded before it can be displayed in a file browser), and a
number of other layers.

The page trick suggested above also fails in some cases; for
example, consider the case where you have a very fast disk
for the first 2K of each file, and a slower disk for the
remainder of each file (if any).  The data break spans a page
boundary, and therefore you can't deal with it.
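
To put numbers on that, assuming a 4K page size (my assumption; the
2K break is from the example above), the check below shows that
page 0 of every file would need bytes from both disks, so no single
underlying object can back it.  The function name is made up for
illustration.

```c
#include <stddef.h>

/*
 * Return 1 if the page covering [page_index * page_size,
 * (page_index + 1) * page_size) holds bytes on both sides of
 * break_off, i.e. the break falls strictly inside the page.
 */
int
page_spans_break(size_t page_index, size_t page_size, size_t break_off)
{
	size_t start = page_index * page_size;
	size_t end = start + page_size;

	return (start < break_off && break_off < end);
}
```

With a break that happened to land on a page boundary (say 4K with
4K pages) no page is split and the 1:1 trick would still hold.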

In a similar vein, if you proxy your VOP descriptors to another
address space, you are screwed, because vnodes are assumed to
contain vm_object_t's, and these are assumed to be locally
accessible to the address space in question (how do you implement
a VOP_GETVOBJECT() when the vnode you are referencing is in user
space?  Is on another node?  Etc.?).

Paging VOPs almost need an internal payload of a page or page
set, coupled with an address space descriptor, in order to let
them know if the called party can access them directly, rather
than needing to call a rendezvous data copy operation.
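
One possible shape for such a payload, purely as a sketch: nothing
like this exists in the FreeBSD VOP interface, and every name below
is hypothetical.

```c
#include <stddef.h>

/* Hypothetical payload carried by a paging VOP. */
struct toy_page_payload {
	void   *pages;		/* page or page set being passed */
	size_t  npages;		/* number of pages in the set */
	int     as_id;		/* address space the pages live in */
};

enum toy_access {
	TOY_DIRECT,		/* callee can touch the pages itself */
	TOY_COPY		/* callee needs a rendezvous data copy */
};

/*
 * Decide how the called layer reaches the pages: directly if it runs
 * in the same address space as the payload, by copy otherwise (user
 * space proxy, another node, and so on).
 */
enum toy_access
toy_choose_access(const struct toy_page_payload *pl, int callee_as)
{
	return (pl->as_id == callee_as ? TOY_DIRECT : TOY_COPY);
}
```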

If you read John Heidemann's Master's thesis (ftp.cs.ucla.edu),
or the Ficus documentation (same FTP server), which are the
basis of the stacking vnode framework in 4.4BSD-Lite2, and thus
in FreeBSD, you'll see that these problems have already got
answers, they just aren't being implemented in FreeBSD, and as
FreeBSD moves further from the original intended design, it's
only going to get harder to recover the functionality.

Really, the stacking in FreeBSD today is pretty much a toy.  The
reason FFS can stack on UFS is that the VOPs being exported are
not really stacked: they represent two non-intersecting sets of
VOPs.  One is for a flat numeric namespace (inode numbers) FS,
called UFS (or UFS2, or, formerly, MFS); the upper layer, FFS,
implements a hierarchical namespace in the context of the
underlying flat numeric namespace.

There are a couple of interesting things you can do without really
stacking (causing the VOP namespaces to intersect, thus introducing
the coherency issue); one of these would be to separate out the
disk quota interface.  With the exception of the quota VOP that's
needed, everything else is non-intersecting in the same way that
the nullfs is non-intersecting: there's no upper layer vm_object_t
reference needed to implement it.  Combine that with the VOP for
the quota control operations being non-intersecting in the VOP
namespace (like the VOP for directory operations not being in the
UFS namespace), and you have sufficient separation to implement
quotas in the context of a decoherent stacked cache, because you
never need to reference both the upper and lower vnode's
vm_object_t for any given vnode.

But the FreeBSD implementation is probably far from useful, without
the coherency notification mechanisms for "upper dirty/write through
to lower" and "lower dirty/invalidate upper cached copy".  Those just
aren't there, and the framework totally lacks the necessary semantics
for the second one, at the present time.
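
For concreteness, here is a toy userland model of those two
notifications over a transformative layer.  The XOR "cipher", the
struct and the function names are all invented for illustration;
none of this is FreeBSD code.  Since the upper and lower bytes
differ, the two layers cannot share one set of pages, and each side
has to tell the other when it changes something.

```c
#include <stddef.h>

#define TOY_LEN 8
#define TOY_KEY 0x5a	/* hypothetical XOR "cipher" key */

/* Hypothetical two-layer stack with a cached upper copy. */
struct toy_stack {
	unsigned char lower[TOY_LEN];	/* transformed bytes, as stored */
	unsigned char upper[TOY_LEN];	/* cached plain bytes */
	int           upper_valid;	/* 0 = upper cache must be refilled */
};

/* Refill the upper cache from the lower layer. */
static void
toy_fill(struct toy_stack *s)
{
	size_t i;

	for (i = 0; i < TOY_LEN; i++)
		s->upper[i] = s->lower[i] ^ TOY_KEY;
	s->upper_valid = 1;
}

/* Read a plain byte through the upper layer; -1 if out of range. */
int
toy_upper_read(struct toy_stack *s, size_t off)
{
	if (off >= TOY_LEN)
		return (-1);
	if (!s->upper_valid)
		toy_fill(s);
	return (s->upper[off]);
}

/* "Upper dirty/write through to lower". */
int
toy_upper_write(struct toy_stack *s, size_t off, unsigned char c)
{
	if (off >= TOY_LEN)
		return (-1);
	if (!s->upper_valid)
		toy_fill(s);
	s->upper[off] = c;
	s->lower[off] = c ^ TOY_KEY;
	return (0);
}

/* "Lower dirty/invalidate upper cached copy". */
int
toy_lower_write(struct toy_stack *s, size_t off, unsigned char raw)
{
	if (off >= TOY_LEN)
		return (-1);
	s->lower[off] = raw;
	s->upper_valid = 0;
	return (0);
}
```

Take the invalidation out of toy_lower_write() and the upper layer
keeps serving the stale byte, which is the situation a stacked
layer is in without the second notification.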

There are a number of deadlock issues in the unionfs case; most
people don't use that, and use the union mount option, which is
not the same thing at all.  Most of these problems are centered
around things like relookup, etc., which have to drop and then
reacquire a lock to avoid an internal deadlock (e.g. "rename");
by doing this, they open a small race window, in which it's
possible, with the right call-path pressure, to create a deadlock
between concurrently executing threads of control.  The window
is much more pronounced on SMP systems, which are statistically
much more likely to hit it.

Followups set to Freebsd-FS.

-- Terry