From owner-freebsd-fs Mon Sep 11 14:55:45 2000
Delivered-To: freebsd-fs@freebsd.org
Received: from smtp01.primenet.com (smtp01.primenet.com [206.165.6.131])
	by hub.freebsd.org (Postfix) with ESMTP id 4B8B437B43C;
	Mon, 11 Sep 2000 14:55:40 -0700 (PDT)
Received: (from daemon@localhost)
	by smtp01.primenet.com (8.9.3/8.9.3) id OAA16115;
	Mon, 11 Sep 2000 14:54:59 -0700 (MST)
Received: from usr09.primenet.com(206.165.6.209)
	via SMTP by smtp01.primenet.com, id smtpdAAAoMaOCF;
	Mon Sep 11 14:54:54 2000
Received: (from tlambert@localhost)
	by usr09.primenet.com (8.8.5/8.8.5) id OAA18763;
	Mon, 11 Sep 2000 14:55:27 -0700 (MST)
From: Terry Lambert
Message-Id: <200009112155.OAA18763@usr09.primenet.com>
Subject: Re: CFR: nullfs, vm_objects and locks... (patch)
To: bp@butya.kz (Boris Popov)
Date: Mon, 11 Sep 2000 21:55:27 +0000 (GMT)
Cc: freebsd-fs@FreeBSD.ORG, dillon@FreeBSD.ORG, semenu@FreeBSD.ORG, tegge@FreeBSD.ORG
In-Reply-To: from "Boris Popov" at Sep 05, 2000 06:02:19 PM
X-Mailer: ELM [version 2.5 PL2]
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Sender: owner-freebsd-fs@FreeBSD.ORG
Precedence: bulk
X-Loop: FreeBSD.org

> The last few days I've spent trying to make nullfs really functional
> and stable.  There are many issues with the current nullfs code, but
> below I'll try to outline the most annoying ones.
>
> The first one is an inability to handle the mmap() operation.  This
> comes from the VM/vnode_pager design, where each vm_object is
> associated with a single vnode and vice versa.  Looking at the
> problem in general, one may note that stackable filesystems may
> either have a separate vm_object per layer or have none at all.
> Since nullfs essentially maps its vnodes to the underlying
> filesystem, it is reasonable to map all operations to the underlying
> vnode.
I had a similar approach, which uses only one additional call:

	struct vnode *VOP_FINALVP(struct vnode *vp);

When called on a vnode, it returns the real backing object, instead of
a higher level shadow in a stack.  Upper level vnodes do not have
backing store associated with them.

My approach, and the one you have put forward, are both flawed if you
try to move beyond the simple case of a 1:1 correspondence between
stacking layers and underlying objects.  That is, if we have anything
more complex than a page in the final disk image equalling a page in a
process address space, then there is a need for intermediate backing
object(s).

The most obvious case for this would be a compressing stacking layer,
where the backing pages and the process address space pages are
algorithmically related, but not identical.  Similar cases are
metadata stuffing (say you take the first 1k of the file for an
intermediate layer, to enable access control lists, etc.),
cryptographic stacks, and transformational stacks (for example, an NFS
client that transparently maps ISO 8859-1 files into 16 bit Unicode
data).

It seems to me that a hybrid approach is required, with explicit
coherency calls between layers, at least for the non-correspondence
cases, and with something like your approach (or mine) as an
optimization for the simple case.  What this means is putting some of
the pre-unified VM and buffer cache synchronization points back into
the VFS consumer layers: the system call layer and the NFS client
layer.

The simplest approach to resolving this is to provide a pager that
implements VOP_{GET|PUT}PAGES using the read and write primitives;
this would be used in intermediate layers which have their own backing
objects in buffer cache/swap, but no backing object in an on-disk file
system.
> P.S.  Two hours ago Sheldon Hearn told me that Tor Egge and Semen
> Ustimenko worked together on the nullfs problem, but since the
> discussion was private I didn't know anything about it, and I
> probably stepped on their toes with my recent cleanup commit :(

The code which I have seen on this subject works using explicit
coherency synchronization between backing objects.  Unlike the
approach in your patches, there is a duplicate backing object.

It was my understanding that there was a cache coherency issue for
devices that may be mounted after having a null layer stacked on them;
specifically, the devices are vnodes, and have their own vm_object_t
associated with them, and thus their own pages.

From playing around with the patches Tor Egge had provided, I was able
to demonstrate coherency failures in a number of circumstances, and it
was not at all clear to me that msync() and fsync() would operate as
expected.  I was able to cause a number of supposedly "synchronized"
file systems to fail, one catastrophically (doing a shutdown of a
system with a nullfs mounted over /dev, with an FS named /A mounted on
a device visible through the nullfs), when it spammed my root
partition (not the /A partition!).


					Terry Lambert
					terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.


To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-fs" in the body of the message