Date:      Tue, 5 Sep 2000 18:02:19 +0700 (ALMST)
From:      Boris Popov <bp@butya.kz>
To:        freebsd-fs@freebsd.org
Cc:        dillon@freebsd.org, semenu@freebsd.org, tegge@freebsd.org
Subject:   CFR: nullfs, vm_objects and locks... (patch)
Message-ID:  <Pine.BSF.4.10.10009051705530.79991-100000@lion.butya.kz>

	Hello,

	I've spent the last few days trying to make nullfs really
functional and stable. There are many issues with the current nullfs code,
but below I'll outline the most annoying ones.

	The first is the inability to handle the mmap() operation. This
comes from the VM/vnode_pager design, where each vm_object is associated
with a single vnode and vice versa. Looking at the problem in general, one
may note that stackable filesystems may either have a separate vm_object
per layer or not have one at all. Since nullfs essentially maps its vnodes
to the underlying filesystem, it is reasonable to map all operations to
the underlying vnode.

	The above goal can be reached in at least two ways:

	1. Create a special vm pager (call it null_pager) which maps
all operations to the underlying vm_object.

	2. Give each filesystem control over vm_object handling (e.g.,
creation/destruction and access).

	I've played with both variants, and the latter seems simpler
to implement. To do this we need three additional VOPs:

	VOP_CREATEVOBJECT(struct vnode *vp);
	VOP_DESTROYVOBJECT(struct vnode *vp);
	VOP_GETVOBJECT(struct vnode *vp, struct vm_object **obj);

	The first operation lets a filesystem create its own vm_object,
the second destroys it, and the third returns a reference to the real
vm_object. The rest of the VFS/BIO/VM code should be adapted not to access
the v_object field directly, but to use the VOP_GETVOBJECT() call. This
way each layer can easily achieve cache coherency without any additional
code. For example, nullfs just returns a reference to the underlying
vm_object, and the rest of the code flows as usual. It is important that
this approach gives full coherency no matter which layer modified the
data.
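	To illustrate the idea, here is a minimal userland sketch of the
VOP_GETVOBJECT() delegation. The struct layouts and function names are
simplified stand-ins, not the real kernel definitions: a leaf filesystem
hands back its own vm_object, while the nullfs layer just forwards the
request to the lower vnode, so both layers end up sharing one object:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures. */
struct vm_object { int ref_count; };

struct vnode {
	struct vm_object *v_object;  /* backing object, if this layer owns one */
	struct vnode     *v_lower;   /* lower vnode for a stacked layer, else NULL */
};

/* Leaf filesystem (e.g. UFS): return the vnode's own vm_object. */
static int
ufs_getvobject(struct vnode *vp, struct vm_object **objp)
{
	*objp = vp->v_object;
	return (*objp != NULL) ? 0 : -1;
}

/* nullfs: delegate the request to the underlying vnode. */
static int
null_getvobject(struct vnode *vp, struct vm_object **objp)
{
	return ufs_getvobject(vp->v_lower, objp);
}
```

Because every consumer goes through the VOP instead of touching v_object
directly, a page cached through either layer is the same page, which is
where the coherency falls out for free.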

	The second big issue for vnode-based stacks of filesystems is
vnode state synchronization and the tightly related problem of vnode
locking. This is a really weird problem, but the NetBSD guys seem to have
solved the locking part of it, so I'll just explain their approach,
because it works and works well.

	At the moment the vnode locking mechanism doesn't look very good
due to the lack of unification. nullfs requires strict synchronization of
lock state across all layers, so each underlying filesystem should provide
access to its vnode lock structure and be able to share it. Currently,
each filesystem can either allocate its own lock structure or let VFS
allocate one, which makes vnode.v_vnlock handling inconsistent.

	The simple solution is to embed the lock structure in the vnode
and make vnode.v_vnlock initially point to it. Any filesystem above can
pick up the pointer and assign it to the v_vnlock field in its own vnode
structure. If the underlying filesystem doesn't provide a lock structure,
then bad times are coming and each layer has to maintain its own lock
state.
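	A minimal sketch of that arrangement (again with simplified
stand-in structures, not the real kernel ones): every vnode embeds a lock,
v_vnlock defaults to the embedded lock, and a stacking layer re-points its
own v_vnlock at the lower vnode's lock, so locking through any layer locks
the whole stack:

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct lock. */
struct lock { int lk_held; };

struct vnode {
	struct lock  v_lock;    /* lock embedded in the vnode itself */
	struct lock *v_vnlock;  /* lock actually used; may alias a lower layer's */
};

static void
vnode_init(struct vnode *vp)
{
	vp->v_lock.lk_held = 0;
	vp->v_vnlock = &vp->v_lock;   /* default: use the embedded lock */
}

/* A stacking layer (e.g. nullfs) picks up the lower vnode's lock. */
static void
null_share_lock(struct vnode *upper, struct vnode *lower)
{
	upper->v_vnlock = lower->v_vnlock;
}

static void vn_lock(struct vnode *vp)   { vp->v_vnlock->lk_held = 1; }
static void vn_unlock(struct vnode *vp) { vp->v_vnlock->lk_held = 0; }
```

With the pointer shared, locking the upper vnode marks the lower vnode's
embedded lock held as well, which is exactly the cross-layer
synchronization nullfs needs.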

	As I said before, proper locking of vnodes is also tightly related
and requires VFS changes to provide the necessary information. This
information is most important in the VOP_LOOKUP() functions, because that
is where all sorts of deadlocks can occur. It is conveyed via a new flag
(PDIRUNLOCK) in the namei structure, which should be carefully maintained
by the underlying filesystem when it locks or unlocks the parent vnode.
I've adapted both the VFS and UFS code to handle it, following how NetBSD
did this.
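	The convention can be sketched like this (a hypothetical userland
model; the structures and flag value are simplified stand-ins): when a
lookup routine has to drop the parent directory's lock itself, e.g. to
avoid a deadlock, it sets PDIRUNLOCK so the caller knows not to unlock
the parent a second time:

```c
#include <assert.h>

#define PDIRUNLOCK 0x1   /* stand-in flag: "parent dir already unlocked" */

struct vnode { int locked; };

struct componentname { int cn_flags; };

/* A lookup path that must unlock the parent itself (e.g. for ".."). */
static int
fs_lookup_dotdot(struct vnode *dvp, struct componentname *cnp)
{
	dvp->locked = 0;              /* drop parent lock to avoid deadlock */
	cnp->cn_flags |= PDIRUNLOCK;  /* tell the caller we did so */
	return 0;
}

/* Caller: unlock the parent only if the lookup left it locked. */
static void
do_lookup(struct vnode *dvp, struct componentname *cnp)
{
	fs_lookup_dotdot(dvp, cnp);
	if ((cnp->cn_flags & PDIRUNLOCK) == 0)
		dvp->locked = 0;      /* safe: lookup kept the parent locked */
}
```

The careful maintenance matters because a caller that unlocks an
already-unlocked parent, or leaves it locked twice, is exactly how the
lookup deadlocks and panics arise.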

	Ok, I think that's enough for an introduction; here is a link to
the patch against recent -CURRENT:

	http://www.butya.kz/~bp/n9sys.diff

	With this patch I was able to do a 'make -j4 world' on top of
nullfs mounts, as well as some mmap()-related tests.

	Comments are welcome.

P.S. Two hours ago Sheldon Hearn told me that Tor Egge and Semen Ustimenko
have been working together on the nullfs problem, but since the discussion
was private I didn't know anything about it and probably stepped on their
toes with my recent cleanup commit :(
-- 
Boris Popov http://www.butya.kz/~bp/


