From owner-freebsd-fs@FreeBSD.ORG  Tue Apr 25 12:56:15 2006
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
X-Original-To: freebsd-fs@freebsd.org
Delivered-To: freebsd-fs@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 0045B16A402;
	Tue, 25 Apr 2006 12:56:14 +0000 (UTC)
	(envelope-from rwatson@FreeBSD.org)
Received: from cyrus.watson.org (cyrus.watson.org [209.31.154.42])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1437343D7B;
	Tue, 25 Apr 2006 12:56:03 +0000 (GMT)
	(envelope-from rwatson@FreeBSD.org)
Received: from fledge.watson.org (fledge.watson.org [209.31.154.41])
	by cyrus.watson.org (Postfix) with ESMTP id 54D5846B1E;
	Tue, 25 Apr 2006 08:56:02 -0400 (EDT)
Date: Tue, 25 Apr 2006 13:56:02 +0100 (BST)
From: Robert Watson <rwatson@FreeBSD.org>
X-X-Sender: robert@fledge.watson.org
To: Daichi GOTO <daichi@freebsd.org>
In-Reply-To: <444E13BA.8050902@freebsd.org>
Message-ID: <20060425133412.V51337@fledge.watson.org>
References: <E1F5gbI-000Eea-B7@cs1.cs.huji.ac.il>
	<43E5D052.3020207@freebsd.org>
	<43E656C7.8040302@freesbie.org> <43E6D5C8.4050405@freebsd.org>
	<43E71485.5040901@freesbie.org> <43E73330.8070101@freebsd.org>
	<43EB4C00.2030101@freebsd.org> <4417DD8D.3050201@freebsd.org>
	<4433CA53.5050000@freebsd.org> <444E13BA.8050902@freebsd.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: ozawa@ongs.co.jp, freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org,
	freebsd-current@freebsd.org, Alexander@Leidinger.net
Subject: Re: [ANN] unionfs patchset-11 release
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 25 Apr 2006 12:56:15 -0000

On Tue, 25 Apr 2006, Daichi GOTO wrote:

>   Changes in unionfs-p11.diff
>     - Changed a few implementations around the lock/unlock
>       mechanism. Because of this, you can use both the unionfs
>       and the nullfs together without LK_CANRECURSE.
>     - Fixed a bug that sometimes does not unlock if it cannot
>       create shadow file.

First off, thanks again for working on this!

>  If someone knows the details of vnode's lock status via
>  VOP_GETWRITEMOUNT, Please teach us (daichi, ozawa). We
>  want to know the details.

Basically, file systems supporting full file system snapshots (UFS) provide a 
mechanism to "lock out" writers before they enter VFS so that they don't end 
up holding write locks for long periods, leading to deadlock. 
vn_start_write() is called to notify the file system that a thread is about to 
enter the file system for a write, and vn_finished_write() is called to notify 
the file system it is done.  In effect, it's a giant reader-writer lock, which 
allows multiple readers and multiple writers, except during snapshot 
generation, when it blocks new writers until the snapshot is generated.
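
To make the "giant lock" idea concrete, here is a rough user-space sketch of 
such a gate using pthreads.  This is not the kernel code: the struct and the 
gate_*() names are made up for illustration, and having the snapshot side 
wait for already-admitted writers to drain is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for the per-file-system write gate. */
struct write_gate {
	pthread_mutex_t	mtx;
	pthread_cond_t	cv;
	int		writers;	/* writers currently inside the file system */
	bool		suspending;	/* true while a snapshot is being generated */
};

#define	WRITE_GATE_INITIALIZER \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, false }

/*
 * Rough analogue of vn_start_write(): any number of writers may enter,
 * unless a snapshot is in progress.  With nowait set, fail instead of
 * sleeping.
 */
int
gate_start_write(struct write_gate *g, bool nowait)
{
	int error = 0;

	pthread_mutex_lock(&g->mtx);
	while (g->suspending) {
		if (nowait) {
			error = EWOULDBLOCK;
			break;
		}
		pthread_cond_wait(&g->cv, &g->mtx);
	}
	if (error == 0)
		g->writers++;
	pthread_mutex_unlock(&g->mtx);
	return (error);
}

/* Rough analogue of vn_finished_write(): leave, waking a waiting snapshot. */
void
gate_finished_write(struct write_gate *g)
{
	pthread_mutex_lock(&g->mtx);
	assert(g->writers > 0);
	if (--g->writers == 0)
		pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->mtx);
}

/* Snapshot side: refuse new writers, then wait for current ones (assumed). */
void
gate_suspend(struct write_gate *g)
{
	pthread_mutex_lock(&g->mtx);
	g->suspending = true;
	while (g->writers > 0)
		pthread_cond_wait(&g->cv, &g->mtx);
	pthread_mutex_unlock(&g->mtx);
}

/* Snapshot done: let blocked writers in again. */
void
gate_resume(struct write_gate *g)
{
	pthread_mutex_lock(&g->mtx);
	g->suspending = false;
	pthread_cond_broadcast(&g->cv);
	pthread_mutex_unlock(&g->mtx);
}
```

A writer brackets its work with gate_start_write() and gate_finished_write(); 
the snapshot code brackets its work with gate_suspend() and gate_resume().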

In general, you'll notice two sorts of logic around calls to vn_start_write(). 
In the first, vn_start_write() is called once a vnode reference has been 
acquired, and then things continue as normal, with a final 
vn_finished_write() call at the end.  In this situation, vnode locks are 
acquired after the vn_start_write() call, but vnode references are held before 
(since vn_start_write() takes a vnode so that it can find the file system).

The other circumstance is where vnode locks may already be held, in which case 
a non-sleeping acquire is performed, since in effect this is a violation of 
lock order.  If it fails, the vnode lock is released, the reference is 
acquired, and then the whole operation is restarted so that we can try again 
to acquire the vnode lock under circumstances where the file system snapshot 
lock can be safely acquired.  So basically, it has deadlock detection and 
recovery logic.  The V_XSLEEP flag basically says "Sleep until the snapshot 
lock would be available, then return", which loops back so we can re-try the 
acquires.
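
The back-out-and-retry shape can be sketched roughly as below.  This is a 
single-threaded toy model, not kernel code: every name in it is hypothetical, 
and "sleeping until the snapshot lock would be available" is modelled simply 
by the snapshot finishing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names are kernel APIs. */
struct model {
	bool	vnode_locked;		/* stand-in for the vnode lock */
	bool	suspending;		/* snapshot in progress */
	int	writers;		/* writers admitted past the gate */
	int	restarts;		/* times the operation backed out */
	bool	slept_with_lock;	/* set if we "slept" holding the vnode lock */
};

/* Non-sleeping acquire, in the spirit of a V_NOWAIT request. */
int
try_start_write(struct model *m)
{
	if (m->suspending)
		return (EWOULDBLOCK);
	m->writers++;
	return (0);
}

/*
 * In the spirit of V_XSLEEP: wait until the gate would be available,
 * then return without taking it.  The wait is modelled by the snapshot
 * completing (an assumption of this sketch).
 */
void
wait_until_writable(struct model *m)
{
	if (m->vnode_locked)
		m->slept_with_lock = true;
	m->suspending = false;
}

void
finished_write(struct model *m)
{
	assert(m->writers > 0);
	m->writers--;
}

/* A write that arrives with the vnode already locked. */
void
write_with_recovery(struct model *m)
{
	for (;;) {
		m->vnode_locked = true;		/* vnode lock taken first */
		if (try_start_write(m) == 0)
			break;			/* both held, go ahead */
		m->vnode_locked = false;	/* back out: drop the vnode lock */
		wait_until_writable(m);		/* safe to sleep now */
		m->restarts++;			/* restart the whole operation */
	}
	/* ... the actual modification would happen here ... */
	m->vnode_locked = false;
	finished_write(m);
}
```

The point of the model is the ordering: the only place that waits is reached 
with the vnode lock dropped, so the out-of-order acquire never sleeps.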

So according to the above, the file system snapshot lock is *before* the vnode 
locks in the lock order, although in practice we acquire them in either order 
as long as it won't lead to deadlock (and recover when it would).  The logic 
here is a little shaky -- among other things, it looks like the mount point 
could potentially go away during the call to vn_start_write() once the vnode 
is released in the deadlock detection code, but in practice this probably 
never happens.

Notice that the above is all couched in terms of a single file system, not 
stacking.  This is probably because it was all written with UFS and not 
stacking in mind.

Robert N M Watson