From owner-freebsd-arch@FreeBSD.ORG  Mon Mar 14 08:00:24 2005
Date: Mon, 14 Mar 2005 03:00:22 -0500 (EST)
From: Jeff Roberson <jroberson@chesapeake.net>
To: arch@freebsd.org
Message-ID: <20050314024439.G20708@mail.chesapeake.net>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Subject: filesystem suspension.

The current filesystem suspension mechanism suffers from a few aesthetic
and functional problems.  I've been talking with Kirk about ways we could
replace it, and I'd like to propose a few of those ideas here to see if
anyone has useful criticism.  First, I'll briefly outline the problems.

There is the obvious problem of the rather cumbersome and error-prone
addition of vn_start_write calls wherever you may write to the filesystem.
I keep finding places where they were not added when new code came in, or
were originally lacking.  It's just yet another call you have to remember
to make when dealing with vfs.

Furthermore, there is a real problem with vput(), which may cause
VOP_INACTIVE to be called, which may truncate the file.  Today we handle
this by calling vn_start_write from within VOP_INACTIVE after we already
hold the vnode lock.  This is actually a lock order reversal, as the file
system suspension acts as a real lock.  rwatson has reported seeing this
deadlock on a real system.  To fix it, we could do the INACTIVE from
another thread which can call vn_start_write before relocking the vnode,
but this would serialize all file deletions!  I considered other
mechanisms for this as well, but they all have similar problems.
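
To make the reversal concrete, here is a toy, single-threaded sketch of a
lock order check.  It is only an illustration: the ranks, struct
order_thread and order_acquire() are hypothetical names, not kernel code,
and it assumes the normal order is "suspension gate first, vnode lock
second".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ranks: lower ranks must be taken before higher ones. */
enum lock_rank {
	RANK_NONE = 0,
	RANK_SUSPEND_GATE = 1,	/* stands in for vn_start_write */
	RANK_VNODE = 2		/* stands in for the vnode lock */
};

/* Hypothetical per-thread record of the highest rank currently held. */
struct order_thread {
	enum lock_rank highest_held;
};

/*
 * Returns true if taking 'rank' respects the order, false if it would
 * be a reversal (taking a lower rank while holding a higher one).
 */
bool
order_acquire(struct order_thread *td, enum lock_rank rank)
{
	assert(rank != RANK_NONE);
	if (rank < td->highest_held)
		return (false);
	td->highest_held = rank;
	return (true);
}
```

In this model a normal write path takes RANK_SUSPEND_GATE and then
RANK_VNODE and passes, while the vput() path takes RANK_VNODE first and
is flagged when it then asks for RANK_SUSPEND_GATE.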

I have two basic proposals.  One is to handle all suspension from within
ffs's VOP_LOCK routine, the other is to handle all suspension from within
every vop that may write.

The ffs_lock method would move the suspension barrier into the ffs_lock
routine.  A thread would not be suspended if it already held a lockmgr
lock, and in this way it would be allowed to continue without leaving any
data structures in an inconsistent state.  The suspension would proceed
once there were no outstanding ufs locks and all new callers would block
in ffs_lock.  This requires the least effort as virtually all of the code
would be in ffs_lock and unlock.  It would however prevent threads from
issuing read-only calls for the duration of the suspension.
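
A rough sketch of the barrier rule described above, as a single-threaded
model rather than real kernel code.  Every name here (struct susp_mount,
struct susp_thread, model_lock() and friends) is made up for
illustration, and a "false" return stands in for sleeping at the barrier.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-mount state; not the real struct mount. */
struct susp_mount {
	bool suspending;	/* a suspension has been requested */
	int  locks_held;	/* outstanding locks across all threads */
};

/* Hypothetical per-thread state. */
struct susp_thread {
	int locks_held;		/* locks this thread holds on the mount */
};

/* Returns true if the lock is granted, false if the caller would sleep. */
bool
model_lock(struct susp_mount *mp, struct susp_thread *td)
{
	/* Only threads that hold no locks are stopped at the barrier. */
	if (mp->suspending && td->locks_held == 0)
		return (false);
	td->locks_held++;
	mp->locks_held++;
	return (true);
}

void
model_unlock(struct susp_mount *mp, struct susp_thread *td)
{
	assert(td->locks_held > 0);
	td->locks_held--;
	mp->locks_held--;
}

/* Request suspension; true once no locks remain outstanding. */
bool
model_suspend(struct susp_mount *mp)
{
	mp->suspending = true;
	return (mp->locks_held == 0);
}

void
model_resume(struct susp_mount *mp)
{
	mp->suspending = false;
}
```

Note that the model stops readers and writers alike, which is the cost
mentioned above: the barrier cannot tell a read-only caller from a
writer at lock time.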

My second proposal involves gating threads within the actual writing
VOPs.  This would be similar to the vn_start_write mechanism, but it would
be contained entirely within ffs/ufs.  The big difference would be that
some threads would be suspended while holding locks, so the snapshot would
have to either run lockless, which could be done safely, or use a special
locking protocol, such as allowing it to recursively acquire locks that
are already held.  This would allow most read-only VOPs to continue,
unless they attempted to lock a vnode which was suspended in a writing vop.
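
Again as a hedged, single-threaded model only: the sketch below shows a
writer being gated while it holds a vnode lock, and a reader stalling
only if it wants that same vnode.  The types and functions (struct
gate_mount, struct gate_vnode, gate_write_vop(), gate_read_vop(),
gate_resume()) are hypothetical and do not exist in ffs/ufs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum gate_result { GATE_PROCEED, GATE_WOULD_SLEEP };

/* Hypothetical per-mount state. */
struct gate_mount {
	bool suspended;
};

/* Hypothetical vnode: records a writer parked at the gate with its lock. */
struct gate_vnode {
	bool held_by_gated_writer;
};

/* Entry to a writing VOP; the caller already holds vp's lock. */
enum gate_result
gate_write_vop(const struct gate_mount *mp, struct gate_vnode *vp)
{
	if (mp->suspended) {
		/* The writer sleeps here and keeps the vnode locked. */
		vp->held_by_gated_writer = true;
		return (GATE_WOULD_SLEEP);
	}
	return (GATE_PROCEED);
}

/* A read-only VOP is never gated, but it still needs the vnode lock. */
enum gate_result
gate_read_vop(const struct gate_vnode *vp)
{
	return (vp->held_by_gated_writer ? GATE_WOULD_SLEEP : GATE_PROCEED);
}

/* End the suspension and let the parked writers continue. */
void
gate_resume(struct gate_mount *mp, struct gate_vnode *vps, size_t n)
{
	assert(vps != NULL || n == 0);
	mp->suspended = false;
	for (size_t i = 0; i < n; i++)
		vps[i].held_by_gated_writer = false;
}
```

The snapshot code is deliberately left out of the model; it is the part
that would need to run lockless or recurse on the locks that parked
writers still hold.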

Comments?  Other proposals?  I'd like to get this sorted out for 6.0.  I
may come up with some interim solution for RELENG_5 because the vrele
problem has caused deadlocks there.

Thanks,
Jeff