From owner-freebsd-fs@FreeBSD.ORG  Thu Sep 27 07:59:48 2007
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 2423216A417
	for <fs@freebsd.org>; Thu, 27 Sep 2007 07:59:48 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au
	[211.29.132.185])
	by mx1.freebsd.org (Postfix) with ESMTP id C240F13C457
	for <fs@freebsd.org>; Thu, 27 Sep 2007 07:59:47 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au
	[220.239.235.248])
	by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	l8R7xhfe003034
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO)
	for <fs@freebsd.org>; Thu, 27 Sep 2007 17:59:45 +1000
Date: Thu, 27 Sep 2007 17:59:43 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: fs@freebsd.org
Message-ID: <20070927175933.L770@besplex.bde.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: 
Subject: deaadlock for large writes
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 27 Sep 2007 07:59:48 -0000

I'm getting deadlock with wait message "nbufkv" for
"dd if=/dev/zero of=zz bs=1m count=1000" to an msdosfs file system with
a block size of 64K.

There seems to be nothing except the buf_dirty_count_severe() hack to
prevent deadlock for large writes in general.  Large writes want to
generate a lot of dirty buffers using bdwrite() or cluster_write().
The vnode lock is normally held exclusively throughout VOP_WRITE() calls,
so there seems to be no way to complete the delayed writes until
VOP_WRITE() returns (since flushbufqueues() needs to hold the vnode
lock exlusively to write).

Deadlock is handled in some cases by the buf_dirty_count_severe() hack:
in just 2 file systems (ffs and msdosfs), in the main loop in VOP_WRITE(),
if (vm_page_count_severe() || buf_dirty_count_severe()), then bawrite()
is used to avoid creating [m]any more dirty buffers.  (I don't like
this because it makes the slow case even slower -- when the system
gets congested doing writes, we switch to a slower writing method so
the congestion will take even longer to clear.)  Some file systems use
the old pessimization of always writing complete blocks using bawrite(),
so they don't need to call buf_dirty_count_severe() but are slow even
without it.  msdosfs did that until recently when I implemented write
clustering in it.

buf_dirty_count_severe() just uses the dirty buffer count, so it can
only prevent buffer kva resource starvation by accident.  The accident
apparently doesn't happen with a block size of 64K (I think not just
for msdosfs -- this may be why large block sizes for ffs are considered
dangerous).  When the nbufkv deadlock occurred, the dirty count was
0x5d3 and bufspace is 0x5d30000 -- bufspace consisted entirely of the
0x5d3 dirty buffers each of size 0x10000.  hidirtycount was 0x709.
Since this is larger than 0x5d3, the buf_dirty_count_severe() hack
didn't help, but since it is not much larger, it almost helped.

ffs also has a buf_dirty_count_severe() check in ffs_update().  This is
missing in msdosfs and of course in all other file systems.  This is
less important that in the write loop, since at most one dirty buffer
can be generated per vnode.  But this check probably belongs in bdwrite()
itself, so that most file systems don't forget to do it and so that 
all file systems don't forget to do it before all their bdwrite()'s
except the ones in *fs_write() and *fs_update().  (ffs does about 50
bdwrite()'s without checking, mainly for indirect blocks and snapshots.)
I think it is safe to blindly turn bdwrite() into bawrite().  ffs_update()
actually blindly turns bdwrite() into bwrite() and returns the result,
but this seems wrong since it makes the congested case even more congested
than with bawrite(), for no advantage (callers shouldn't be checking the
result in the !waitfor case that can use bdwrite()).

Bruce