From owner-freebsd-fs@FreeBSD.ORG Thu Sep 27 07:59:48 2007 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 2423216A417 for ; Thu, 27 Sep 2007 07:59:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail04.syd.optusnet.com.au (mail04.syd.optusnet.com.au [211.29.132.185]) by mx1.freebsd.org (Postfix) with ESMTP id C240F13C457 for ; Thu, 27 Sep 2007 07:59:47 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c220-239-235-248.carlnfd3.nsw.optusnet.com.au [220.239.235.248]) by mail04.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id l8R7xhfe003034 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 27 Sep 2007 17:59:45 +1000 Date: Thu, 27 Sep 2007 17:59:43 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org Message-ID: <20070927175933.L770@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: Subject: deaadlock for large writes X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 27 Sep 2007 07:59:48 -0000 I'm getting deadlock with wait message "nbufkv" for "dd if=/dev/zero of=zz bs=1m count=1000" to an msdosfs file system with a block size of 64K. There seems to be nothing except the buf_dirty_count_severe() hack to prevent deadlock for large writes in general. Large writes want to generate a lot of dirty buffers using bdwrite() or cluster_write(). The vnode lock is normally held exclusively throughout VOP_WRITE() calls, so there seems to be no way to complete the delayed writes until VOP_WRITE() returns (since flushbufqueues() needs to hold the vnode lock exlusively to write). Deadlock is handled in some cases by the buf_dirty_count_severe() hack: in just 2 file systems (ffs and msdosfs), in the main loop in VOP_WRITE(), if (vm_page_count_severe() || buf_dirty_count_severe()), then bawrite() is used to avoid creating [m]any more dirty buffers. (I don't like this because it makes the slow case even slower -- when the system gets congested doing writes, we switch to a slower writing method so the congestion will take even longer to clear.) Some file systems use the old pessimization of always writing complete blocks using bawrite(), so they don't need to call buf_dirty_count_severe() but are slow even without it. msdosfs did that until recently when I implemented write clustering in it. buf_dirty_count_severe() just uses the dirty buffer count, so it can only prevent buffer kva resource starvation by accident. The accident apparently doesn't happen with a block size of 64K (I think not just for msdosfs -- this may be why large block sizes for ffs are considered dangerous). When the nbufkv deadlock occurred, the dirty count was 0x5d3 and bufspace is 0x5d30000 -- bufspace consisted entirely of the 0x5d3 dirty buffers each of size 0x10000. hidirtycount was 0x709. Since this is larger than 0x5d3, the buf_dirty_count_severe() hack didn't help, but since it is not much larger, it almost helped. ffs also has a buf_dirty_count_severe() check in ffs_update(). This is missing in msdosfs and of course in all other file systems. This is less important that in the write loop, since at most one dirty buffer can be generated per vnode. But this check probably belongs in bdwrite() itself, so that most file systems don't forget to do it and so that all file systems don't forget to do it before all their bdwrite()'s except the ones in *fs_write() and *fs_update(). (ffs does about 50 bdwrite()'s without checking, mainly for indirect blocks and snapshots.) I think it is safe to blindly turn bdwrite() into bawrite(). ffs_update() actually blindly turns bdwrite() into bwrite() and returns the result, but this seems wrong since it makes the congested case even more congested than with bawrite(), for no advantage (callers shouldn't be checking the result in the !waitfor case that can use bdwrite()). Bruce