Date: Thu, 27 Sep 2007 17:59:43 +1000 (EST) From: Bruce Evans <brde@optusnet.com.au> To: fs@freebsd.org Subject: deaadlock for large writes Message-ID: <20070927175933.L770@besplex.bde.org>
next in thread | raw e-mail | index | archive | help
I'm getting deadlock with wait message "nbufkv" for "dd if=/dev/zero of=zz bs=1m count=1000" to an msdosfs file system with a block size of 64K. There seems to be nothing except the buf_dirty_count_severe() hack to prevent deadlock for large writes in general. Large writes want to generate a lot of dirty buffers using bdwrite() or cluster_write(). The vnode lock is normally held exclusively throughout VOP_WRITE() calls, so there seems to be no way to complete the delayed writes until VOP_WRITE() returns (since flushbufqueues() needs to hold the vnode lock exlusively to write). Deadlock is handled in some cases by the buf_dirty_count_severe() hack: in just 2 file systems (ffs and msdosfs), in the main loop in VOP_WRITE(), if (vm_page_count_severe() || buf_dirty_count_severe()), then bawrite() is used to avoid creating [m]any more dirty buffers. (I don't like this because it makes the slow case even slower -- when the system gets congested doing writes, we switch to a slower writing method so the congestion will take even longer to clear.) Some file systems use the old pessimization of always writing complete blocks using bawrite(), so they don't need to call buf_dirty_count_severe() but are slow even without it. msdosfs did that until recently when I implemented write clustering in it. buf_dirty_count_severe() just uses the dirty buffer count, so it can only prevent buffer kva resource starvation by accident. The accident apparently doesn't happen with a block size of 64K (I think not just for msdosfs -- this may be why large block sizes for ffs are considered dangerous). When the nbufkv deadlock occurred, the dirty count was 0x5d3 and bufspace is 0x5d30000 -- bufspace consisted entirely of the 0x5d3 dirty buffers each of size 0x10000. hidirtycount was 0x709. Since this is larger than 0x5d3, the buf_dirty_count_severe() hack didn't help, but since it is not much larger, it almost helped. ffs also has a buf_dirty_count_severe() check in ffs_update(). This is missing in msdosfs and of course in all other file systems. This is less important that in the write loop, since at most one dirty buffer can be generated per vnode. But this check probably belongs in bdwrite() itself, so that most file systems don't forget to do it and so that all file systems don't forget to do it before all their bdwrite()'s except the ones in *fs_write() and *fs_update(). (ffs does about 50 bdwrite()'s without checking, mainly for indirect blocks and snapshots.) I think it is safe to blindly turn bdwrite() into bawrite(). ffs_update() actually blindly turns bdwrite() into bwrite() and returns the result, but this seems wrong since it makes the congested case even more congested than with bawrite(), for no advantage (callers shouldn't be checking the result in the !waitfor case that can use bdwrite()). Bruce
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20070927175933.L770>