From owner-freebsd-fs@FreeBSD.ORG Tue Aug 22 09:15:45 2006 Return-Path: X-Original-To: freebsd-fs@freebsd.org Delivered-To: freebsd-fs@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 34EB716A4DA; Tue, 22 Aug 2006 09:15:45 +0000 (UTC) (envelope-from bde@zeta.org.au) Received: from mailout2.pacific.net.au (mailout2.pacific.net.au [61.8.0.115]) by mx1.FreeBSD.org (Postfix) with ESMTP id 4C0C943D73; Tue, 22 Aug 2006 09:15:44 +0000 (GMT) (envelope-from bde@zeta.org.au) Received: from mailproxy1.pacific.net.au (mailproxy1.pacific.net.au [61.8.2.162]) by mailout2.pacific.net.au (Postfix) with ESMTP id C7FDE6EA66; Tue, 22 Aug 2006 19:13:19 +1000 (EST) Received: from katana.zip.com.au (katana.zip.com.au [61.8.7.246]) by mailproxy1.pacific.net.au (8.13.4/8.13.4/Debian-3sarge1) with ESMTP id k7M9DF4K006823; Tue, 22 Aug 2006 19:13:17 +1000 Date: Tue, 22 Aug 2006 19:13:15 +1000 (EST) From: Bruce Evans X-X-Sender: bde@delplex.bde.org To: Tor Egge In-Reply-To: <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> Message-ID: <20060822175540.V58720@delplex.bde.org> References: <20060817170314.GA17490@peter.osted.lan> <20060818164903.GF20768@deviant.kiev.zoral.com.ua> <20060818.202001.74745664.Tor.Egge@cvsup.no.freebsd.org> <20060821.132151.41668008.Tor.Egge@cvsup.no.freebsd.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed Cc: freebsd-fs@freebsd.org, tegge@freebsd.org Subject: Re: Deadlock between nfsd and snapshots. X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Aug 2006 09:15:45 -0000 On Mon, 21 Aug 2006, Tor Egge wrote: > I wrote: > >> The deadlock indicates that one or more of IN_CHANGE, IN_MODIFIED or >> IN_UPDATE was set on the inode, indicating a write operation >> (e.g. VOP_WRITE(), VOP_RENAME(), VOP_CREATE(), VOP_REMOVE(), VOP_LINK(), >> VOP_SYMLINK(), VOP_SETATTR(), VOP_MKDIR(), VOP_RMDIR(), VOP_MKNOD()) that was >> not protected by vn_start_write() or vn_start_secondary_write(). > > The most common "write" operation was probably VOP_GETATTR(). Reading the attributes really is a write operation if it causes marks for update to be turned into updates. ufs_inactive() is also a write operation if it causes this. I think marking for update shouldn't require much locking. > ufs_itimes(), called from ufs_getattr(), might set the IN_MODIFIED inode flag > if IN_ACCESS is set on the inode even if neither IN_CHANGE nor IN_UPDATE is > set, transitioning the inode flags into a state where ufs_inactive() calls the > blocking variant of vn_start_secondary_write(). In the implementation of marking access times for update on exec, we try hard to avoid calling vn_start_write(), etc. Early implementations did call vn_start_write(), etc., and had some bugs from this. The current implementation is mainly for ffs and is rather fragile: VOP_SETATTR() depends on callers calling vfs_start_write, but vfs_mark_atime() calls VOP_SETATTR() without calling vfs_start_write(). The correctness of this depends on the VA_MARK_ATIME case of VOP_SETATTR() not writing any more than VOP_READ() would or should. I think ufs_itimes() shouldn't call vn_start_write() any more than ufs_getattr() should. Callers should be aware that GETATTR may write and thus it seems to be necessary for them to call vn_start_write() unconditionally. ufs_itimes() is a utility function that is called from places other than ufs_getattr(). The other places are ufs_*close() and ufs_setattr(). These don't cause any additional problems. VOP_CLOSE() is called from several places which already seem to have sufficient locking. VOP_GETATTR() is called from many places that don't call vn_start_write(), starting with vn_stat(). The updates to timestamps in in ufs_itimes() and ufs_getattr() are still soft (not even delayed writes, but writes to the vnode that will usually become delayed writes later). Perhaps vn_start_write() can be avoided for them, but since they are logically writes this might be hard to implement correctly. ufs_inactive() has to do a (possibly delayed) physical write to force any updates to disk, so it needs strong locking. BTW, ufs_itimes() has some possibly related kludges involving changing from r/w mounts to r/o mounts. Some of the marks for update aren't handled quite right and are still present after the change. Then they want to be turned into updates on a r/o file system. This is impossible. The problem is handled bogusly, essentially by clearing them without doing the update, in a way that triggers seome of my local debugging code. > calling ufs_itimes() with only a shared vnode lock might cause unsafe accesses > to the inode flags. Setting of IN_ACCESS at the end of ffs_read() and > ffs_extread() might also be unsafe. If DIRECTIO is enabled then O_DIRECT reads > might not even attempt to set the IN_ACCESS flag. I think setting it should be safe -- see above. Not setting the IN_ACCESS flag in the early return for the O_DIRECT case is a different bug. read() is supposed to set IN_ACCESS on sucessful completion and returning early for the O_DIRECT case would defeat this. Bruce