From: Don Lewis <truckman@FreeBSD.org>
Date: Mon, 2 Jan 2006 14:48:10 -0800 (PST)
To: gcr+freebsd-stable@tharned.org
Cc: freebsd@McKusick.COM, freebsd-stable@FreeBSD.org, kris@obsecurity.org
Subject: Re: Recurring problem: processes block accessing UFS file system
Message-Id: <200601022248.k02MmAPe005402@gw.catspoiler.org>
In-Reply-To: <20051122211507.P32523@nc8000.tharned.org>

On 22 Nov, Greg Rivers wrote:
> On Mon, 21 Nov 2005, Kris Kennaway wrote:
>
>> It may not be the same problem.  You should also try to obtain a trace
>> when snapshots are not implicated.
>>
>
> Agreed.  I'll do so at the first opportunity.
>
>> 'show lockedvnods' is very important for diagnosing filesystem
>> deadlocks, and 'alltrace' is the easiest way to obtain stack traces
>> for the resulting processes that are holding locks.  You'll want to
>> 'set $lines=0' and capture the output from the serial console to a
>> file.
>>
>
> Thanks Kris, these are exactly the clues I needed.  Since the deadlock
> during a snapshot is fairly easy to reproduce, I did so and collected
> this information below.  "alltrace" didn't work as I expected (it
> didn't produce a trace), so I traced each pid associated with a locked
> vnode separately.

There appears to be a lock order reversal between a vnode lock and the
hand-rolled suspend lock.  Basically, it is not permissible to wait for
the suspend lock while holding a vnode lock.  In cases where a thread
holds a vnode lock and then wants to start a file system write, it has
to make a non-blocking attempt to grab the suspend lock and then
release the vnode lock if the suspend lock can't be obtained.  For
example, in vn_open_cred():

	if (vn_start_write(ndp->ni_dvp, &mp, V_NOWAIT) != 0) {
		NDFREE(ndp, NDF_ONLY_PNBUF);
		vput(ndp->ni_dvp);
		VFS_UNLOCK_GIANT(vfslocked);
		if ((error = vn_start_write(NULL, &mp,
		    V_XSLEEP | PCATCH)) != 0)
			return (error);
		goto restart;
	}

The problem is that vput() may call vinactive(), which calls
ufs_inactive(), which may attempt to do a file system write and block
on the snapshot lock.
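
To make the back-off pattern explicit, here is the same rule written as
a standalone sketch rather than the real vn_open_cred() code.  The
function and its sketch_write_locked() helper are hypothetical, and the
vn_lock()/VOP_UNLOCK() signatures assumed here are the 6.x-era ones:

/*
 * Sketch only, not code from the tree: try the suspend lock with
 * V_NOWAIT while the vnode lock is held, and drop the vnode lock
 * before sleeping for the suspension to end.
 */
static int
sketch_start_write(struct vnode *vp, struct thread *td)
{
	struct mount *mp;
	int error;

restart:
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
	if (vn_start_write(vp, &mp, V_NOWAIT) != 0) {
		/*
		 * Don't sleep on the suspend lock here; that would be
		 * the reversal described above.  Drop the vnode lock
		 * first, then wait for the suspension to end and retry.
		 */
		VOP_UNLOCK(vp, 0, td);
		error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH);
		if (error != 0)
			return (error);
		goto restart;
	}
	error = sketch_write_locked(vp);	/* hypothetical write */
	vn_finished_write(mp);
	VOP_UNLOCK(vp, 0, td);
	return (error);
}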
db> trace 99361
Tracing pid 99361 tid 100437 td 0xce67cd80
sched_switch(ce67cd80,0,1,3d4e53a2,95c40548) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c86d706c,ebb00908) at mi_switch+0x1d5
sleepq_switch(c86d706c,ebb0093c,c04bb9ce,c86d706c,9f) at sleepq_switch+0x16f
sleepq_wait(c86d706c,9f,c061a026,0,c0647200) at sleepq_wait+0x11
msleep(c86d706c,c86d7044,29f,c061a026,0,c0653720,c8dc1aa0,ebb00978,ce67cd80) at msleep+0x3d7
vn_write_suspend_wait(c8dc1aa0,c86d7000,1,0,ca07f817) at vn_write_suspend_wait+0x181
ufs_inactive(ebb009c8,ebb009dc,c051b330,c0646cc0,ebb009c8) at ufs_inactive+0x1b4
VOP_INACTIVE_APV(c0646cc0,ebb009c8,a03,ebb009d0,c051054f) at VOP_INACTIVE_APV+0x3a
vinactive(c8dc1aa0,ce67cd80,ffffffdf,ebb00a24,c0513592) at vinactive+0x82
vput(c8dc1aa0,ffffffdf,2,ebb00a50,0) at vput+0x187
vn_open_cred(ebb00bc0,ebb00cc0,180,c9430580,5) at vn_open_cred+0xfb
vn_open(ebb00bc0,ebb00cc0,180,5,4b5fcfad) at vn_open+0x33
kern_open(ce67cd80,81414e0,0,a03,180) at kern_open+0xca
open(ce67cd80,ebb00d04,c,ce67cd80,8155000) at open+0x36
syscall(3b,3b,3b,0,a02) at syscall+0x324

99361 ce682624   100   673   673 0000100 [SLPQ suspfs 0xc86d706c][SLP] sendmail

If a thread gets stuck waiting for the suspend lock while holding a
vnode lock, then it is possible for the thread that is creating the
snapshot to get stuck when it attempts to lock that same vnode.

98639 c8e91418 11008 98637 97958 0004102 [SLPQ ufs 0xc8991d18][SLP] mksnap_ffs

db> trace 98639
Tracing pid 98639 tid 100282 td 0xc8e8e300
sched_switch(c8e8e300,0,1,cc167362,733b6597) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c8991d18,eb80c544) at mi_switch+0x1d5
sleepq_switch(c8991d18,eb80c578,c04bb9ce,c8991d18,50) at sleepq_switch+0x16f
sleepq_wait(c8991d18,50,c06186b3,0,c065ae80) at sleepq_wait+0x11
msleep(c8991d18,c065888c,50,c06186b3,0) at msleep+0x3d7
acquire(eb80c5e0,40,60000,c8e8e300,0) at acquire+0x89
lockmgr(c8991d18,2002,c8991d3c,c8e8e300,eb80c608) at lockmgr+0x45f
vop_stdlock(eb80c660,0,c0646cc0,eb80c660,eb80c618) at vop_stdlock+0x2f
VOP_LOCK_APV(c0647200,eb80c660,eb80c630,c05f9f43,eb80c660) at VOP_LOCK_APV+0x44
ffs_lock(eb80c660,0,2002,c8991cc0,eb80c67c) at ffs_lock+0x19
VOP_LOCK_APV(c0646cc0,eb80c660,c06402e0,eb80c7d0,0) at VOP_LOCK_APV+0x44
vn_lock(c8991cc0,2002,c8e8e300,4000,c851f300) at vn_lock+0x132
ffs_snapshot(c86d7000,cc978e00,eb80c9a4,6c,eb80c964) at ffs_snapshot+0x152b
ffs_mount(c86d7000,c8e8e300,0,c8e8e300,c9f71bb0) at ffs_mount+0x9af
vfs_domount(c8e8e300,c88a8760,c88a8870,11211000,c9a77510) at vfs_domount+0x728
vfs_donmount(c8e8e300,11211000,eb80cbec,c9314580,e) at vfs_donmount+0x12e
kernel_mount(c8737b80,11211000,eb80cc30,6c,bfbfe8c8) at kernel_mount+0x46
ffs_cmount(c8737b80,bfbfe110,11211000,c8e8e300,0) at ffs_cmount+0x85
mount(c8e8e300,eb80cd04,10,c8e8e300,8052000) at mount+0x21b
syscall(3b,3b,3b,8814819c,bfbfe0b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f

This block of code in ufs_inactive() is triggering the problem:

	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
		if ((ip->i_flag & (IN_CHANGE | IN_UPDATE | IN_MODIFIED)) == 0 &&
		    vn_write_suspend_wait(vp, NULL, V_NOWAIT)) {
			ip->i_flag &= ~IN_ACCESS;
		} else {
			(void) vn_write_suspend_wait(vp, NULL, V_WAIT);
			UFS_UPDATE(vp, 0);
		}
	}

I think the intent is to defer access time updates while writes are
suspended, but wait for writes to be allowed before syncing the other
timestamps.  Hmm, I wonder why any of IN_CHANGE, IN_UPDATE, or
IN_MODIFIED is set at this point.
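
To spell out how I read that block (the code is unchanged, the comments
below are mine, added for this mail):

	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
		if ((ip->i_flag & (IN_CHANGE | IN_UPDATE | IN_MODIFIED)) == 0 &&
		    vn_write_suspend_wait(vp, NULL, V_NOWAIT)) {
			/*
			 * Only the access time is dirty and the file
			 * system is suspended: drop IN_ACCESS and skip
			 * the update, i.e. defer the atime update.
			 */
			ip->i_flag &= ~IN_ACCESS;
		} else {
			/*
			 * Something besides the access time is dirty
			 * (or writes are not suspended): sleep, with
			 * the vnode lock still held, until writes are
			 * allowed again, then sync the inode.  This
			 * V_WAIT is where sendmail (pid 99361) is
			 * sleeping on "suspfs" in the trace above.
			 */
			(void) vn_write_suspend_wait(vp, NULL, V_WAIT);
			UFS_UPDATE(vp, 0);
		}
	}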
They should have been cleared when the file system was synced before
MNTK_SUSPENDED was set.  I suspect it is all the handle_workitem_foo()
calls done by softdep_flushworklist() that set IN_CHANGE.  I don't see
anything that flushes those changes to disk other than maybe the syncer
thread ...

Ugh!  Another bug that I just noticed is that ffs_snapshot() clears
MNTK_SUSPENDED before entering its MNT_VNODE_FOREACH() loop, which
might re-enable access time updates by other threads.  Probably some
other flag should be set to allow this thread to bypass the "writes are
forbidden while MNTK_SUSPENDED is set" sanity check.
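
Just to make that last idea concrete, one shape it could take is
sketched below.  None of this exists in the tree: the TDP_SNAPWRITE
flag is made up and the exact place where the sanity check lives is
left vague; it only illustrates what "some other flag" might look like.

	/*
	 * Hypothetical sketch: leave MNTK_SUSPENDED set for the whole
	 * snapshot and mark the snapshot thread itself as allowed to
	 * write, instead of clearing MNTK_SUSPENDED before the
	 * MNT_VNODE_FOREACH() loop.
	 */

	/* In ffs_snapshot(), bracketing the MNT_VNODE_FOREACH() loop: */
	curthread->td_pflags |= TDP_SNAPWRITE;	/* hypothetical flag */
	/* MNT_VNODE_FOREACH() loop runs with MNTK_SUSPENDED still set */
	curthread->td_pflags &= ~TDP_SNAPWRITE;

	/* In the writes-forbidden-while-suspended sanity check, exempt
	   the marked thread: */
	if ((mp->mnt_kern_flag & MNTK_SUSPENDED) != 0 &&
	    (curthread->td_pflags & TDP_SNAPWRITE) == 0)
		return (EWOULDBLOCK);	/* or whatever the check does today */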