From: Don Lewis <truckman@FreeBSD.org>
Date: Mon, 2 Jan 2006 14:48:10 -0800 (PST)
To: gcr+freebsd-stable@tharned.org
Cc: freebsd@McKusick.COM, freebsd-stable@FreeBSD.org, kris@obsecurity.org
Subject: Re: Recurring problem: processes block accessing UFS file system
Message-Id: <200601022248.k02MmAPe005402@gw.catspoiler.org>
In-Reply-To: <20051122211507.P32523@nc8000.tharned.org>

On 22 Nov, Greg Rivers wrote:
> On Mon, 21 Nov 2005, Kris Kennaway wrote:
>
>> It may not be the same problem.  You should also try to obtain a trace
>> when snapshots are not implicated.
>>
>
> Agreed.  I'll do so at the first opportunity.
>
>> 'show lockedvnods' is very important for diagnosing filesystem
>> deadlocks, and 'alltrace' is the easiest way to obtain stack traces
>> for the resulting processes that are holding locks.  You'll want to
>> 'set $lines=0' and capture the output from the serial console to a
>> file.
>>
>
> Thanks Kris, these are exactly the clues I needed.  Since the deadlock
> during a snapshot is fairly easy to reproduce, I did so and collected
> this information below.  "alltrace" didn't work as I expected (it
> didn't produce a trace), so I traced each pid associated with a locked
> vnode separately.

There appears to be a lock order reversal between a vnode lock and the
hand-rolled suspend lock.  Basically, it is not permissible to wait for
the suspend lock while holding a vnode lock.  In cases where a thread
holds a vnode lock and then wants to start a file system write, it has
to make a non-blocking attempt to grab the suspend lock and then
release the vnode lock if the suspend lock can't be obtained.  For
example, in vn_open_cred():

	if (vn_start_write(ndp->ni_dvp, &mp, V_NOWAIT) != 0) {
		NDFREE(ndp, NDF_ONLY_PNBUF);
		vput(ndp->ni_dvp);
		VFS_UNLOCK_GIANT(vfslocked);
		if ((error = vn_start_write(NULL, &mp,
		    V_XSLEEP | PCATCH)) != 0)
			return (error);
		goto restart;
	}

The problem is that vput() may call vinactive(), which calls
ufs_inactive(), which may attempt to do a file system write and block
on the snapshot lock.
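
To make the back-off pattern explicit, here is the same rule written as
a standalone sketch rather than the real vn_open_cred() code.  The
function and its sketch_write_locked() helper are hypothetical, and the
vn_lock()/VOP_UNLOCK() signatures assumed here are the 6.x-era ones:

/*
 * Sketch only, not code from the tree: try the suspend lock with
 * V_NOWAIT while the vnode lock is held, and drop the vnode lock
 * before sleeping for the suspension to end.
 */
static int
sketch_start_write(struct vnode *vp, struct thread *td)
{
	struct mount *mp;
	int error;

restart:
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY, td);
	if (vn_start_write(vp, &mp, V_NOWAIT) != 0) {
		/*
		 * Don't sleep on the suspend lock here; that would be
		 * the reversal described above.  Drop the vnode lock
		 * first, then wait for the suspension to end and retry.
		 */
		VOP_UNLOCK(vp, 0, td);
		error = vn_start_write(NULL, &mp, V_XSLEEP | PCATCH);
		if (error != 0)
			return (error);
		goto restart;
	}
	error = sketch_write_locked(vp);	/* hypothetical write */
	vn_finished_write(mp);
	VOP_UNLOCK(vp, 0, td);
	return (error);
}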
db> trace 99361
Tracing pid 99361 tid 100437 td 0xce67cd80
sched_switch(ce67cd80,0,1,3d4e53a2,95c40548) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c86d706c,ebb00908) at mi_switch+0x1d5
sleepq_switch(c86d706c,ebb0093c,c04bb9ce,c86d706c,9f) at sleepq_switch+0x16f
sleepq_wait(c86d706c,9f,c061a026,0,c0647200) at sleepq_wait+0x11
msleep(c86d706c,c86d7044,29f,c061a026,0,c0653720,c8dc1aa0,ebb00978,ce67cd80) at msleep+0x3d7
vn_write_suspend_wait(c8dc1aa0,c86d7000,1,0,ca07f817) at vn_write_suspend_wait+0x181
ufs_inactive(ebb009c8,ebb009dc,c051b330,c0646cc0,ebb009c8) at ufs_inactive+0x1b4
VOP_INACTIVE_APV(c0646cc0,ebb009c8,a03,ebb009d0,c051054f) at VOP_INACTIVE_APV+0x3a
vinactive(c8dc1aa0,ce67cd80,ffffffdf,ebb00a24,c0513592) at vinactive+0x82
vput(c8dc1aa0,ffffffdf,2,ebb00a50,0) at vput+0x187
vn_open_cred(ebb00bc0,ebb00cc0,180,c9430580,5) at vn_open_cred+0xfb
vn_open(ebb00bc0,ebb00cc0,180,5,4b5fcfad) at vn_open+0x33
kern_open(ce67cd80,81414e0,0,a03,180) at kern_open+0xca
open(ce67cd80,ebb00d04,c,ce67cd80,8155000) at open+0x36
syscall(3b,3b,3b,0,a02) at syscall+0x324

99361 ce682624   100   673   673 0000100 [SLPQ suspfs 0xc86d706c][SLP] sendmail

If a thread gets stuck waiting for the suspend lock while holding a
vnode lock, then it is possible for the thread that is creating the
snapshot to get stuck when it attempts to lock that same vnode.

98639 c8e91418 11008 98637 97958 0004102 [SLPQ ufs 0xc8991d18][SLP] mksnap_ffs

db> trace 98639
Tracing pid 98639 tid 100282 td 0xc8e8e300
sched_switch(c8e8e300,0,1,cc167362,733b6597) at sched_switch+0x158
mi_switch(1,0,c04d7b33,c8991d18,eb80c544) at mi_switch+0x1d5
sleepq_switch(c8991d18,eb80c578,c04bb9ce,c8991d18,50) at sleepq_switch+0x16f
sleepq_wait(c8991d18,50,c06186b3,0,c065ae80) at sleepq_wait+0x11
msleep(c8991d18,c065888c,50,c06186b3,0) at msleep+0x3d7
acquire(eb80c5e0,40,60000,c8e8e300,0) at acquire+0x89
lockmgr(c8991d18,2002,c8991d3c,c8e8e300,eb80c608) at lockmgr+0x45f
vop_stdlock(eb80c660,0,c0646cc0,eb80c660,eb80c618) at vop_stdlock+0x2f
VOP_LOCK_APV(c0647200,eb80c660,eb80c630,c05f9f43,eb80c660) at VOP_LOCK_APV+0x44
ffs_lock(eb80c660,0,2002,c8991cc0,eb80c67c) at ffs_lock+0x19
VOP_LOCK_APV(c0646cc0,eb80c660,c06402e0,eb80c7d0,0) at VOP_LOCK_APV+0x44
vn_lock(c8991cc0,2002,c8e8e300,4000,c851f300) at vn_lock+0x132
ffs_snapshot(c86d7000,cc978e00,eb80c9a4,6c,eb80c964) at ffs_snapshot+0x152b
ffs_mount(c86d7000,c8e8e300,0,c8e8e300,c9f71bb0) at ffs_mount+0x9af
vfs_domount(c8e8e300,c88a8760,c88a8870,11211000,c9a77510) at vfs_domount+0x728
vfs_donmount(c8e8e300,11211000,eb80cbec,c9314580,e) at vfs_donmount+0x12e
kernel_mount(c8737b80,11211000,eb80cc30,6c,bfbfe8c8) at kernel_mount+0x46
ffs_cmount(c8737b80,bfbfe110,11211000,c8e8e300,0) at ffs_cmount+0x85
mount(c8e8e300,eb80cd04,10,c8e8e300,8052000) at mount+0x21b
syscall(3b,3b,3b,8814819c,bfbfe0b0) at syscall+0x324
Xint0x80_syscall() at Xint0x80_syscall+0x1f

This block of code in ufs_inactive() is triggering the problem:

	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
		if ((ip->i_flag & (IN_CHANGE | IN_UPDATE | IN_MODIFIED)) == 0 &&
		    vn_write_suspend_wait(vp, NULL, V_NOWAIT)) {
			ip->i_flag &= ~IN_ACCESS;
		} else {
			(void) vn_write_suspend_wait(vp, NULL, V_WAIT);
			UFS_UPDATE(vp, 0);
		}
	}

I think the intent is to defer access time updates while writes are
suspended, but wait for writes to be allowed before syncing the other
timestamps.  Hmm, I wonder why any of IN_CHANGE, IN_UPDATE, or
IN_MODIFIED is set at this point.
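
To spell out how I read that block (the code is unchanged, the comments
below are mine, added for this mail):

	if (ip->i_flag & (IN_ACCESS | IN_CHANGE | IN_MODIFIED | IN_UPDATE)) {
		if ((ip->i_flag & (IN_CHANGE | IN_UPDATE | IN_MODIFIED)) == 0 &&
		    vn_write_suspend_wait(vp, NULL, V_NOWAIT)) {
			/*
			 * Only the access time is dirty and the file
			 * system is suspended: drop IN_ACCESS and skip
			 * the update, i.e. defer the atime update.
			 */
			ip->i_flag &= ~IN_ACCESS;
		} else {
			/*
			 * Something besides the access time is dirty
			 * (or writes are not suspended): sleep, with
			 * the vnode lock still held, until writes are
			 * allowed again, then sync the inode.  This
			 * V_WAIT is where sendmail (pid 99361) is
			 * sleeping on "suspfs" in the trace above.
			 */
			(void) vn_write_suspend_wait(vp, NULL, V_WAIT);
			UFS_UPDATE(vp, 0);
		}
	}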
They should have been cleared when the file system was synced before
MNTK_SUSPENDED was set.  I suspect it is all the handle_workitem_foo()
calls done by softdep_flushworklist() that set IN_CHANGE.  I don't see
anything that flushes those changes to disk other than maybe the syncer
thread ...

Ugh!  Another bug that I just noticed is that ffs_snapshot() clears
MNTK_SUSPENDED before entering its MNT_VNODE_FOREACH() loop, which
might re-enable access time updates by other threads.  Probably some
other flag should be set to allow this thread to bypass the "writes are
forbidden while MNTK_SUSPENDED is set" sanity check.
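
Just to make that last idea concrete, one shape it could take is
sketched below.  None of this exists in the tree: the TDP_SNAPWRITE
flag is made up and the exact place where the sanity check lives is
left vague; it only illustrates what "some other flag" might look like.

	/*
	 * Hypothetical sketch: leave MNTK_SUSPENDED set for the whole
	 * snapshot and mark the snapshot thread itself as allowed to
	 * write, instead of clearing MNTK_SUSPENDED before the
	 * MNT_VNODE_FOREACH() loop.
	 */

	/* In ffs_snapshot(), bracketing the MNT_VNODE_FOREACH() loop: */
	curthread->td_pflags |= TDP_SNAPWRITE;	/* hypothetical flag */
	/* MNT_VNODE_FOREACH() loop runs with MNTK_SUSPENDED still set */
	curthread->td_pflags &= ~TDP_SNAPWRITE;

	/* In the writes-forbidden-while-suspended sanity check, exempt
	   the marked thread: */
	if ((mp->mnt_kern_flag & MNTK_SUSPENDED) != 0 &&
	    (curthread->td_pflags & TDP_SNAPWRITE) == 0)
		return (EWOULDBLOCK);	/* or whatever the check does today */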