From owner-freebsd-current@FreeBSD.ORG  Wed Sep 28 06:09:05 2005
Message-Id: <200509280609.j8S68vZ3000590@gw.catspoiler.org>
Date: Tue, 27 Sep 2005 23:08:57 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: current@FreeBSD.org
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
Cc: 
Subject: analysis of snapshot-related system deadlock

I've been experimenting with Peter Holm's kernel stress test suite and
file system snapshots.  I've been frequently seeing system deadlocks, so
I went looking for the cause.

In the latest instance, there were 12 threads waiting on "snaplk", and
the thread holding "snaplk" was sleeping on "wdrain".  Two of the
threads waiting on "snaplk" were syncer and bufdaemon, which is not a
good sign.

Ordinarily, I/O activity should eventually reduce runningbufspace below
lorunningspace and wake up the thread sleeping on "wdrain", but this is
where the problem gets interesting.  The stack trace of the thread
sleeping on "wdrain" is:

#0  0xc0653913 in sched_switch (td=0xc23fe300, newtd=0xc2275480, flags=1)
    at /usr/src/sys/kern/sched_4bsd.c:973
#1  0xc0649158 in mi_switch (flags=1, newtd=0x0)
    at /usr/src/sys/kern/kern_synch.c:356
#2  0xc066073c in sleepq_switch (wchan=0x0)
    at /usr/src/sys/kern/subr_sleepqueue.c:427
#3  0xc0660920 in sleepq_wait (wchan=0xc0984404)
    at /usr/src/sys/kern/subr_sleepqueue.c:539
#4  0xc0648dc9 in msleep (ident=0xc0984404, mtx=0xc0984420, priority=68, 
    wmesg=0xc0876f1c "wdrain", timo=0) at /usr/src/sys/kern/kern_synch.c:227
#5  0xc0687592 in bufwrite (bp=0xd648f558) at /usr/src/sys/kern/vfs_bio.c:383
#6  0xc0687bbd in bawrite (bp=0x0) at buf.h:401
#7  0xc077ca98 in ffs_copyonwrite (devvp=0xc2933770, bp=0xd6543e90)
    at /usr/src/sys/ufs/ffs/ffs_snapshot.c:2119
#8  0xc0788ec5 in ffs_geom_strategy (bo=0xc2933830, bp=0xd6543e90)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1686
#9  0xc068750e in bufwrite (bp=0xd6543e90) at buf.h:415
#10 0xc0788e32 in ffs_bufwrite (bp=0xd6543e90)
    at /usr/src/sys/ufs/ffs/ffs_vfsops.c:1663
#11 0xc0775a09 in ffs_update (vp=0xc5095cc0, waitfor=0) at buf.h:401
#12 0xc0793670 in ufs_mkdir (ap=0xeb785bb8)
    at /usr/src/sys/ufs/ufs/ufs_vnops.c:1556
#13 0xc08149e7 in VOP_MKDIR_APV (vop=0xc0910b60, a=0xeb785bb8)


The problem is that bufs passed through ffs_copyonwrite() get double
counted in runningbufspace, once for each pass through bufwrite().  This
includes the bufs being processed by all the threads that are waiting on
"snaplk".  If enough threads get backed up waiting for "snaplk", the
total size of the bufs they are processing will exceed lorunningspace.
Those bufs cannot complete until "snaplk" is released, so runningbufspace
can never drop below lorunningspace, and any threads sleeping on "wdrain"
will sleep forever.
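
The arithmetic behind this can be sketched in a few lines of user-space
C.  This is only a toy model, not kernel code: the function names are
hypothetical, every buf is assumed to be the same size, and the exact
comparison against lorunningspace (at or below) is an assumption of the
model.  It counts the bufs that stay charged to runningbufspace while
their threads are stuck behind "snaplk", plus the lock holder's own buf,
and asks whether the "wdrain" sleeper could still be woken.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Bytes that stay charged to runningbufspace once all other I/O has
 * drained: one buf per thread waiting on "snaplk", plus the buf the
 * lock holder was writing when it entered the copy-on-write path.
 * (Hypothetical helper; assumes equal-sized bufs.)
 */
static long
stuck_runningbufspace(int nwaiters, long bufsize)
{
	assert(nwaiters >= 0 && bufsize >= 0);
	return ((long)(nwaiters + 1) * bufsize);
}

/*
 * The "wdrain" sleeper is modelled as wakeable only when the stuck
 * total is at or below lorunningspace (comparison is an assumption).
 */
static bool
wdrain_can_wake(long lorunningspace, int nwaiters, long bufsize)
{
	return (stuck_runningbufspace(nwaiters, bufsize) <= lorunningspace);
}
```

With the 12 waiters seen above, and assuming purely for illustration
64 KB bufs and a lorunningspace of 512 KB, the stuck total is 13 * 65536
= 851968 bytes, which is above 524288, so wdrain_can_wake() returns
false.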

Probably the easiest fix would be to call runningbufwakeup() from
ffs_copyonwrite() before grabbing "snaplk", and increase runningbufspace
again before returning from ffs_copyonwrite().  To borrow from the
comment on waitrunningbufspace(), the bufs waiting for "snaplk" aren't
yet async writes currently running, so they should not be counted in
runningbufspace while they wait.
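
A rough user-space sketch of that idea follows.  It is a toy model under
stated assumptions, not a patch: the toy_* names are hypothetical,
toy_uncharge() merely stands in for what runningbufwakeup() does to the
accounting, all bufs are the same size, and the wakeup comparison is
assumed to be "at or below lorunningspace".  The apply_fix flag lets the
same scenario be run with and without the proposed change.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_WAITERS 64	/* arbitrary limit for the model */

struct toy_acct {
	long runningbufspace;	/* bytes charged as running writes */
	long lorunningspace;	/* "wdrain" sleepers wake at or below this */
};

struct toy_buf {
	long size;		/* buffer size in bytes */
	long charged;		/* bytes currently counted for this buf */
};

/* Models the charge taken on each pass through bufwrite(). */
static void
toy_charge(struct toy_acct *a, struct toy_buf *bp)
{
	bp->charged = bp->size;
	a->runningbufspace += bp->charged;
}

/* Stand-in for the accounting side of runningbufwakeup(). */
static void
toy_uncharge(struct toy_acct *a, struct toy_buf *bp)
{
	a->runningbufspace -= bp->charged;
	bp->charged = 0;
	assert(a->runningbufspace >= 0);
}

/* Entry to the modelled copy-on-write path, before waiting on "snaplk". */
static void
toy_cow_enter(struct toy_acct *a, struct toy_buf *bp, bool apply_fix)
{
	if (apply_fix)
		toy_uncharge(a, bp);
}

/* Exit from the modelled copy-on-write path: restore the charge. */
static void
toy_cow_exit(struct toy_acct *a, struct toy_buf *bp, bool apply_fix)
{
	if (apply_fix)
		toy_charge(a, bp);
}

/* Returns true if the lock holder, asleep on "wdrain", can be woken. */
static bool
toy_holder_wakes(long lorunningspace, int nwaiters, long bufsize,
    bool apply_fix)
{
	struct toy_acct a = { 0, lorunningspace };
	struct toy_buf waiting[TOY_MAX_WAITERS];
	struct toy_buf held = { bufsize, 0 };	/* buf being written */
	struct toy_buf copy = { bufsize, 0 };	/* snapshot copy block */
	bool wakes;
	int i;

	assert(nwaiters >= 0 && nwaiters <= TOY_MAX_WAITERS);

	/* The holder writes its buf, takes "snaplk", issues the copy. */
	toy_charge(&a, &held);
	toy_cow_enter(&a, &held, apply_fix);
	toy_charge(&a, &copy);

	/* Other writers charge their bufs, then block on "snaplk". */
	for (i = 0; i < nwaiters; i++) {
		waiting[i].size = bufsize;
		waiting[i].charged = 0;
		toy_charge(&a, &waiting[i]);
		toy_cow_enter(&a, &waiting[i], apply_fix);
	}

	/* The copy block's I/O completes; nothing else can. */
	toy_uncharge(&a, &copy);
	wakes = (a.runningbufspace <= a.lorunningspace);

	/* If woken, the holder leaves with its buf charged as before. */
	if (wakes) {
		toy_cow_exit(&a, &held, apply_fix);
		assert(held.charged == bufsize);
	}
	return (wakes);
}
```

In this model, with 12 waiters and (again purely as illustrative values)
64 KB bufs against a 512 KB lorunningspace, the holder stays asleep
without the change and is woken with it, because the waiting bufs no
longer count against runningbufspace while they sit behind "snaplk".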