From owner-freebsd-stable@FreeBSD.ORG  Thu Nov 13 07:54:01 2008
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0FD7F106564A
	for <freebsd-stable@freebsd.org>; Thu, 13 Nov 2008 07:54:01 +0000 (UTC)
	(envelope-from toasty@dragondata.com)
Received: from tokyo01.jp.mail.your.org (tokyo01.jp.mail.your.org [204.9.54.5])
	by mx1.freebsd.org (Postfix) with ESMTP id C31378FC1C
	for <freebsd-stable@freebsd.org>; Thu, 13 Nov 2008 07:54:00 +0000 (UTC)
	(envelope-from toasty@dragondata.com)
Received: from tokyo01.jp.mail.your.org (localhost.your.org [127.0.0.1])
	by tokyo01.jp.mail.your.org (Postfix) with ESMTP id A6BE72AD59D5
	for <freebsd-stable@freebsd.org>; Thu, 13 Nov 2008 07:34:13 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=dragondata.com; h=
	message-id:from:to:content-type:content-transfer-encoding
	:mime-version:subject:date; s=selector1; bh=SNRcWdUKG56O8wj64WbQ
	k5TXYeM=; b=gf9/UtRbL+Rp6gUj+VUyHhLRaalQA8pbimpAUSXhsyRxf0ynDJL6
	6kbcDLj1ZRoQKt8AsqYepyV+o1BvbQmnhNbnY9erC52ai50iLTtnKvV8jR8Xz1UC
	jY37JWRZXpuofccAe41cjHWBUTODl2sOSJV8mne/MllfgPLehfGPS0I=
DomainKey-Signature: a=rsa-sha1; c=nofws; d=dragondata.com; h=message-id:from
	:to:content-type:content-transfer-encoding:mime-version:subject:
	date; q=dns; s=selector1; b=CLtO4zTNLbtxp36dLkzq6iP0EoL9d43YlQNZ
	AqhS3KfMOdLu65fn4GWVAajSa0aB83Tr/pW4EgSi8EBiHVLkZPUkNAeDLDhVvteq
	scfb9sPVEevunEtWKiKJxcUgJCOvq+kDfquugUeCCH5+XTJxc6JvUhPyMuKY14Oa
	jFAe5N4=
Received: from mail.your.org (server3-a.your.org [64.202.112.67])
	(using TLSv1 with cipher ADH-AES256-SHA (256/256 bits))
	(No client certificate requested)
	by tokyo01.jp.mail.your.org (Postfix) with ESMTPS id 6B4872AD597E
	for <freebsd-stable@freebsd.org>; Thu, 13 Nov 2008 07:34:13 +0000 (UTC)
Received: from [IPv6:2002:451f:630b:1::1] (unknown [IPv6:2002:451f:630b:1::1])
	(using TLSv1 with cipher AES128-SHA (128/128 bits))
	(No client certificate requested)
	by mail.your.org (Postfix) with ESMTPSA id C9981A0A406
	for <freebsd-stable@freebsd.org>; Thu, 13 Nov 2008 07:33:50 +0000 (UTC)
Message-Id: <DA52E1DB-FE0C-496D-86E7-55D79D4C1D0E@dragondata.com>
From: Kevin Day <toasty@dragondata.com>
To: freebsd-stable@freebsd.org
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
Content-Transfer-Encoding: 7bit
Mime-Version: 1.0 (Apple Message framework v929.2)
Date: Thu, 13 Nov 2008 01:34:07 -0600
X-Mailer: Apple Mail (2.929.2)
Subject: Re: System deadlock when using mksnap_ffs
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 13 Nov 2008 07:54:01 -0000


(moving my thread from -fs to -stable)


Before touching anything, here's a description of the symptoms I  
see... Rather busy system, with quite a bit of filesystem activity  
occurring while the snapshot is being made. Quad-CPU amd64 box with  
16GB of RAM and a 6x10Krpm RAID array. Should be reasonably fast.

Filesystem   1K-blocks     Used     Avail Capacity   iused    ifree %iused  Mounted on
/dev/da0s1a  739339824 74357926 605834714    11%   1718540 93855474    2%   /

1.7 million inodes, 71G used of a 705G volume.
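
For anyone checking the summary line against the df output, here is a small sketch of the conversion. The four figures are copied from the df line above; the variable and function names are only illustrative.

```python
# Figures copied from the df output above (sizes are in 1K-blocks).
total_kb = 739339824
used_kb = 74357926
iused = 1718540
ifree = 93855474

KB_PER_GIB = 1024 * 1024


def kb_to_gib(kb):
    """Convert a count of 1K-blocks to GiB."""
    return kb / KB_PER_GIB


used_gib = kb_to_gib(used_kb)
total_gib = kb_to_gib(total_kb)
# Share of all inodes on the filesystem that are in use.
inode_pct = 100.0 * iused / (iused + ifree)

print(f"{used_gib:.1f}G used of {total_gib:.1f}G")
print(f"{iused / 1e6:.1f} million inodes in use ({inode_pct:.1f}% of all inodes)")
```

The space figures are binary gigabytes, which is why 74357926 1K-blocks comes out at about 71G and not 74G.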

Here's a timeline of what I see when starting to make a new snapshot.  
I've got a few windows running, showing "top", "iostat", etc.


Baseline disk activity before starting anything:

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0       24.0   2.0   355.6    32.0    1  10.7  28


0m0s: Snapshot begins, using "mount -u -o snapshot //.snap/weekly.0 /".  
Drives immediately jump to 100% busy, as expected.

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0      153.8   6.0  3378.6    95.9    2  16.9 100

The mount process is spending 100% of its time in "biord".


2m10s: The mount process starts spending more and more time in  
"snaplk", alternating with "biord".

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0       77.9  67.9  1270.7  3754.2    1  10.7 100


12m15s: The first intermittent slowdowns start affecting other  
processes on the system. Occasionally all active processes will get  
stuck in "snaplk" or "ufs" for 5-10 seconds before resuming.

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0       77.9  31.0  1150.8  1054.9    1  10.4 100


114m47s: Active processes are briefly stuck in "suspfs".

115m22s: Mount is now in "snaprdb". Active processes are now  
completely stuck in "snaplk". The system is still responsive to SIGINFO,  
top is still running, etc. Anything that needs the filesystem just hangs.

device     r/s   w/s    kr/s    kw/s wait svc_t  b
da0      238.8   0.0  3820.1     0.0    1   4.1  99

143m19s: Mount is now in "wdrain".

143m34s: Finished.

Snapshot logging shows "/: suspended 13.308 sec, redo 153 of 4058".  
Most processes were hung for 28 minutes.
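
As a rough cross-check of the 28-minute figure, the sketch below takes the two timeline stamps above (processes completely stuck at 115m22s, finished at 143m34s) and the quoted log line, and compares the observed hang with the suspension time the kernel reported. The helper names are made up for illustration, and the regular expression assumes only the log format quoted above.

```python
import re


def to_seconds(stamp):
    """Convert a timeline stamp such as '115m22s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+)s", stamp)
    if m is None:
        raise ValueError(f"unrecognized stamp: {stamp!r}")
    return int(m.group(1)) * 60 + int(m.group(2))


def fmt(seconds):
    """Format a number of seconds as e.g. '28m12s'."""
    return f"{seconds // 60}m{seconds % 60:02d}s"


# Stamps taken from the timeline above.
stuck_at = to_seconds("115m22s")   # processes completely stuck in "snaplk"
finished = to_seconds("143m34s")   # snapshot finished
hang_seconds = finished - stuck_at

# Log line exactly as quoted above.
log_line = "/: suspended 13.308 sec, redo 153 of 4058"
m = re.search(r"suspended ([\d.]+) sec, redo (\d+) of (\d+)", log_line)
suspended = float(m.group(1))
redo, total = int(m.group(2)), int(m.group(3))

print(f"observed hang:      {fmt(hang_seconds)} ({hang_seconds} s)")
print(f"reported suspended: {suspended} s")
print(f"redo:               {redo} of {total}")
```

The reported suspension (about 13 seconds) covers only a small part of the time processes were observed to be stuck, so the "suspended" figure alone understates the impact seen here.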


Is this what others are seeing? Some of the other complaints seem to  
describe it getting stuck in the "wdrain" state, which is not what I'm  
showing here.


Another mildly annoying note: Any process that touches ".snap" while a  
snapshot is being generated gets stuck in "ufs" until it finishes. I  
can understand wanting to keep operations in there in sync, but it  
would be really nice if "find /" wouldn't get hung when it tries to  
descend into .snap, for example.

ts5# cd /.snap
ts5# ls -l
^T
load: 0.17  cmd: ls 3696 [ufs] 0.00u 0.00s 0% 1496k
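
One possible way for a tree-walking script to sidestep the hang, sketched under the assumption that skipping snapshot directories is acceptable for the job at hand: prune ".snap" before descending. The function name and the skip_names parameter are hypothetical. Whether merely listing the parent directory can still block while a snapshot is in progress has not been verified here.

```python
import os


def walk_skipping(root, skip_names=(".snap",)):
    """Yield file paths under root without descending into skip_names."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never enters the skipped directories.
        dirnames[:] = [d for d in dirnames if d not in skip_names]
        for name in filenames:
            yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Walk the current directory; substitute "/" to cover the whole filesystem.
    for path in walk_skipping("."):
        print(path)
```

The in-place slice assignment matters: os.walk (in its default top-down mode) consults the same dirnames list to decide where to go next, so rebinding the name to a new list would not prune anything.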