From owner-freebsd-current@FreeBSD.ORG Sun Jul 18 13:02:29 2010
Return-Path:
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id EF1C0106567D;
	Sun, 18 Jul 2010 13:02:29 +0000 (UTC)
	(envelope-from gavin@ury.york.ac.uk)
Received: from mail-gw2.york.ac.uk (mail-gw2.york.ac.uk [144.32.128.247])
	by mx1.freebsd.org (Postfix) with ESMTP id 73B6B8FC1C;
	Sun, 18 Jul 2010 13:02:29 +0000 (UTC)
Received: from ury.york.ac.uk (ury.york.ac.uk [144.32.108.81])
	by mail-gw2.york.ac.uk (8.13.6/8.13.6) with ESMTP id o6ID2O1A012886;
	Sun, 18 Jul 2010 14:02:24 +0100 (BST)
Received: from gavin (helo=localhost)
	by ury.york.ac.uk with local-esmtp (Exim 4.72)
	(envelope-from ) id 1OaTVg-0001HE-J1;
	Sun, 18 Jul 2010 14:02:24 +0100
Date: Sun, 18 Jul 2010 14:02:24 +0100 (BST)
From: Gavin Atkinson
X-X-Sender: gavin@ury.york.ac.uk
To: freebsd-current@FreeBSD.org
In-Reply-To:
Message-ID:
References:
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender:
X-York-MailScanner: Found to be clean
X-York-MailScanner-From: gavin@ury.york.ac.uk
Cc: jeff@FreeBSD.org
Subject: Re: Filesystem wedge, SUJ-related?
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Sun, 18 Jul 2010 13:02:30 -0000

On Sat, 17 Jul 2010, Gavin Atkinson wrote:

> Semi-regularly (every two-three days) I'm seeing what appears to be some
> sort of filesystem wedge. I usually see it initially with web browsers,
> but it's possible that's only because it's what produces most disk
> activity on this machine. I've seen it with both Opera and Firefox.
>
> What happens is that the process will just wedge.
A "procstat -kk" on it > shows the following stack backtrace: > > 9012 100243 firefox-bin initial thread mi_switch+0x21d > sleepq_switch+0x123 sleepq_wait+0x4d _sleep+0x357 getdirtybuf+0x21e > flush_deplist+0x6f softdep_sync_metadata+0x153 ffs_syncvnode+0x213 > ffs_fsync+0x43 fsync+0x148 syscallenter+0x1b5 syscall+0x4c > Xfast_syscall+0xe2 A bit more detail: it does look like whatever is supposed to periodically flush the journal just stops doing it's job. Presumably this is also the root cause of the "softdep: Out of journal space!" messages I have been seeing in the past, which I had assumed may have been fixed by r209717. (I'm running r209723 at the moment) While processes are starting to hang, "sh ffs" from ddb shows: db> sh ffs mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000 su_wl 0 su_wl_in 0 su_deps 0 su_req 0 mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c64800 su_wl 0 su_wl_in 0 su_deps 0 su_req 0 mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c66000 su_wl 0 su_wl_in 0 su_deps 17345 su_req 0 mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c66800 su_wl 0 su_wl_in 0 su_deps 55 su_req 0 Leaving it another couple of hours, I then see: db> sh ffs mp 0xffffff0002c45be0 / devvp 0xffffff0002c51000 fs 0xffffff0002c67000 su_wl 0 su_wl_in 0 su_deps 0 su_req 0 mp 0xffffff0002d705f0 /tmp devvp 0xffffff0002d48780 fs 0xffffff0002c64800 su_wl 0 su_wl_in 0 su_deps 36 su_req 0 mp 0xffffff0002c458e8 /usr devvp 0xffffff0002d485a0 fs 0xffffff0002c66000 su_wl 0 su_wl_in 0 su_deps 31899 su_req 0 mp 0xffffff0002c455f0 /var devvp 0xffffff0002d483c0 fs 0xffffff0002c66800 su_wl 0 su_wl_in 0 su_deps 95 su_req 0 so, su_deps is increasing significantly. During reboot, vnlru failed to stop within 60 seconds, and gave up on syncing 125 vnodes and 140 buffers (no idea if these are related). On reboot, SU+J fsck shows for /usr: ** SU+J Recovering /dev/ad4s1f ** Reading 33554432 byte journal from inode 150. 
** Building recovery table.
** Resolving unreferenced inode list.
** Processing journal entries.
** 405991 journal records in 18194944 bytes for 71.40% utilization
** Freed 3872 inodes (0 dirs) 48157 blocks, and 8744 frags.

So it seems clear that somehow the journal is filling up and never being
flushed.

Any other suggestions as to where I should go from here?

Thanks,

Gavin
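
P.S. For what it's worth, quick shell arithmetic on the two "sh ffs"
samples above shows how much su_deps on /usr accumulated over those couple
of hours (numbers taken straight from the ddb output; nothing else is
assumed):

```shell
# su_deps for /usr from the first and second "sh ffs" samples
first=17345
second=31899
echo "growth: $((second - first))"
```

That works out to 14554 dependencies piling up with su_wl staying at 0,
consistent with the worklist never being processed.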