From owner-freebsd-stable@FreeBSD.ORG Tue Jan 10 23:59:51 2006
Message-Id: <200601102359.k0ANxnB25342@green-dome.village.org>
From: dlm-fb@weaselfish.com
To: freebsd-stable@freebsd.org
Date: Tue, 10 Jan 2006 16:59:49 -0700
Subject: [5.4] getting stuck in nfs_rcvlock()

I'm regularly running into a situation on 5.4 (current as of a few
weeks ago) where a multi-threaded process comes to a screeching halt
due to a vnode lock dependency chain.  In other words, many of the
threads have various vnodes locked and are trying to acquire another.
When you chase down to the end of the chain, the one thread that's
not waiting for a vnode lock is asleep in nfs_rcvlock().  A typical
traceback is:

Tracing pid 380 tid 100274 td 0xc5e9c600
sched_switch(c5e9c600,0,2) at sched_switch+0x143
mi_switch(2,0) at mi_switch+0x1ba
thread_suspend_check(1) at thread_suspend_check+0x181
sleepq_catch_signals(c70bdda4,0,0,100,c6779a80) at sleepq_catch_signals+0xe9
msleep(c70bdda4,0,153,c08cd4ae,0) at msleep+0x239
nfs_rcvlock(c6cb2300) at nfs_rcvlock+0x63
nfs_reply(c6cb2300,0,f,0,18) at nfs_reply+0x18
nfs_request(c814b948,c6b72700,3,c5e9c600,c51e1780) at nfs_request+0x2f1
nfs_lookup(eba4b9d8) at nfs_lookup+0x53d
lookup(eba4bbd4,81a4,c5e9c600,0,c5bb9000) at lookup+0x2cf
namei(eba4bbd4) at namei+0x285
vn_open_cred(eba4bbd4,eba4bcd4,180,c51e1780,27) at vn_open_cred+0x5b
vn_open(eba4bbd4,eba4bcd4,180,27,c7c71318) at vn_open+0x1e
kern_open(c5e9c600,888a406,0,602,180) at kern_open+0xeb
open(c5e9c600,eba4bd04,3,0,296) at open+0x18
syscall(80a002f,2f,bdfc002f,8099200,180) at syscall+0x2b3
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (5, FreeBSD ELF32, open), eip = 0x48108bcb, esp = 0xbe9cedbc, ebp = 0xbe9cede8 ---

The obvious interpretation is a non-responding NFS server.  However,
no complaints of that sort are being generated by the client that's
showing the hang.  I've chased all sorts of herrings, red or
otherwise, and have not yet come up with a good explanation, let
alone a fix.  It's almost as if the request is never getting
transmitted.

One promising lead was that some debug writes showed retransmits
being suppressed because the congestion-avoidance window was too
small and never opened back up.  However, while artificially opening
the window helped throughput a little, it didn't make the hang go
away.
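To make that congestion-window point concrete, here is a toy userland
model of the accounting as I understand it from
sys/nfsclient/nfs_socket.c.  The names (nm_sent, nm_cwnd,
NFS_CWNDSCALE) are meant to mirror the kernel's, but the code itself
is my own simplification for the sake of discussion, not a copy of
the real thing, so take the details with a grain of salt:

/*
 * Toy userland model of the NFS client's congestion-window
 * accounting.  Simplified for discussion; not the kernel code.
 */
#include <stdio.h>

#define NFS_CWNDSCALE	256			/* one outstanding RPC */
#define NFS_MAXCWND	(NFS_CWNDSCALE * 32)	/* cap of 32 RPCs */

struct toy_nfsmount {
	int	nm_sent;	/* scaled count of RPCs on the wire */
	int	nm_cwnd;	/* scaled congestion window */
};

/* May a queued request be put on the wire right now? */
static int
may_transmit(struct toy_nfsmount *nmp)
{
	return (nmp->nm_sent < nmp->nm_cwnd);
}

/* A request was actually transmitted. */
static void
on_send(struct toy_nfsmount *nmp)
{
	nmp->nm_sent += NFS_CWNDSCALE;
}

/* A matching reply came back: additive increase, release a slot. */
static void
on_reply(struct toy_nfsmount *nmp)
{
	if (nmp->nm_cwnd <= nmp->nm_sent) {
		nmp->nm_cwnd += (NFS_CWNDSCALE * NFS_CWNDSCALE +
		    (nmp->nm_cwnd >> 1)) / nmp->nm_cwnd;
		if (nmp->nm_cwnd > NFS_MAXCWND)
			nmp->nm_cwnd = NFS_MAXCWND;
	}
	nmp->nm_sent -= NFS_CWNDSCALE;
}

/* An RPC timed out: multiplicative decrease, floor of one RPC. */
static void
on_timeout(struct toy_nfsmount *nmp)
{
	nmp->nm_cwnd >>= 1;
	if (nmp->nm_cwnd < NFS_CWNDSCALE)
		nmp->nm_cwnd = NFS_CWNDSCALE;
}

int
main(void)
{
	struct toy_nfsmount nm = { 0, NFS_CWNDSCALE };

	/* One request goes out, times out repeatedly, never answered. */
	on_send(&nm);
	on_timeout(&nm);
	on_timeout(&nm);

	/* nm_sent stays pinned at nm_cwnd, so nothing else is sent. */
	printf("sent=%d cwnd=%d may_transmit=%d\n",
	    nm.nm_sent, nm.nm_cwnd, may_transmit(&nm));

	/* If the reply finally did arrive, the window would open again. */
	on_reply(&nm);
	printf("after reply: sent=%d cwnd=%d may_transmit=%d\n",
	    nm.nm_sent, nm.nm_cwnd, may_transmit(&nm));
	return (0);
}

In this toy, as long as no reply ever arrives, nm_sent stays pinned at
nm_cwnd and the window never grows, so anything queued behind the
stuck request is never transmitted; that's the "never opening back up"
behavior the debug writes seemed to show.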
An additional data point is that the vnode the hanging request is
referencing is the directory we're in the process of creating a file
in.  There is always at least one other thread from the same process
waiting for that vnode to become free so it can create its own file
in the same directory.  Unfortunately, the application is unwieldy
enough that splitting it into multiple processes, to see whether the
problem is tied to threads rather than to NFS per se, isn't really
practical.

Does this sort of problem sound at all familiar to anyone?  Any help
will be greatly appreciated.  I'm running out of hair to pull....

	Dworkin
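P.S.  In case anyone wants to poke at this without our application:
stripped to its bones, the access pattern is just a number of threads
in one process, each creating its own files in the same NFS-mounted
directory.  The sketch below is a hypothetical stand-in (the mount
point, thread count, and file count are all made up), not our real
code, but it has the same shape; build it with "cc -pthread".

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS	8			/* arbitrary */
#define NFILES		1000			/* arbitrary */
#define TESTDIR		"/mnt/nfs/scratch"	/* made-up NFS mount point */

/* Each thread creates its own set of files in the shared directory. */
static void *
creator(void *arg)
{
	char path[256];
	int id, fd, i;

	id = (int)(intptr_t)arg;
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/t%d.f%d", TESTDIR, id, i);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd == -1) {
			perror("open");
			break;
		}
		(void)write(fd, "x", 1);
		close(fd);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, creator, (void *)(intptr_t)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return (0);
}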