From owner-freebsd-current@FreeBSD.ORG  Mon Apr 21 23:20:18 2003
Message-Id: <200304220620.h3M6KBXB025919@gw.catspoiler.org>
Date: Mon, 21 Apr 2003 23:20:11 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: current@FreeBSD.org
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
Subject: LOR in -current NFS client code + possible patch

I had something wedge up in the NFS client code in a recent version of
-current.
	5.0-CURRENT #61: Sat Apr 19 00:36:17 PDT 2003

I got the console message:

Apr 21 20:52:07 scratch kernel: nfs server mousie:/home: not responding

TCP connections continued to work, and another client was still able to
access the server, so the problem was definitely in the client code.

When I attempted to kill the process that seemed to be responsible for
wedging NFS, I got a lock order reversal message:

Apr 21 20:54:33 scratch kernel: lock order reversal
Apr 21 20:54:33 scratch kernel: 1st 0xc893ab68 vnode interlock (vnode interlock) @ /usr/src/sys/nfsclient/nfs_vnops.c:2792
Apr 21 20:54:33 scratch kernel: 2nd 0xc69f4248 process lock (process lock) @ /usr/src/sys/nfsclient/nfs_socket.c:1239
Apr 21 20:54:33 scratch kernel: Stack backtrace:

The backtrace (copied by hand):
	witness_lock()
	_mtx_lock_flags()
	nfs_sigintr() at nfs_sigintr+0x77
	nfs_flush() at nfs_flush+0x763
	nfs_close() at nfs_close+0x7a
	vn_close()
	vn_closefile()
	fdrop_locked()
	fdrop()
	closef()
	close()

I don't know what caused the original problem, but the lock order
reversal is caused by nfs_flush() calling nfs_sigintr() while holding a
vnode interlock; nfs_sigintr() then calls PROC_LOCK(), so the process
lock is acquired while the interlock is still held.
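
To make the ordering concrete, here is a rough user-space sketch of the
pattern, not kernel code.  It assumes pthread mutexes as stand-ins for
the vnode interlock and the process lock; check_interrupted(),
handle_sleep_error() and signal_pending are made-up names for
illustration only.  The point is just that the first lock is dropped
before the second is taken, and is retaken only if we are going to keep
looping.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vnode interlock and the process lock. */
static pthread_mutex_t vnode_interlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t proc_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical flag playing the role of "a signal is pending". */
static bool signal_pending;

/*
 * Stand-in for nfs_sigintr(): it takes the process lock to look at
 * the signal state, so the caller must not hold the interlock here.
 */
static int
check_interrupted(void)
{
	int intr;

	pthread_mutex_lock(&proc_lock);
	intr = signal_pending ? EINTR : 0;
	pthread_mutex_unlock(&proc_lock);
	return (intr);
}

/*
 * Stand-in for the error path after the sleep in the flush loop.
 * Called with vnode_interlock held.  Returns EINTR with the interlock
 * released, or 0 with the interlock held again.
 */
static int
handle_sleep_error(void)
{
	/* Precondition: the caller holds the interlock. */
	assert(pthread_mutex_trylock(&vnode_interlock) == EBUSY);

	/* Drop the interlock before the process lock is taken. */
	pthread_mutex_unlock(&vnode_interlock);
	if (check_interrupted() != 0)
		return (EINTR);

	/* Not interrupted: retake the interlock and keep going. */
	pthread_mutex_lock(&vnode_interlock);
	return (0);
}
```

With this shape the two locks are never held at the same time on this
path, so no ordering between them is established here.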

It looks to me like the following patch is the proper fix.  There is
another call to nfs_sigintr() in nfs_flush(), but I think that one is
safe, since it looks like BUF_TIMELOCK() must release the interlock in
the error case.  Comments?

Index: nfs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
retrieving revision 1.202
diff -u -r1.202 nfs_vnops.c
--- nfs_vnops.c	31 Mar 2003 23:26:10 -0000	1.202
+++ nfs_vnops.c	22 Apr 2003 06:03:28 -0000
@@ -2838,8 +2842,8 @@
 			error = msleep((caddr_t)&vp->v_numoutput, VI_MTX(vp),
 				slpflag | (PRIBIO + 1), "nfsfsync", slptimeo);
 			if (error) {
+			    VI_UNLOCK(vp);
 			    if (nfs_sigintr(nmp, NULL, td)) {
-				VI_UNLOCK(vp);
 				error = EINTR;
 				goto done;
 			    }
@@ -2847,6 +2851,7 @@
 				slpflag = 0;
 				slptimeo = 2 * hz;
 			    }
+			    VI_LOCK(vp);
 			}
 		}
 		if (!TAILQ_EMPTY(&vp->v_dirtyblkhd) && commit) {