From: Don Lewis
To: current@FreeBSD.org
Date: Mon, 21 Apr 2003 23:20:11 -0700 (PDT)
Subject: LOR in -current NFS client code + possible patch
Message-Id: <200304220620.h3M6KBXB025919@gw.catspoiler.org>

I had something wedge up in the NFS client code in a recent version of -current:

    5.0-CURRENT #61: Sat Apr 19 00:36:17 PDT 2003

I got the console message:

    Apr 21 20:52:07 scratch kernel: nfs server mousie:/home: not responding

TCP connections continued to work, and another client was still able to access the server, so the problem was definitely in the client code.

When I attempted to kill the process that seemed to be responsible for wedging NFS, I got a lock order reversal message:

    Apr 21 20:54:33 scratch kernel: lock order reversal
    Apr 21 20:54:33 scratch kernel: 1st 0xc893ab68 vnode interlock (vnode interlock) @ /usr/src/sys/nfsclient/nfs_vnops.c:2792
    Apr 21 20:54:33 scratch kernel: 2nd 0xc69f4248 process lock (process lock) @ /usr/src/sys/nfsclient/nfs_socket.c:1239
    Apr 21 20:54:33 scratch kernel: Stack backtrace:

The backtrace (copied by hand):

    witness_lock()
    _mtx_lock_flags()
    nfs_sigintr() at nfs_sigintr+0x77
    nfs_flush() at nfs_flush+0x763
    nfs_close() at nfs_close+0x7a
    vn_close()
    vn_closefile()
    fdrop_locked()
    fdrop()
    closef()
    close()

I don't know what caused the original problem, but the lock order reversal is caused by nfs_flush() calling nfs_sigintr() while holding a vnode interlock, and nfs_sigintr() calls PROC_LOCK(). It looks to me like the following patch is the proper fix. There is another call to nfs_sigintr() in nfs_flush(), but it looks like BUF_TIMELOCK() must release the interlock in the error case.

Comments?

Index: nfs_vnops.c
===================================================================
RCS file: /home/ncvs/src/sys/nfsclient/nfs_vnops.c,v
retrieving revision 1.202
diff -u -r1.202 nfs_vnops.c
--- nfs_vnops.c	31 Mar 2003 23:26:10 -0000	1.202
+++ nfs_vnops.c	22 Apr 2003 06:03:28 -0000
@@ -2838,8 +2842,8 @@
 				error = msleep((caddr_t)&vp->v_numoutput, VI_MTX(vp),
 				    slpflag | (PRIBIO + 1), "nfsfsync", slptimeo);
 				if (error) {
+				    VI_UNLOCK(vp);
 				    if (nfs_sigintr(nmp, NULL, td)) {
-					VI_UNLOCK(vp);
 					error = EINTR;
 					goto done;
 				    }
@@ -2847,6 +2851,7 @@
 					slpflag = 0;
 					slptimeo = 2 * hz;
 				    }
+				    VI_LOCK(vp);
 				}
 			}
 			if (!TAILQ_EMPTY(&vp->v_dirtyblkhd) && commit) {
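
For illustration only, the locking pattern the patch applies (drop the interlock before calling a routine that takes the process lock, then retake it afterwards) can be sketched in user-space C with pthread mutexes. This is not the kernel code: vnode_interlock, proc_lock, check_signal(), post_signal() and flush_wait() are hypothetical stand-ins for the vnode interlock, the process lock, nfs_sigintr() and the wait loop in nfs_flush(), and the sleep_error argument merely models a nonzero return from msleep().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the vnode interlock and the process lock. */
static pthread_mutex_t vnode_interlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t proc_lock = PTHREAD_MUTEX_INITIALIZER;

static bool signal_pending;                 /* protected by proc_lock */
static _Thread_local bool interlock_held;   /* does this thread hold the interlock? */

static void vi_lock(void)
{
    pthread_mutex_lock(&vnode_interlock);
    interlock_held = true;
}

static void vi_unlock(void)
{
    interlock_held = false;
    pthread_mutex_unlock(&vnode_interlock);
}

/* Test helper: mark a signal as pending (or not) under the process lock. */
void post_signal(bool pending)
{
    pthread_mutex_lock(&proc_lock);
    signal_pending = pending;
    pthread_mutex_unlock(&proc_lock);
}

/* Stand-in for nfs_sigintr(): takes the process lock to look at signal state. */
static bool check_signal(void)
{
    /* The ordering rule: never take proc_lock while holding the interlock. */
    assert(!interlock_held);
    pthread_mutex_lock(&proc_lock);
    bool pending = signal_pending;
    pthread_mutex_unlock(&proc_lock);
    return pending;
}

/*
 * Stand-in for the wait loop in nfs_flush(). Returns 0 or EINTR.
 * The interlock is never held on return.
 */
int flush_wait(int sleep_error)
{
    vi_lock();
    if (sleep_error) {
        vi_unlock();            /* drop the interlock before check_signal() */
        if (check_signal())
            return EINTR;       /* interlock already released */
        vi_lock();              /* retake it before continuing the loop */
    }
    vi_unlock();
    return 0;
}
```

In this sketch, calling flush_wait(1) after post_signal(true) returns EINTR without ever holding both locks at once. Moving the vi_unlock() call back inside the if (check_signal()) branch, as in the unpatched ordering, would trip the assertion in check_signal().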