From owner-freebsd-stable@FreeBSD.ORG Tue Jan 10 23:59:51 2006
Message-Id: <200601102359.k0ANxnB25342@green-dome.village.org>
From: dlm-fb@weaselfish.com
To: freebsd-stable@freebsd.org
Date: Tue, 10 Jan 2006 16:59:49 -0700
Subject: [5.4] getting stuck in nfs_rcvlock()

I'm regularly running into a situation on 5.4 (current as of a few
weeks ago) where a multi-threaded process comes to a screeching halt
due to a vnode lock dependency chain.  In other words, many of the
threads have various vnodes locked and are trying to acquire another.
When you chase down to the end of the chain, the one thread that's
not waiting for a vnode lock is asleep in nfs_rcvlock().  A typical
traceback is:

Tracing pid 380 tid 100274 td 0xc5e9c600
sched_switch(c5e9c600,0,2) at sched_switch+0x143
mi_switch(2,0) at mi_switch+0x1ba
thread_suspend_check(1) at thread_suspend_check+0x181
sleepq_catch_signals(c70bdda4,0,0,100,c6779a80) at sleepq_catch_signals+0xe9
msleep(c70bdda4,0,153,c08cd4ae,0) at msleep+0x239
nfs_rcvlock(c6cb2300) at nfs_rcvlock+0x63
nfs_reply(c6cb2300,0,f,0,18) at nfs_reply+0x18
nfs_request(c814b948,c6b72700,3,c5e9c600,c51e1780) at nfs_request+0x2f1
nfs_lookup(eba4b9d8) at nfs_lookup+0x53d
lookup(eba4bbd4,81a4,c5e9c600,0,c5bb9000) at lookup+0x2cf
namei(eba4bbd4) at namei+0x285
vn_open_cred(eba4bbd4,eba4bcd4,180,c51e1780,27) at vn_open_cred+0x5b
vn_open(eba4bbd4,eba4bcd4,180,27,c7c71318) at vn_open+0x1e
kern_open(c5e9c600,888a406,0,602,180) at kern_open+0xeb
open(c5e9c600,eba4bd04,3,0,296) at open+0x18
syscall(80a002f,2f,bdfc002f,8099200,180) at syscall+0x2b3
Xint0x80_syscall() at Xint0x80_syscall+0x1f
--- syscall (5, FreeBSD ELF32, open), eip = 0x48108bcb, esp = 0xbe9cedbc, ebp = 0xbe9cede8 ---

The obvious interpretation is a non-responding NFS server.  However,
no complaints of that sort are being generated by the client that's
showing the hang.  I've chased all sorts of herrings, red or
otherwise, and have not yet come up with a good explanation, let
alone a fix.  It's almost as if the request is never getting
transmitted.

One promising lead was that some debug writes showed retransmits
being suppressed because the congestion-avoidance window was too
small and never opened back up.  However, while artificially opening
the window helped throughput a little, it didn't make the hang go
away.
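To make that congestion-window point concrete, here is a toy userland
model of the accounting as I understand it from
sys/nfsclient/nfs_socket.c.  The names (nm_sent, nm_cwnd,
NFS_CWNDSCALE) are meant to mirror the kernel's, but the code itself
is my own simplification for the sake of discussion, not a copy of
the real thing, so take the details with a grain of salt:

/*
 * Toy userland model of the NFS client's congestion-window
 * accounting.  Simplified for discussion; not the kernel code.
 */
#include <stdio.h>

#define NFS_CWNDSCALE	256			/* one outstanding RPC */
#define NFS_MAXCWND	(NFS_CWNDSCALE * 32)	/* cap of 32 RPCs */

struct toy_nfsmount {
	int	nm_sent;	/* scaled count of RPCs on the wire */
	int	nm_cwnd;	/* scaled congestion window */
};

/* May a queued request be put on the wire right now? */
static int
may_transmit(struct toy_nfsmount *nmp)
{
	return (nmp->nm_sent < nmp->nm_cwnd);
}

/* A request was actually transmitted. */
static void
on_send(struct toy_nfsmount *nmp)
{
	nmp->nm_sent += NFS_CWNDSCALE;
}

/* A matching reply came back: additive increase, release a slot. */
static void
on_reply(struct toy_nfsmount *nmp)
{
	if (nmp->nm_cwnd <= nmp->nm_sent) {
		nmp->nm_cwnd += (NFS_CWNDSCALE * NFS_CWNDSCALE +
		    (nmp->nm_cwnd >> 1)) / nmp->nm_cwnd;
		if (nmp->nm_cwnd > NFS_MAXCWND)
			nmp->nm_cwnd = NFS_MAXCWND;
	}
	nmp->nm_sent -= NFS_CWNDSCALE;
}

/* An RPC timed out: multiplicative decrease, floor of one RPC. */
static void
on_timeout(struct toy_nfsmount *nmp)
{
	nmp->nm_cwnd >>= 1;
	if (nmp->nm_cwnd < NFS_CWNDSCALE)
		nmp->nm_cwnd = NFS_CWNDSCALE;
}

int
main(void)
{
	struct toy_nfsmount nm = { 0, NFS_CWNDSCALE };

	/* One request goes out, times out repeatedly, never answered. */
	on_send(&nm);
	on_timeout(&nm);
	on_timeout(&nm);

	/* nm_sent stays pinned at nm_cwnd, so nothing else is sent. */
	printf("sent=%d cwnd=%d may_transmit=%d\n",
	    nm.nm_sent, nm.nm_cwnd, may_transmit(&nm));

	/* If the reply finally did arrive, the window would open again. */
	on_reply(&nm);
	printf("after reply: sent=%d cwnd=%d may_transmit=%d\n",
	    nm.nm_sent, nm.nm_cwnd, may_transmit(&nm));
	return (0);
}

In this toy, as long as no reply ever arrives, nm_sent stays pinned at
nm_cwnd and the window never grows, so anything queued behind the
stuck request is never transmitted; that's the "never opening back up"
behavior the debug writes seemed to show.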
An additional data point is that the vnode the hanging request is
referencing is the directory we're in the process of creating a file
in.  There is always at least one other thread from the same process
waiting for that vnode to become free so it can create its own file
in the same directory.  Unfortunately, the application is unwieldy
enough that splitting it into multiple processes, to see whether the
problem is tied to threads rather than to NFS per se, isn't really
practical.

Does this sort of problem sound at all familiar to anyone?  Any help
will be greatly appreciated.  I'm running out of hair to pull....

	Dworkin
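P.S.  In case anyone wants to poke at this without our application:
stripped to its bones, the access pattern is just a number of threads
in one process, each creating its own files in the same NFS-mounted
directory.  The sketch below is a hypothetical stand-in (the mount
point, thread count, and file count are all made up), not our real
code, but it has the same shape; build it with "cc -pthread".

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS	8			/* arbitrary */
#define NFILES		1000			/* arbitrary */
#define TESTDIR		"/mnt/nfs/scratch"	/* made-up NFS mount point */

/* Each thread creates its own set of files in the shared directory. */
static void *
creator(void *arg)
{
	char path[256];
	int id, fd, i;

	id = (int)(intptr_t)arg;
	for (i = 0; i < NFILES; i++) {
		snprintf(path, sizeof(path), "%s/t%d.f%d", TESTDIR, id, i);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd == -1) {
			perror("open");
			break;
		}
		(void)write(fd, "x", 1);
		close(fd);
	}
	return (NULL);
}

int
main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, creator, (void *)(intptr_t)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return (0);
}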