Date: Fri, 13 Nov 1998 15:37:58 -0700 (MST)
From: "David G. Andersen"
To: freebsd-hackers@FreeBSD.ORG
Cc: mike@fast.cs.utah.edu, sclawson@cs.utah.edu, danderse@cs.utah.edu
Subject: amd/NFS INTR hang - more details.
Message-ID: <13900.45689.285484.668273@torrey.cs.utah.edu>

More on the hang mentioned earlier that we've been tracking down. This is
3.0-CURRENT on dual PII-350 machines. We're running an older version of
amd, but the problem occurs more frequently with the new version.

We can reliably hang the machine by:

   open a file over NFS
   write some data to the file (how much appears irrelevant)
   close the file descriptor
   while doing so, ctrl-C (SIGINT) the process which is closing the
      file descriptor

(This likely explains the prevalence of hangs in Netscape and XEmacs,
both of which use quite a few signals.
We've replicated it on a machine running nothing but the bare essentials
and nfsiod.)

The kernel still responds to pings and such, but no userland code executes
after the hang. If we force the kernel to panic and examine the crash
dump, we find that it's hung in a tsleep call in vinvalbuf
(sys/kern/vfs_subr.c):

        while (vp->v_numoutput) {
                vp->v_flag |= VBWAIT;
  =>            tsleep((caddr_t)&vp->v_numoutput,
                        slpflag | (PRIBIO + 1),
                        "vinvlbuf", slptimeo);
        }

Looking at it, it appears that (quoting shamelessly from Mike Hibler, who
also peeked at it):

The test program is stuck in this loop in vinvalbuf because there is a
SIGINT pending. This causes tsleep to return immediately (without
sleeping) with the return value EINTR or ERESTART, but they aren't
checking the return value! Hence, it spins forever in this loop because...

Meanwhile, one of the pending nfsbiods has been awakened because the reply
to its write request has arrived, but it never gets to run. The other
three nfsbiods are blocked because only one biod can be in the socket
receive at a time. And until the biods return, v_numoutput won't be
decremented.

It works with no nfsbiods because the test program does all the buffer
writes itself, so by the time it gets to vinvalbuf, v_numoutput is 0.

Unfortunately, I don't know what the right behavior is off the top of my
head. This appears to be a FreeBSDism that isn't in our code or NetBSD.
Any thoughts / suggested fixes would be appreciated.

Interestingly, this appears to be at least partly separate from the
problem the other person reported, where processes get stuck in "D"
state over NFS. With the nfsiods disabled we're seeing that problem too,
but haven't looked into it yet.

   -Dave

-- 
work: danderse@cs.utah.edu                     me:  angio@pobox.com
      University of Utah                            http://www.angio.net/
      Department of Computer Science
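
To illustrate the vinvalbuf analysis above, here is a minimal user-space
sketch of why ignoring the sleep's return value turns the loop into a
spin. Everything in it is a hypothetical stand-in, not kernel code:
fake_vnode mimics only the v_numoutput counter, fake_tsleep_signal_pending
mimics tsleep returning EINTR at once when a signal is pending, and
max_iters exists only so the demonstration terminates. Whether bailing out
with the error is the right behavior for vinvalbuf is exactly the open
question in this message; the second function only shows what checking
the return value would look like.

```c
#include <errno.h>

/* Hypothetical stand-in for the one vnode field the loop looks at. */
struct fake_vnode {
    int v_numoutput;            /* writes still in flight */
};

/*
 * Hypothetical stand-in for tsleep() when a signal is already pending:
 * it does not sleep and returns EINTR immediately.
 */
int fake_tsleep_signal_pending(void)
{
    return EINTR;
}

/*
 * Same shape as the loop in the crash dump: the return value is ignored.
 * Nothing else gets to run, so v_numoutput never drops and the loop
 * spins. max_iters is an artificial bound (an assumption of this demo).
 * Returns the number of iterations executed.
 */
long wait_ignoring_error(struct fake_vnode *vp, long max_iters)
{
    long iters = 0;

    while (vp->v_numoutput && iters < max_iters) {
        (void)fake_tsleep_signal_pending();
        iters++;
    }
    return iters;
}

/*
 * Variant that checks the return value and leaves the loop on error.
 * Stores the iteration count in *iters_out and returns the error
 * (0 if the counter drained normally).
 */
int wait_checking_error(struct fake_vnode *vp, long *iters_out)
{
    long iters = 0;
    int error = 0;

    while (vp->v_numoutput) {
        iters++;
        error = fake_tsleep_signal_pending();
        if (error)
            break;              /* give up instead of spinning */
    }
    *iters_out = iters;
    return error;
}
```

With v_numoutput set to 3, wait_ignoring_error runs until it hits
max_iters, while wait_checking_error returns EINTR after one pass and
leaves the counter untouched for the caller to deal with.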