Date: Sat, 28 May 2005 06:43:02 +1000
From: Peter Jeremy
To: Ted Faber
Cc: freebsd-current@freebsd.org
Subject: Re: hard deadlock(?) on -current; some debugging info, need help
Message-ID: <20050527204302.GC18914@cirb503493.alcatel.com.au>
In-Reply-To: <20050527152752.GA10069@pun.isi.edu>

On Fri, 2005-May-27 08:27:52 -0700, Ted Faber wrote:
>work something out, but I do have a laptop running in the same
>environment (and with a kernel from the same source) that does not
>exhibit this problem.

That's a useful snippet. I missed the bit about same source before.
What are the differences between the systems (including kernel
compilation options)? That might provide a clue as to the underlying
problem. Have you tried running the same sort of workload on your
laptop? Is it feasible to run one of the kernels on both systems?

>> It might be useful to know some more details about that NFS mount
>> (fsid 0x0600ff07). Can you tell us the mount parameters and what the
>> server is (OS type).
>
>Most of the nfs filesystems are automounted. I'm on the machine now, so
>I can't look at debugger output, but I can tell you that most of the NFS
>mounts that I can imagine either psi or bash looking at are automounted.
>The mount parameters are: timeo=8,retrans=9,intr

I didn't notice amd before. If you can't avoid NFS, any chance of (at
least temporarily) hard-mounting all the relevant filesystems and
disabling amd?

amd acts as an NFS server to detect activity on the automount
filesystems. Both the backtraces you posted show that one process is
blocked on an NFS request and amd is blocked on ufs. The locks on the
second backtrace show that the bash waiting on an NFS request is a root
of the deadlock tree. If that NFS request is supposed to be handled by
amd, you close the deadlock cycle.
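
To make the cycle argument concrete, here is a minimal sketch of it as a
wait-for graph. The edges are assumptions for illustration only: "bash
waits for amd" assumes the nfsreq really is served by amd, and "amd waits
for bash" assumes bash holds the ufs lock that amd is blocked on. The
name find_cycle and the node labels are made up for this example; nothing
here comes from the kernel.

```python
def find_cycle(waits_for):
    """Return a list of nodes forming a cycle in the wait-for graph, or None.

    waits_for maps each waiter to the list of parties it is blocked on.
    Uses a depth-first search; a back edge to a node still on the DFS
    stack means the waiters form a deadlock cycle.
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the DFS stack, finished
    colour = {}
    stack = []

    def visit(node):
        colour[node] = GREY
        stack.append(node)
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:
                # Back edge: everything from nxt to the stack top is the cycle.
                return stack[stack.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        colour[node] = BLACK
        return None

    for node in list(waits_for):
        if colour.get(node, WHITE) == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Assumed situation with amd in the picture (both edges are hypothetical):
with_amd = {
    "bash": ["amd"],    # bash sleeps on nfsreq, assumed to be served by amd
    "amd": ["bash"],    # amd blocked on a ufs lock assumed to be held by bash
}

# Same waiters, but the NFS request goes straight to the remote server:
without_amd = {
    "bash": ["nfs_server"],
    "amd": ["bash"],
}

print(find_cycle(with_amd))      # ['bash', 'amd', 'bash']
print(find_cycle(without_amd))   # None
```

The second graph is the point of disabling amd: bash may still wait on
the remote server, but nothing local is needed to answer it, so the
cycle cannot close.
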
Also, if your mounts are interruptible, that nfsreq sleep is
interruptible: you could try dropping into DDB, finding the process
sleeping on nfsreq and killing it ("kill signal_number pid" in DDB, no
'-' on the signal number), then using "cont" to recover. That might
break the deadlock.

>For completeness, the server is a Solaris box. Don't laugh:
>boreas:~$ uname -a
>SunOS boreas.isi.edu 5.9 Generic_117171-12 sun4u sparc

Sun's NFS implementations should be trustworthy :-).

> If moving the config does not solve it, is there some output from
>the debugger I should get about the file system?

I can't see any DDB command to dump the mount table, and doing it
manually would be painful. Have you managed to get a crash dump? (If
not, what does "call doadump" do?) Alternatively, have you ever tried
running remote GDB?

> It really helps to talk these things
>out with someone knowledgeable.

Unfortunately, no-one knowledgeable has shown up :-).

-- 
Peter Jeremy