From owner-freebsd-current@FreeBSD.ORG  Fri May 27 20:43:06 2005
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
X-Original-To: freebsd-current@freebsd.org
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id A12AC16A41C
	for <freebsd-current@freebsd.org>; Fri, 27 May 2005 20:43:06 +0000 (GMT)
	(envelope-from PeterJeremy@optushome.com.au)
Received: from mail21.syd.optusnet.com.au (mail21.syd.optusnet.com.au
	[211.29.133.158])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1653043D1D
	for <freebsd-current@freebsd.org>; Fri, 27 May 2005 20:43:05 +0000 (GMT)
	(envelope-from PeterJeremy@optushome.com.au)
Received: from cirb503493.alcatel.com.au
	(c211-30-75-229.belrs2.nsw.optusnet.com.au [211.30.75.229])
	by mail21.syd.optusnet.com.au (8.12.11/8.12.11) with ESMTP id
	j4RKh3we005533
	(version=TLSv1/SSLv3 cipher=EDH-RSA-DES-CBC3-SHA bits=168 verify=NO);
	Sat, 28 May 2005 06:43:03 +1000
Received: from cirb503493.alcatel.com.au (localhost.alcatel.com.au [127.0.0.1])
	by cirb503493.alcatel.com.au (8.12.10/8.12.10) with ESMTP id
	j4RKh2Rx022491; Sat, 28 May 2005 06:43:02 +1000 (EST)
	(envelope-from pjeremy@cirb503493.alcatel.com.au)
Received: (from pjeremy@localhost)
	by cirb503493.alcatel.com.au (8.12.10/8.12.9/Submit) id j4RKh2vu022490; 
	Sat, 28 May 2005 06:43:02 +1000 (EST) (envelope-from pjeremy)
Date: Sat, 28 May 2005 06:43:02 +1000
From: Peter Jeremy <PeterJeremy@optushome.com.au>
To: Ted Faber <faber@isi.edu>
Message-ID: <20050527204302.GC18914@cirb503493.alcatel.com.au>
References: <20050526001806.GA1008@pun.isi.edu>
	<20050526080928.GE12640@cirb503493.alcatel.com.au>
	<20050526160846.GA6851@pun.isi.edu>
	<20050526203243.GB1055@pun.isi.edu>
	<20050527083734.GA18696@cirb503493.alcatel.com.au>
	<20050527152752.GA10069@pun.isi.edu>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20050527152752.GA10069@pun.isi.edu>
User-Agent: Mutt/1.4.2i
Cc: freebsd-current@freebsd.org
Subject: Re: hard deadlock(?) on -current; some debugging info, need help
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 27 May 2005 20:43:06 -0000

On Fri, 2005-May-27 08:27:52 -0700, Ted Faber wrote:
>work something out, but I do have a laptop running in the same
>environment (and with a kernel from the same source) that does not
>exhibit this problem.

That's a useful snippet.  I missed the bit about same source before.
What are the differences between the systems (including kernel
compilation options)?  That might provide a clue as to the underlying
problem.  Have you tried running the same sort of workload on your
laptop?  Is it feasible to run one of the kernels on both systems?
 
>> It might be useful to know some more details about that NFS mount
>> (fsid 0x0600ff07).  Can you tell us the mount parameters and what the
>> server is (OS type)?
>
>Most of the NFS filesystems are automounted.  I'm on the machine now, so
>I can't look at debugger output, but I can tell you that most of the NFS
>mounts that I can imagine either psi or bash looking at are automounted.
>The mount parameters are: timeo=8,retrans=9,intr

I didn't notice amd before.  If you can't avoid NFS, any chance of (at
least temporarily) hard-mounting all the relevant filesystems and
disabling amd?  amd acts as an NFS server to detect activity on the
automount filesystems.  Both the backtraces you posted show that one
process is blocked on an NFS request and amd is blocked on ufs.  The
locks on the second backtrace show that the bash waiting on an NFS
request is a root of the deadlock tree.  If that NFS request is
supposed to be handled by amd, that closes the deadlock cycle.
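
To spell out the cycle I have in mind, here is a rough sketch in Python
of a wait-for graph.  The node names are just labels from this thread,
not DDB output, and two of the edges are guesses on my part (marked in
the comments), so treat it as an illustration of the reasoning rather
than a diagnosis.

```python
# Hypothetical wait-for graph: each key is waiting on the value it maps to.
waits_for = {
    "bash": "nfs request",   # bash is sleeping on an NFS request
    "nfs request": "amd",    # ASSUMPTION: amd is the server for this request
    "amd": "ufs lock",       # amd is blocked on ufs
    "ufs lock": "bash",      # ASSUMPTION: bash holds the lock amd wants
}


def find_cycle(graph, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = {}  # node -> index of that node in path
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = graph.get(node)
    if node is None:
        return None  # the chain ended, so nothing is deadlocked
    return path[seen[node]:] + [node]


print(find_cycle(waits_for, "bash"))
```

If the "nfs request" -> "amd" edge is removed (the request goes to a
real server rather than amd), find_cycle returns None, which is why
taking amd out of the picture seems worth trying.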

Also, if your mounts are interruptible, that nfsreq sleep is
interruptible - you could try dropping into DDB, finding the process
sleeping on nfsreq and killing it ("kill signal_number pid" in ddb,
no '-' on the signal number), then using "cont" to recover.  That
might break the deadlock.

>For completeness, the server is a Solaris box.  Don't laugh:
>boreas:~$ uname -a
>SunOS boreas.isi.edu 5.9 Generic_117171-12 sun4u sparc

Sun's NFS implementations should be trustworthy :-).

> If moving the config does not solve it, is there some output from
>the debugger I should get about the file system?

I can't see any DDB command to dump the mount table and doing it
manually would be painful.  Have you managed to get a crash dump?  (If
not, what does "call doadump" do?)  Alternatively, have you ever tried
running remote GDB?

>  It really helps to talk these things
>out with someone knowledgeable.

Unfortunately, no-one knowledgeable has shown up :-).

-- 
Peter Jeremy