From owner-freebsd-current@FreeBSD.ORG  Mon Nov 10 09:44:01 2003
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP
	id 39A7316A4CE; Mon, 10 Nov 2003 09:44:01 -0800 (PST)
Received: from spider.deepcore.dk
	(cpe.atm2-0-53484.0x50a6c9a6.abnxx9.customer.tele.dk [80.166.201.166])
	by mx1.FreeBSD.org (Postfix) with ESMTP
	id 066CF43FCB; Mon, 10 Nov 2003 09:44:00 -0800 (PST)
	(envelope-from sos@spider.deepcore.dk)
Received: from spider.deepcore.dk (localhost [127.0.0.1])
	by spider.deepcore.dk (8.12.10/8.12.10) with ESMTP id hAAHieEQ008463;
	Mon, 10 Nov 2003 18:44:40 +0100 (CET)
	(envelope-from sos@spider.deepcore.dk)
Received: (from sos@localhost)
	by spider.deepcore.dk (8.12.10/8.12.10/Submit) id hAAHiefC008462;
	Mon, 10 Nov 2003 18:44:40 +0100 (CET)
	(envelope-from sos)
From: Soren Schmidt <sos@spider.deepcore.dk>
Message-Id: <200311101744.hAAHiefC008462@spider.deepcore.dk>
In-Reply-To: <Pine.NEB.3.96L.1031110111305.51440G-100000@fledge.watson.org>
To: Robert Watson <rwatson@freebsd.org>
Date: Mon, 10 Nov 2003 18:44:40 +0100 (CET)
X-Mailer: ELM [version 2.4ME+ PL99f (25)]
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=ISO-8859-1
X-mail-scanned: by DeepCore Virus & Spam killer v1.3
cc: current@freebsd.org
Subject: Re: Still getting NFS client locking up
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Mon, 10 Nov 2003 17:44:01 -0000

It seems Robert Watson wrote:
> How fast are your systems, speaking of which?  I live in the world of
> 300-500 mhz machines at work, and 300-800 mhz boxes at home.  If you're
> using multi-ghz boxes, that could well be the distinguishing factor
> between our configurations...

Server is a 533MHz VIA C3, clients are everything from a 300MHz PII to a 2.6GHz P4.

> Ok, here's the strategy I was planning to take once I could reproduce it:
> 
> (1) Attempt to further narrow down responsibility to client/server.  In
>     particular, see if an apparent hang on one client affects the other
>     clients. 

For me it's just the server end that fails; I've not seen the client hang.

> (2) Investigate Soren's report that killing and restarting nfsd on the
>     server would clear the hang.

Yup, that works; in fact I now have that in my crontab, run every minute,
to keep NFS from hosing my setup here.
NOTE: I also still need to ifconfig down/up my interfaces on some
boxes or the netstack will freeze (again done every minute in crontab).
However, when NFS locks up it seems totally unrelated, i.e. all other
network traffic works...

> (3) Look at stack traces of involved processes on both the client and
>     server: in particular, look at traces for any client blocked in NFS,
>     any nfsiod processes on the client, and the nfsd processes on the
>     server.  Also look at the wait channels on clients and servers for
>     these processes.  Particularly interested in whether nfsd processes
>     are blocked trying to grab locks.

Ok, will do..

> (4) Look at netstat information for NFS sockets, in particular, if the
>     buffers are full, or not being drained.  In particular, on the server,
>     is the input queue not being drained by nfsd worker threads? 

Netstat doesn't seem to give any hints or even useful info here;
are there any special commands you want the output from?
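
One way to check the "buffers full or not being drained" question from (4)
is to pull the Recv-Q/Send-Q columns for the NFS port out of saved
`netstat -an` output. The following is only a rough sketch: the function
name, the sample text and the addresses in it are made up for illustration,
and it assumes the usual "Proto Recv-Q Send-Q Local Foreign" column order
and the host.port address style, with NFS on port 2049.

```python
def nfs_queue_backlog(netstat_text, port="2049"):
    """Return (proto, recv_q, send_q, local) for NFS sockets holding queued data."""
    backlog = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Socket lines look like: Proto Recv-Q Send-Q Local Foreign [state]
        if len(fields) < 5 or not fields[0].startswith(("tcp", "udp")):
            continue
        proto, recv_q, send_q, local = fields[:4]
        if not (recv_q.isdigit() and send_q.isdigit()):
            continue
        # Assumption: addresses are printed as host.port, e.g. "*.2049"
        if local.rsplit(".", 1)[-1] != port:
            continue
        if int(recv_q) > 0 or int(send_q) > 0:
            backlog.append((proto, int(recv_q), int(send_q), local))
    return backlog


# Hypothetical sample, not real output from the affected server.
SAMPLE_NETSTAT = """\
Active Internet connections (including servers)
Proto Recv-Q Send-Q  Local Address          Foreign Address        (state)
tcp4       0      0  *.2049                 *.*                    LISTEN
udp4   41600      0  *.2049                 *.*
tcp4       0      0  10.0.0.1.22            10.0.0.2.51234         ESTABLISHED
"""

for entry in nfs_queue_backlog(SAMPLE_NETSTAT):
    print(entry)
```

A Recv-Q on the server's NFS socket that stays non-zero across several
samples would suggest the nfsd workers are not draining their input.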

> (5) Try backing out src/sys/nfsserver/nfs_serv.c:1.137, which removed
>     another deadlock problem, but did change locking behavior in the NFS
>     server.

No change, already tried.

> (6) Look at packet traces between the client and server with ethereal,
>     which has pretty good NFS decoding.  Is the client retransmitting an
>     RPC to the server and the server just isn't responding, or is the
>     client failing to transmit?  At the point of the hang, what sorts of
>     RPCs are outstanding to the server?  In the past, we've seen "apparent
>     hangs" when some or another more obscure unusual error case on the NFS
>     server fails to respond to an RPC, which causes the client to "wait
>     forever".

I can try that easily, I'll get a trace to you later tonight...
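
The question in (6), whether the client keeps retransmitting an RPC that
the server never answers, can be sketched roughly as below. It relies on
the fact that an ONC RPC client reuses the same XID when it retransmits a
call. Everything else is an assumption for illustration: the (kind, xid)
record format is hypothetical and stands for whatever gets exported from
the capture, and the sample values are invented.

```python
from collections import Counter


def unanswered_calls(records):
    """Map each XID that never got a reply to the number of times it was sent.

    records: iterable of (kind, xid) pairs, kind being "call" or "reply".
    """
    calls = Counter()
    replied = set()
    for kind, xid in records:
        if kind == "call":
            calls[xid] += 1
        elif kind == "reply":
            replied.add(xid)
    return {xid: count for xid, count in calls.items() if xid not in replied}


# Hypothetical records: XID 0x1a is answered, XID 0x1b is sent three times
# and never answered.
sample_records = [
    ("call", 0x1A),
    ("reply", 0x1A),
    ("call", 0x1B),
    ("call", 0x1B),
    ("call", 0x1B),
]

for xid, count in unanswered_calls(sample_records).items():
    print(f"xid {xid:#x}: sent {count} times, no reply")
```

An XID with a count above one and no reply points at the server not
responding; no outstanding calls at all at the time of the hang would
point at the client failing to transmit.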

> Things to look for: normally, idle nfsd and nfsiod processes have a WCHAN
> of "-" (ps -lax), which indicates they're blocked waiting for some event
> to kick them off.  If you see nfsd processes "hung" in another state, it's
> a good sign we've identified a server problem.  In the nfs client
> processes, "nfsrcvlk" typically indicates a process has sent out an RPC
> and is now waiting on a response.

I see the idle '-' wchan here when things go bad IIRC...
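
To make the WCHAN check repeatable, something like the sketch below could
be run over saved `ps -lax` output to list nfsd/nfsiod processes whose
wait channel is not "-". It is only an illustration: the function name and
the sample listing are made up, and it assumes the header contains a column
whose name ends in "WCHAN", that no column is ever empty, and that the
command is the last column.

```python
def busy_nfs_daemons(ps_text, names=("nfsd", "nfsiod")):
    """Return (pid, wchan, command) for NFS daemons not idle on wchan "-"."""
    lines = ps_text.splitlines()
    if not lines:
        return []
    header = lines[0].split()
    # Locate columns by header name rather than by fixed position.
    wchan_col = next((i for i, h in enumerate(header) if h.endswith("WCHAN")), None)
    if wchan_col is None or "PID" not in header:
        raise ValueError("expected PID and WCHAN columns in the ps header")
    pid_col = header.index("PID")
    ncols = len(header)
    hits = []
    for line in lines[1:]:
        # Limit the split so a command containing spaces stays in one field.
        fields = line.split(None, ncols - 1)
        if len(fields) < ncols:
            continue
        command = fields[-1]
        # Handles both "nfsd: server (nfsd)" and "[nfsiod 0]" style names.
        program = command.lstrip("[").split()[0].rstrip(":]")
        if program in names and fields[wchan_col] != "-":
            hits.append((int(fields[pid_col]), fields[wchan_col], command))
    return hits


# Hypothetical listing, not taken from the affected server.
SAMPLE_PS = """\
  UID   PID  PPID CPU PRI NI   VSZ  RSS MWCHAN STAT  TT       TIME COMMAND
    0   310     1   0   4  0  1200  800 -      Is    ??    0:00.01 nfsd: master (nfsd)
    0   312   310   0   4  0  1200  800 -      I     ??    0:02.10 nfsd: server (nfsd)
    0   313   310   0  -4  0  1200  800 ufs    D     ??    0:01.50 nfsd: server (nfsd)
 1001   900   850   0  96  0  2000 1500 -      S     p0    0:00.10 grep nfsd
"""

for pid, wchan, command in busy_nfs_daemons(SAMPLE_PS):
    print(pid, wchan, command)
```

An empty result during a hang would match what I recall seeing, i.e. the
nfsd processes sitting idle on "-" rather than blocked on a lock.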

-Søren