From: Soren Schmidt
Message-Id: <200311101744.hAAHiefC008462@spider.deepcore.dk>
To: Robert Watson
Date: Mon, 10 Nov 2003 18:44:40 +0100 (CET)
cc: current@freebsd.org
Subject: Re: Still getting NFS client locking up

It seems Robert Watson wrote:
> How fast are your systems, speaking of which?  I live in the world of
> 300-500 mhz machines at work, and 300-800 mhz boxes at home.  If you're
> using multi-ghz boxes, that could well be the distinguishing factor
> between our configurations...

Server is a 533MHz VIA C3, clients everything from a 300MHz PII to a 2.6GHz P4.
> Ok, here's the strategy I was planning to take once I could reproduce it:
>
> (1) Attempt to further narrow down responsibility to client/server.  In
>     particular, see if an apparent hang on one client affects the other
>     clients.

For me it's just the server end that fails; I've not seen the client hang.

> (2) Investigate Soren's report that killing and restarting nfsd on the
>     server would clear the hang.

Yups, that works; in fact I have that in my crontab now every minute to
keep NFS from hosing my setup here.
NOTE: I also still need to ifconfig down/up my interfaces on some boxes
or the netstack will freeze (again done every minute in crontab).
However, when NFS locks up it seems totally unrelated, i.e. all other
network traffic works...

> (3) Look at stack traces of involved processes on both the client and
>     server: in particular, look at traces for any client blocked in NFS,
>     any nfsiod processes on the client, and the nfsd processes on the
>     server.  Also look at the wait channels on clients and servers for
>     these processes.  Particularly interested in whether nfsd processes
>     are blocked trying to grab locks.

Ok, will do..

> (4) Look at netstat information for NFS sockets, in particular, if the
>     buffers are full, or not being drained.  In particular, on the server,
>     is the input queue not being drained by nfsd worker threads?

Netstat doesn't seem to give any hints or even useful info here; any
special cmds you want the output from?

> (5) Try backing out src/sys/nfsserver/nfs_serv.c:1.137, which removed
>     another deadlock problem, but did change locking behavior in the NFS
>     server.

No change, already tried.

> (6) Look at packet traces between the client and server with ethereal,
>     which has pretty good NFS decoding.  Is the client retransmitting an
>     RPC to the server and the server just isn't responding, or is the
>     client failing to transmit?  At the point of the hang, what sorts of
>     RPCs are outstanding to the server?
>     In the past, we've seen "apparent
>     hangs" when some or another more obscure unusual error case on the NFS
>     server fails to respond to an RPC, which causes the client to "wait
>     forever".

I can try that easily, I'll get a trace to you later tonight...

> Things to look for: normally, idle nfsd and nfsiod processes have a WCHAN
> of "-" (ps -lax), which indicates they're blocked waiting for some event
> to kick them off.  If you see nfsd processes "hung" in another state, it's
> a good sign we've identified a server problem.  In the nfs client
> processes, "nfsrcvlk" typically indicates a process has sent out an RPC
> and is now waiting on a response.

I see the idle '-' wchan here when things go bad IIRC...

-Søren
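
The WCHAN check described above (idle nfsd/nfsiod show "-" in ps -lax;
anything else is suspicious) could be scripted roughly as follows. This is
only a sketch under assumptions: the function name, the column layout, and
the sample listing are all made up for illustration (the "somelock" wait
channel is a placeholder, not a real kernel name), and the wait-channel
column is assumed to be titled either WCHAN or MWCHAN with COMMAND as the
last column.

```python
import os

# Hypothetical sample of "ps -lax" style output; every value is invented.
SAMPLE = """\
UID PID PPID CPU PRI NI VSZ RSS MWCHAN STAT TT TIME COMMAND
0 412 1 0 4 0 1204 780 - Is ?? 0:00.02 nfsd: master (nfsd)
0 414 412 0 4 0 1164 672 - I ?? 0:01.10 nfsd: server (nfsd)
0 415 412 0 -4 0 1164 672 somelock D ?? 0:00.90 nfsd: server (nfsd)
1001 733 720 0 4 0 1320 900 nfsrcvlk D p0 0:00.01 ls /mnt
"""


def find_busy_nfs_processes(ps_output, names=("nfsd", "nfsiod")):
    """Return (pid, wchan, command) for NFS daemons whose wait channel is not '-'."""
    lines = [ln for ln in ps_output.splitlines() if ln.strip()]
    if not lines:
        return []

    header = lines[0].split()
    # Locate the wait-channel column by its title rather than by position.
    wchan_col = None
    for i, title in enumerate(header):
        if title in ("WCHAN", "MWCHAN"):
            wchan_col = i
            break
    if wchan_col is None:
        raise ValueError("no WCHAN/MWCHAN column in ps output")
    pid_col = header.index("PID")
    cmd_col = len(header) - 1  # assumption: COMMAND is the last column

    busy = []
    for line in lines[1:]:
        # Split only up to COMMAND so the command keeps its embedded spaces.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # short or malformed line, skip it
        command = fields[cmd_col].strip()
        # Reduce "nfsd:", "[nfsiod" or "/usr/sbin/nfsd" to a bare program name.
        program = os.path.basename(command.split()[0].strip("[]:"))
        if program in names and fields[wchan_col] != "-":
            busy.append((int(fields[pid_col]), fields[wchan_col], command))
    return busy


if __name__ == "__main__":
    for pid, wchan, command in find_busy_nfs_processes(SAMPLE):
        print(pid, wchan, command)
```

On a live box the text would come from the real command instead of SAMPLE,
for instance subprocess.run(["ps", "-lax"], capture_output=True,
text=True).stdout. An empty result while NFS is hung would match the
"idle '-' wchan" observation above, i.e. the nfsd processes are not visibly
blocked on anything.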