Date: Sun, 27 Aug 2006 22:28:04 +0200
From: Greg Byshenk
To: Michael Abbott
Cc: freebsd-stable@freebsd.org
Subject: Re: NFS locking: lockf freezes (rpc.lockd problem?)
Message-ID: <20060827202803.GP633@core.byshenk.net>
In-Reply-To: <20060827183903.G52383@saturn.araneidae.co.uk>

On Sun, Aug 27, 2006 at 07:17:34PM +0000, Michael Abbott wrote:
> On Sun, 27 Aug 2006, Kostik Belousov wrote:
> >Make sure that rpc.statd is running.
> Yep.  Took me some while to figure that one out, but the first lockf test
> failed without that.

[...]

> As for the other test, let's have a look.  Here we are before the test
> (NFS server, 4.11, is saturn, test machine, 6.1, is venus):
> saturn$ ps auxww | grep rpc\\.
> root 48917 0.0 0.1 980 640 ?? Is 7:56am 0:00.01 rpc.lockd
> root 115 0.0 0.1 263096 536 ?? Is 18Aug06 0:00.00 rpc.statd

[...]

> Well, how odd: as soon as I start the test process 515 on venus goes away.
> Now to wait for it to fail... (doesn't take too long):

[...]

> In conclusion: I agree with Greg Byshenk that the NFS server is bound to
> be the one at fault, BUT, is this "freeze until reboot" behaviour really
> what we want?  I remain astonished (and irritated) that `kill -9` doesn't
> work!

The problem here is that the process is waiting for something, and
thus not listening to signals (including your 'kill').

I'm not an expert on this, but my first guess would be that saturn (your
server) is offering something that it can't deliver.  That is, the client
asks the server "can you do X?", and the server says "yes I can", so the
client says "do X" and waits -- and the server never does it.

Or alternatively (based on your rpc.statd dying), rpc.lockd on your
client is trying to use rpc.statd to communicate with your server.
It starts successfully, but then rpc.statd dies (for some reason) and
your lock ends up waiting forever for an answer.

I would recommend starting both rpc.lockd and rpc.statd with the '-d'
flag, to see if this provides any information as to what is going on.
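While watching the debug output, it may help to have a lock test that cannot freeze your shell. What follows is only a rough sketch in Python, not anything from the lockf test itself: the name probe_lock and its parameters are made up for illustration. It forks a child that asks for an exclusive lock, and the parent gives up after a timeout instead of waiting forever. If the child really is stuck in an uninterruptible wait, the SIGKILL at the end will not remove it, which is exactly the behaviour described above.

```python
import fcntl
import os
import signal
import time


def probe_lock(path, timeout=5.0, poll=0.05):
    """Report whether an exclusive lock on `path` is granted within `timeout` seconds.

    Returns "granted", "error", or "timeout". Hypothetical helper, POSIX only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: request the lock, then exit at once; the exit status carries the result.
        status = 1
        try:
            with open(path, "a") as handle:
                fcntl.lockf(handle, fcntl.LOCK_EX)  # blocks until the lock is granted
                fcntl.lockf(handle, fcntl.LOCK_UN)
            status = 0
        finally:
            os._exit(status)

    # Parent: poll for the child instead of blocking on it.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        finished, wait_status = os.waitpid(pid, os.WNOHANG)
        if finished == pid:
            ok = os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
            return "granted" if ok else "error"
        time.sleep(poll)

    # Best effort only: a child stuck in an uninterruptible wait will not die here.
    os.kill(pid, signal.SIGKILL)
    return "timeout"
```

Calling probe_lock() on a file under the NFS mount (the path is yours to choose) and getting "timeout" back would suggest that the lock request is never being answered, without tying up the terminal you are working in.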

There may well be a bug somewhere, but you need to find where it is.
I suspect that it is not actually in rpc.statd, as nothing in the
source has changed since January 2005.

An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
and then try again.

--
greg byshenk  -  gbyshenk@byshenk.net  -  Leiden, NL