From: Robert Watson
To: Dan Nelson
Cc: freebsd-current@freebsd.org, Eric van Gyzen
Date: Thu, 15 Jan 2004 18:36:06 -0500 (EST)
Subject: Re: rpc.lockd resource starvation

On Thu, 15 Jan 2004, Dan Nelson wrote:

> I think you just told me why my two busiest NFS servers had to be
> rebooted a few months ago (one with 440 days of uptime :( ). Does the
> mount fail with "mount: Can't assign requested address"? If so, it also
> happens on 4.x servers. Currently, they have 214 and 109 open reserved
> ports (after 102 and 73 days uptime, respectively), and I'm betting
> there are no more than 5 files actually locked on either system. I
> wonder if it's just not closing sockets when it's done with them?

There are a number of "known bugs/features" in rpc.lockd, but I have to
say that this one is new to me. The issues I know about are:

(1) There appear to be problems relating to rpc.lockd and/or rpc.statd
    following client reboots. I've experienced problems between a
    Solaris file server and a FreeBSD NFSv3 client using locking wherein
    a client crash/reboot doesn't release the locks. It could be our
    rpc.statd simply doesn't work...?

(2) There is a known problem involving aborted lock requests --
    currently, PCATCH is disabled in the kernel tsleep() in the client,
    because there's no way to signal to the userspace rpc.lockd that a
    lock "wasn't wanted after all". If you add PCATCH back, every time
    you abort a lock request with a signal you leak a lock. The
    kernel/userspace protocol needs to be expanded a bit so that the
    abort can be sent to userspace, and userspace then needs to know
    what to do about it.

(3) There seems to be a general failure tolerance issue associated with
    situations when rpc.lockd gets back a lock acknowledgement for a
    lock it didn't request. For safety, it should really release the
    lock, which would mask (1) and sometimes (2).

(4) There seem to be some issues with waking up processes waiting on
    lock requests when the lock arrives. I sent an e-mail about this a
    while back, and should dig it up along with my lock testing
    scenarios and document this better.

(5) I think there's also a problem with leaking locks when an
    application requests the lock using O_NONBLOCK; the request is sent
    out, but bad things happen if the lock is granted.

(6) I believe there was also some problem relating to a series of
    processes waiting for the same lock on the same client, and not all
    of them eventually getting the lock.

I'll dig through my past e-mail and see if I can't dig up the details.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Senior Research Scientist, McAfee Research
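
To make the interaction between items (2) and (3) concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not the rpc.lockd code: the names LockId and PendingLocks, the fields used to identify a lock, and the in-memory sets are all hypothetical. It shows the behaviour item (3) asks for: a grant that matches no outstanding request (for example because the waiter was interrupted, as in item (2)) is released instead of being silently kept.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LockId:
    """Hypothetical key for one lock request (not the NLM wire format)."""
    client_pid: int
    file_handle: str
    offset: int
    length: int


@dataclass
class PendingLocks:
    """Toy model of a client-side table of lock requests awaiting a reply."""
    waiting: set = field(default_factory=set)
    held: set = field(default_factory=set)
    released: list = field(default_factory=list)  # unlocks sent back to the server

    def request(self, lock):
        # A process asked for the lock and is now sleeping on the reply.
        self.waiting.add(lock)

    def abort(self, lock):
        # The waiter was interrupted by a signal: forget the request locally.
        self.waiting.discard(lock)

    def on_grant(self, lock):
        """Handle a grant from the server; return True if the lock is kept."""
        if lock in self.waiting:
            self.waiting.remove(lock)
            self.held.add(lock)
            return True
        # Nobody is waiting for this lock: release it rather than leak it.
        self.released.append(lock)
        return False
```

In this model, calling request() and then abort() for the same LockId, followed by on_grant(), returns False and records the lock in released, so the server-side lock does not outlive the interrupted request. A real fix would still need the kernel/userspace protocol change described in item (2) so that userspace learns about the abort in the first place.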