From owner-freebsd-current@FreeBSD.ORG Wed May 14 14:37:45 2003
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 8ED0137B401;
	Wed, 14 May 2003 14:37:45 -0700 (PDT)
Received: from fledge.watson.org (fledge.watson.org [204.156.12.50])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 8C99B43F75;
	Wed, 14 May 2003 14:37:40 -0700 (PDT)
	(envelope-from robert@fledge.watson.org)
Received: from fledge.watson.org (localhost [127.0.0.1])
	by fledge.watson.org (8.12.9/8.12.9) with ESMTP id h4ELbROn013577;
	Wed, 14 May 2003 17:37:27 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Received: from localhost (robert@localhost)h4ELbQc6013574;
	Wed, 14 May 2003 17:37:27 -0400 (EDT)
	(envelope-from robert@fledge.watson.org)
Date: Wed, 14 May 2003 17:37:26 -0400 (EDT)
From: Robert Watson
X-Sender: robert@fledge.watson.org
To: Don Lewis
In-Reply-To: <200305140545.h4E5jWM7052038@gw.catspoiler.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: bsder@allcaps.org
cc: alfred@FreeBSD.org
cc: current@FreeBSD.org
Subject: Re: rpc.lockd spinning; much breakage
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
Reply-To: Robert Watson
List-Id: Discussions about the use of FreeBSD-current
X-List-Received-Date: Wed, 14 May 2003 21:37:46 -0000

On Tue, 13 May 2003, Don Lewis wrote:

> On 13 May, Robert Watson wrote:
> > On Tue, 13 May 2003, Don Lewis wrote:
>
> > So that change sounds like a winner for that issue.  This leaves the
> > problem of getting EACCES back for locks contended by an NFS client
> > against consumers directly on the server, rather than the client retrying.
>
> Non-blocking, blocking, or both?  What about the case if the lock is
> held by another client?

Here's a table of cases; the rows identify where the source of
contention for the lock is, and the columns the type of lock request:

                        Blocking        Non-blocking
Same client             blocks          EACCES
Different client        blocks          EACCES
Server                  blocks          EACCES

In these tests, I'm running with a vanilla rpc.lockd on the server and
clients, following my earlier commit of the wakeup fix.  With the vanilla
tree as it stands, however, blocking locks are often "lost" unless the
book-keeping patch from Andrew Lentvorski is applied.  With that change,
they appear to get lost less often when the contention is between
processes on the same client for the same lock.

> I don't know if the client will retry in the blocking case or if the
> server side will have to grow the code to poll any local locks that it
> might encounter.
>

Based on earlier experience with the wakeups getting "lost", it sounds
like the re-polling takes place once every ten seconds on the client for
blocking locks.

Speaking of re-polling, here's another bug:

Open two ptys on the NFS client.  On pty1, grab and hold an exclusive
lock on a file; sleep.  On pty2, do a blocking lock attempt on open, but
Ctrl-C the process before the pty1 process wakes up, meaning that the
lock attempt is effectively aborted.  Now kill the first process,
releasing the lock, and attempt to grab the lock on the file: you'll hang
forever.  The client rpc.lockd has left a blocking lock request
registered with the server, but never released that lock for the
now-missing process.
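
The sequence above can be replayed against a toy model.  This is only a
sketch of the suspected failure mode, not rpc.lockd's actual code: the
ToyLockServer class, its FIFO queue of waiters, and the replay() helper
are all hypothetical, and the model assumes that a released lock is
handed to the oldest registered waiter whether or not that process still
exists.

```python
from collections import deque


class ToyLockServer:
    """Hypothetical model: one exclusive lock plus a queue of blocked requests."""

    def __init__(self):
        self.holder = None      # pid currently holding the lock, or None
        self.waiters = deque()  # pids with a blocking request still registered

    def lock(self, pid, blocking):
        if self.holder is None:
            self.holder = pid
            return "granted"
        if not blocking:
            return "EACCES"     # non-blocking request fails immediately
        self.waiters.append(pid)
        return "blocked"

    def unlock(self, pid):
        if self.holder != pid:
            return
        # Assumption: the lock passes to the oldest waiter, alive or not.
        self.holder = self.waiters.popleft() if self.waiters else None


def replay():
    server = ToyLockServer()
    events = [
        server.lock(1107, blocking=False),  # pty1 takes the exclusive lock
        server.lock(1108, blocking=True),   # pty2 blocks, then gets ^C
    ]
    # pid 1108 is gone, but nothing removes its request from the queue.
    server.unlock(1107)                     # pty1 exits, lock is released
    events.append(server.lock(1113, blocking=True))  # the new attempt
    return events, server.holder


print(replay())
```

In this model the final holder is the vanished pid 1108, so the request
from pid 1113 stays blocked with nobody left to release the lock.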

Example pty1:

crash1:/tmp> ./locktest nocreate openexlock nonblock noflock test 10
1107    open(test, 36, 0666)            Wed May 14 17:28:41 2003
1107    open() returns                  Wed May 14 17:28:41 2003
1107    sleep(10)                       Wed May 14 17:28:41 2003
1107    sleep() returns                 Wed May 14 17:28:51 2003

Example pty2:

crash1:/tmp> ./locktest nocreate openexlock block noflock test 0
1108    open(test, 32, 0666)            Wed May 14 17:28:43 2003
^C
crash1:/tmp> ./locktest nocreate openexlock block noflock test 0
1113    open(test, 32, 0666)            Wed May 14 17:30:52 2003

It looks like rpc.statd on the client needs to remember that it requested
the lock, and when it discovers that the process requesting the lock has
evaporated, it should immediately release the lock on its behalf.  It's
not clear to me how that should be accomplished.  Perhaps it should
release the lock when it tries to wake up the process and discovers it is
missing.  Alternatively, if the lock attempt is aborted early due to a
signal, a further message should be sent from the kernel to the userland
rpc.lockd to notify it that the lock instance is no longer of interest.
Note that if we're only using the pid to identify a process, not a pid
and some sort of generation number, there's the potential for pid reuse
and a resulting race.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories
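
The two ideas in the closing paragraph (cancel on abort, and identify the
requester by pid plus a generation number) can be sketched as below.
This is a minimal illustration under stated assumptions, not a proposed
patch: the PendingLocks class, its method names, the integer
"generation" value, and the `live` map of running processes are all
hypothetical and do not correspond to any existing kernel or rpc.lockd
interface.

```python
class PendingLocks:
    """Hypothetical client-side table of outstanding blocking lock requests."""

    def __init__(self):
        self._pending = {}  # (pid, generation) -> name of the file being locked

    def register(self, pid, generation, path):
        self._pending[(pid, generation)] = path

    def cancel(self, pid, generation):
        """Assumed kernel notification: the lock attempt was aborted by a signal."""
        return self._pending.pop((pid, generation), None)

    def on_grant(self, pid, generation, live):
        """The server granted a queued request.

        `live` maps each running pid to its current generation.  Returns
        "wake" if the original requester still exists, else "release".
        """
        path = self._pending.pop((pid, generation), None)
        if path is None:
            return "release"      # nobody is waiting for this grant any more
        if live.get(pid) == generation:
            return "wake"
        return "release"          # pid is missing, or reused by another process


def demo():
    table = PendingLocks()

    # Case 1: the abort is reported, so the request is dropped at once.
    table.register(1108, 1, "test")
    cancelled = table.cancel(1108, 1)

    # Case 2: no abort message; pid 1108 was reused (generation 2) before
    # the grant arrived.  A pid-only comparison would wake the wrong process.
    table.register(1108, 1, "test")
    action = table.on_grant(1108, 1, live={1108: 2})
    return cancelled, action


print(demo())
```

Keying on the pair rather than the bare pid is what lets the second case
fall through to "release" instead of waking an unrelated process.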