From owner-freebsd-current@FreeBSD.ORG  Wed May 14 14:37:45 2003
Date: Wed, 14 May 2003 17:37:26 -0400 (EDT)
From: Robert Watson <rwatson@FreeBSD.org>
X-Sender: robert@fledge.watson.org
To: Don Lewis <truckman@FreeBSD.org>
In-Reply-To: <200305140545.h4E5jWM7052038@gw.catspoiler.org>
Message-ID: <Pine.NEB.3.96L.1030514095118.8018B-100000@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
cc: bsder@allcaps.org
cc: alfred@FreeBSD.org
cc: current@FreeBSD.org
Subject: Re: rpc.lockd spinning; much breakage
Reply-To: Robert Watson <robert@fledge.watson.org>


On Tue, 13 May 2003, Don Lewis wrote:

> On 13 May, Robert Watson wrote:
> > On Tue, 13 May 2003, Don Lewis wrote:
> 
> > So that change sounds like a winner for that issue.  This leaves the
> > problem of getting EACCES back for locks contended by an NFS client
> > against consumers directly on the server, rather than the client retrying. 
> 
> Non-blocking, blocking, or both?  What about the case if the lock is
> held by another client?

Here's a table of cases; the rows identify the source of contention for
the lock, and the columns give the type of lock request:

			Blocking	Non-blocking
Same client		blocks		EACCES
Different client	blocks		EACCES
Server			blocks		EACCES

In these tests, I'm running with a vanilla rpc.lockd on the server and
clients, following my earlier commit of the wakeup fix.  With the vanilla
tree as it stands, however, blocking locks are often "lost" unless the
book-keeping patch from Andrew Lentvorski is applied.  With that patch,
they appear to get lost less often when processes on the same client
contend for the same lock.
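
As an illustration of the two columns in the table above, here is a minimal
local sketch in Python.  It is not the locktest tool used for these tests:
it uses flock() on two descriptors for a temporary file rather than
open() with O_EXLOCK over NFS, and the helper names are made up for the
example.  On a local filesystem the non-blocking failure usually shows up
as EWOULDBLOCK/EAGAIN rather than the EACCES seen over NFS, so the sketch
accepts either.

```python
import errno
import fcntl
import os
import tempfile

# errno values that mean "somebody else holds the lock".
CONTENDED = (errno.EACCES, errno.EAGAIN, errno.EWOULDBLOCK)


def try_lock_nonblocking(fd):
    """Hypothetical helper: True if the exclusive lock was taken,
    False if the lock is currently held elsewhere."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except OSError as e:
        if e.errno in CONTENDED:
            return False
        raise


def demo():
    """Hold a lock on one descriptor and contend for it on another."""
    holder, path = tempfile.mkstemp()
    contender = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(holder, fcntl.LOCK_EX)            # first party holds the lock
        while_held = try_lock_nonblocking(contender)  # non-blocking attempt fails
        fcntl.flock(holder, fcntl.LOCK_UN)            # holder releases
        after_release = try_lock_nonblocking(contender)
        return while_held, after_release
    finally:
        os.close(holder)
        os.close(contender)
        os.unlink(path)
```

A blocking request would simply omit LOCK_NB and sleep in flock() until the
holder releases, which is the "blocks" column.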

> I don't know if the client will retry in the blocking case or if the
> server side will have to grow the code to poll any local locks that it
> might encounter.
>

Based on earlier experience with the wakeups getting "lost", it sounds
like the re-polling takes place once every ten seconds on the client for
blocking locks.
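
The re-polling behaviour described above can be sketched as a loop that
retries a non-blocking request at a fixed interval.  This only illustrates
the idea and is not how rpc.lockd is implemented; the function name, the
`max_attempts` parameter, and the use of flock() are assumptions made for
the example, and the ten-second default merely mirrors the interval
observed above.

```python
import errno
import fcntl
import os
import tempfile
import time


def lock_with_polling(fd, interval=10.0, max_attempts=None):
    """Emulate a blocking lock by re-polling a non-blocking one.

    Returns the number of attempts needed.  Raises TimeoutError if
    max_attempts is set and the lock is still contended after that many tries.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return attempts
        except OSError as e:
            # Anything other than "lock is held elsewhere" is a real error.
            if e.errno not in (errno.EACCES, errno.EAGAIN, errno.EWOULDBLOCK):
                raise
        if max_attempts is not None and attempts >= max_attempts:
            raise TimeoutError("lock still contended after %d attempts" % attempts)
        time.sleep(interval)  # wait before polling again
```

With a design like this, a missed wakeup costs at most one polling interval,
which would match the delays seen when wakeups were getting "lost".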

Speaking of re-polling, here's another bug:  Open two ptys on the NFS
client.  On pty1, grab and hold an exclusive lock on a file; sleep.  On
pty2, do a blocking lock attempt on open, but Ctrl-C the process before
the pty1 process wakes up, meaning that the lock attempt is effectively
aborted.  Now kill the first process, releasing the lock, and attempt to
grab the lock on the file again: you'll hang forever.  The client
rpc.lockd has left a blocking lock request registered with the server, but
never released that lock for the now missing process.

Example pty1:

crash1:/tmp> ./locktest nocreate openexlock nonblock noflock test 10
1107  open(test, 36, 0666)              Wed May 14 17:28:41 2003
1107  open() returns                    Wed May 14 17:28:41 2003
1107  sleep(10)                         Wed May 14 17:28:41 2003
1107  sleep() returns                   Wed May 14 17:28:51 2003

Example pty2:

crash1:/tmp> ./locktest nocreate openexlock block noflock test 0
1108  open(test, 32, 0666)              Wed May 14 17:28:43 2003
^C

crash1:/tmp> ./locktest nocreate openexlock block noflock test 0
1113  open(test, 32, 0666)              Wed May 14 17:30:52 2003
<hang>

It looks like rpc.lockd on the client needs to remember that it requested
the lock, and when it discovers that the process requesting the lock has
evaporated, it should immediately release the lock on its behalf.  It's
not clear to me how that should be accomplished: perhaps when it tries to
wake up the process and discovers it is missing, it should release the
lock then; or, if the lock attempt is aborted early due to a signal, the
kernel should send a further message to the userland rpc.lockd to notify
it that the lock instance is no longer of interest.  Note that if we're
only using the pid to identify a process, rather than a pid plus some sort
of generation number, there's the potential for pid reuse and a resulting
race.
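
To make the failure mode and the proposed fix concrete, here is a toy model
in Python.  It is purely illustrative: it is not rpc.lockd's data structure
or protocol, and the `Owner`, `LockTable`, and `replay` names, as well as
the `generation` field, are hypothetical.  It models one lock with a queue
of pending blocking requests, and replays the pty1/pty2 scenario with and
without a cancel step for the interrupted request.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Owner:
    pid: int
    generation: int  # hypothetical counter distinguishing reuses of a pid


class LockTable:
    """Toy model of one lock with a FIFO queue of blocked requests."""

    def __init__(self):
        self.holder = None
        self.waiters = deque()

    def request(self, owner):
        """Blocking request: True if granted now, False if queued."""
        if self.holder is None:
            self.holder = owner
            return True
        self.waiters.append(owner)
        return False

    def release(self, owner):
        if self.holder != owner:
            raise ValueError("not the holder")
        # Hand the lock to the next queued request, if any.
        self.holder = self.waiters.popleft() if self.waiters else None

    def cancel(self, owner):
        """Forget an owner that has gone away, whether queued or holding."""
        if owner in self.waiters:
            self.waiters.remove(owner)
        if self.holder == owner:
            self.release(owner)


def replay(cancel_on_interrupt):
    """Replay the scenario; return True if the final request is granted."""
    table = LockTable()
    first, second, third = Owner(1107, 1), Owner(1108, 1), Owner(1113, 1)
    table.request(first)          # pty1 takes the lock
    table.request(second)         # pty2 blocks, then is interrupted by Ctrl-C
    if cancel_on_interrupt:
        table.cancel(second)      # the proposed notification step
    table.release(first)          # pty1 exits
    return table.request(third)   # new attempt from pty2
```

Without the cancel step the lock is handed to the vanished request and the
final attempt queues behind it forever; with it, the final attempt is
granted.  Because `Owner` compares both fields, a cancel for `Owner(1108, 1)`
would not disturb a later process that happened to reuse pid 1108 with a
different generation.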

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories