Date:      Wed, 14 May 2003 19:21:42 -0700 (PDT)
From:      Don Lewis <truckman@FreeBSD.org>
To:        robert@fledge.watson.org
Cc:        current@FreeBSD.org
Subject:   Re: rpc.lockd spinning; much breakage
Message-ID:  <200305150221.h4F2LgM7054256@gw.catspoiler.org>
In-Reply-To: <Pine.NEB.3.96L.1030514095118.8018B-100000@fledge.watson.org>

On 14 May, Robert Watson wrote:
> 
> On Tue, 13 May 2003, Don Lewis wrote:

>> I don't know if the client will retry in the blocking case or if the
>> server side will have to grow the code to poll any local locks that it
>> might encounter.
> 
> Based on earlier experience with the wakeups getting "lost", it sounds
> like the re-polling takes place once every ten seconds on the client for
> blocking locks.

That makes sense.  It looks like the client side more or less just
tosses the "blocked" response and waits for the grant message to arrive.
I guess it periodically polls while it waits.

> Speaking of re-polling, here's another bug:  Open two pty's on the NFS
> client.  On pty1, grab and hold an exclusive lock on a file; sleep.  On
> pty2, do a blocking lock attempt on open, but Ctrl-C the process before
> the pty1 process wakes up, meaning that the lock attempt is effectively
> aborted.  Now kill the first process, releasing the lock, and attempt to
> grab the lock on the file: you'll hang forever.  The client rpc.lockd has
> left a blocking lock request registered with the server, but never
> released that lock for the now missing process.

> It looks like rpc.lockd on the client needs to remember that it requested
> the lock, and when it discovers that the process requesting the lock has
> evaporated, it should immediately release the lock on its behalf.  It's
> not clear to me how that should be accomplished: perhaps when it tries to
> wake up the process and discovers it is missing, it should do it, or if
> the lock attempt is aborted early due to a signal, a further message
> should be sent from the kernel to the userland rpc.lockd to notify it that
> the lock instance is no longer of interest.  Note that if we're only using
> the pid to identify a process, not a pid and some sort of generation
> number, there's the potential for pid reuse and a resulting race. 

I saw something in the code about a cancel message (nlm4_cancel,
nlm4_cancel_msg).  I think what is supposed to happen is that when
process #2 is killed, the descriptor waiting for the lock will be
closed, which should get rid of its lock request.  rpc.lockd on the
client should notice this and send a cancel message to the server.  When
process #1 releases the lock, the second lock request will no longer be
queued on the server, and process #3 should be able to grab the lock.

This bug could be in the client rpc.lockd, the client kernel, or the
server rpc.lockd.


