Date: Wed, 14 May 2003 19:21:42 -0700 (PDT)
From: Don Lewis <truckman@FreeBSD.org>
To: robert@fledge.watson.org
Cc: current@FreeBSD.org
Subject: Re: rpc.lockd spinning; much breakage
Message-ID: <200305150221.h4F2LgM7054256@gw.catspoiler.org>
In-Reply-To: <Pine.NEB.3.96L.1030514095118.8018B-100000@fledge.watson.org>
On 14 May, Robert Watson wrote:

> On Tue, 13 May 2003, Don Lewis wrote:
>> I don't know if the client will retry in the blocking case or if the
>> server side will have to grow the code to poll any local locks that it
>> might encounter.
>
> Based on earlier experience with the wakeups getting "lost", it sounds
> like the re-polling takes place once every ten seconds on the client for
> blocking locks.

That makes sense.  It looks like the client side more or less just
tosses the "blocked" response and waits for the grant message to
arrive.  I guess it periodically polls while it waits (a sketch of that
loop is at the end of this message).

> Speaking of re-polling, here's another bug: Open two pty's on the NFS
> client.  On pty1, grab and hold an exclusive lock on a file; sleep.  On
> pty2, do a blocking lock attempt on open, but Ctrl-C the process before
> the pty1 process wakes up, meaning that the lock attempt is effectively
> aborted.  Now kill the first process, releasing the lock, and attempt to
> grab the lock on the file: you'll hang forever.  The client rpc.lockd has
> left a blocking lock request registered with the server, but never
> released that lock for the now missing process.
>
> It looks like rpc.lockd on the client needs to remember that it requested
> the lock, and when it discovers that the process requesting the lock has
> evaporated, it should immediately release the lock on its behalf.  It's
> not clear to me how that should be accomplished: perhaps when it tries to
> wake up the process and discovers it is missing, it should do it, or if
> the lock attempt is aborted early due to a signal, a further message
> should be sent from the kernel to the userland rpc.lockd to notify it that
> the lock instance is no longer of interest.  Note that if we're only using
> the pid to identify a process, not a pid and some sort of generation
> number, there's the potential for pid reuse and a resulting race.

I saw something in the code about a cancel message (nlm4_cancel,
nlm4_cancel_msg).  I think what is supposed to happen is that when
process #2 is killed, the descriptor waiting for the lock will be
closed, which should get rid of its lock request.  rpc.lockd on the
client should notice this and send a cancel message to the server (see
the second sketch below).  When process #1 releases the lock, the
second lock request will no longer be queued on the server, and
process #3 should be able to grab the lock.  This bug could be in the
client rpc.lockd, the client kernel, or the server rpc.lockd.
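
Here is a rough sketch of the polling loop I have in mind.  This is not
the actual rpc.lockd code, just an illustration assuming the types, XDR
routines, and procedure numbers that rpcgen generates from nlm_prot.x;
retry_lock() itself and the fixed ten-second interval are made up for
the example:

#include <rpc/rpc.h>
#include <string.h>
#include <unistd.h>

#include "nlm_prot.h"	/* generated by rpcgen from nlm_prot.x */

/*
 * Re-send a blocking NLM4_LOCK request every ten seconds until the
 * server reports something other than "blocked".
 */
static enum nlm4_stats
retry_lock(CLIENT *clnt, nlm4_lockargs *args)
{
	enum nlm4_stats stat;
	nlm4_res res;
	struct timeval tv = { 10, 0 };	/* per-call RPC timeout */

	for (;;) {
		memset(&res, 0, sizeof(res));
		if (clnt_call(clnt, NLM4_LOCK,
		    (xdrproc_t)xdr_nlm4_lockargs, (caddr_t)args,
		    (xdrproc_t)xdr_nlm4_res, (caddr_t)&res, tv) != RPC_SUCCESS)
			return (nlm4_failed);
		stat = res.stat.stat;
		xdr_free((xdrproc_t)xdr_nlm4_res, (caddr_t)&res);
		if (stat != nlm4_blocked)
			return (stat);
		/*
		 * "blocked" means the server has queued the request and
		 * will send a grant message later; the real daemon waits
		 * for that asynchronously, but here we simply re-poll.
		 */
		sleep(10);
	}
}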
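And here is a similar sketch of issuing the cancel.  Again,
cancel_abandoned_lock() is a made-up illustration against the
rpcgen-generated interface, not how rpc.lockd actually structures it,
and it uses the synchronous NLM4_CANCEL for simplicity; the _msg
variants instead expect an asynchronous NLM4_CANCEL_RES callback from
the server:

#include <rpc/rpc.h>
#include <string.h>

#include "nlm_prot.h"	/* generated by rpcgen from nlm_prot.x */

/*
 * Tell the server to drop a queued blocking lock request whose local
 * owner process has gone away.
 */
static enum nlm4_stats
cancel_abandoned_lock(const char *server, nlm4_lock *alock, bool_t exclusive)
{
	CLIENT *clnt;
	nlm4_cancargs args;
	nlm4_res res;
	enum nlm4_stats stat;
	struct timeval tv = { 10, 0 };	/* RPC timeout */

	clnt = clnt_create(server, NLM_PROG, NLM_VERS4, "udp");
	if (clnt == NULL)
		return (nlm4_failed);

	memset(&args, 0, sizeof(args));
	memset(&res, 0, sizeof(res));
	args.cookie = alock->oh;	/* transaction id; reusing the owner
					   handle here is illustrative */
	args.block = TRUE;		/* we had issued a blocking request */
	args.exclusive = exclusive;
	args.alock = *alock;		/* same lock the NLM4_LOCK described */

	if (clnt_call(clnt, NLM4_CANCEL,
	    (xdrproc_t)xdr_nlm4_cancargs, (caddr_t)&args,
	    (xdrproc_t)xdr_nlm4_res, (caddr_t)&res, tv) != RPC_SUCCESS) {
		clnt_destroy(clnt);
		return (nlm4_failed);
	}
	stat = res.stat.stat;
	xdr_free((xdrproc_t)xdr_nlm4_res, (caddr_t)&res);
	clnt_destroy(clnt);
	return (stat);
}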