Date:      Mon, 17 Mar 2003 22:57:17 -0800
From:      Steve Sizemore <steve@ls.berkeley.edu>
To:        Terry Lambert <tlambert2@mindspring.com>
Cc:        current@freebsd.org
Subject:   Re: NFS file unlocking problem
Message-ID:  <20030318065716.GB99408@math.berkeley.edu>
In-Reply-To: <3E768C47.229C1DF0@mindspring.com>
References:  <Pine.LNX.4.44.0303171255310.15683-100000@mail.allcaps.org> <3E768C47.229C1DF0@mindspring.com>

Hi, Terry -

On Mon, Mar 17, 2003 at 07:02:31PM -0800, Terry Lambert wrote:
> "Andrew P. Lentvorski, Jr." wrote:
> > being sent is SETLKW which is a blocking wait until lock is granted.  If
> > the server thinks the file is already locked, it will hang *and* that is
> > the proper behavior.
> 
> It is, to ensure FIFO ordering of request grants.  You could also
> implement this as a retry.
> 
> If you do it the first way, you end up potentially deadlocking the
> server when a single client has badly behaved code that locks against
> itself.  If you do it the second way, you end up with timing dependent
> starvation deadlocks for individual client processes.  Note that the
> first deadlock is normal -- it would happen if the file were local, as
> well... no help for badly written code -- but I mention it as important
> because we are talking about blocking multiple clients.
> 
> I don't know what the process is, but a threaded process can cause
> a deadlock when it should be a grant/upgrade/downgrade of an existing
> lock overlap.  This is because there is no such thing as a thread ID
> in the NFS protocol, and if process IDs are different for different
> threads, and the requests come from the same system ID, then you can
> get a deadlock when none should be present.  To avoid this, either
> manage all locks in an "apartment" or "rental" model (queue all requests
> to a single thread, and have it do the locking by proxy) OR make sure
> that all requests from any thread in a given process in fact are given
> the same proxy process ID on the wire.
> 
> [ ... This last is not likely your problem, but I mention it, in case
>       you are using rfork() or Linux threads ... ]

Thanks for the explanation. If I were a programmer, it would be very
useful. As it is, it's still interesting. I have no way of judging the
quality of the code in question, other than the empirical result that
it works in most cases.

> 
> > What is the result of running this locally on the NFS server and
> > attempting to lock the underlying file?  If rpc.lockd is hanging onto a
> > lock, running that perl script locally on the actual file (not an NFS
> > mounted image of it) should also hang.
> 
> That was my next question, as well: does it happen on a local FS
> as well as an NFS FS?  Personally, I would *NOT* recommend running
> it on the server, but mount a local FS on the client instead; the
> fewer variables, the better.

Works fine on the "client" on a local file system. Works fine on the
server.

> On the other hand, this is clearly a deadlock that requires an
> existing, conflicting lock -- IFF you are correct about the
> delayed locking behaviour.

Not sure I understand this.

> 
> > As a side note, you probably want to create a C executable to do this kind
> > of fcntl fiddling when attempting to test NFS.  That way you can use a
> > locally mounted binary and you won't wind up with all of the Perl access
> > calls on the NFS wire.  Or, at least, use a local copy of Perl.
> 
> I recommend a pared down test case.  I suspect that the problem is
> that something that is expected to have the same ID is locking
> against itself.

I can't pare it down any further using perl. If someone better at C
than I am gives me a sample C program, I'll be happy to try it.
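
A minimal C sketch of the kind of test being asked for here, under stated assumptions: it takes an exclusive whole-file lock with fcntl() and then releases it, printing progress so that a hang can be attributed to the lock or the unlock step. The function names (whole_file_lock, lock_then_unlock) and the WITH_MAIN switch are made up for illustration, and this is not claimed to reproduce exactly what the perl script or Xinet does on the wire.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Apply one fcntl() lock operation to the whole file.
 * cmd is F_SETLK (non-blocking) or F_SETLKW (blocking);
 * type is F_WRLCK, F_RDLCK or F_UNLCK. */
static int whole_file_lock(int fd, int cmd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "through end of file" */
    return fcntl(fd, cmd, &fl);
}

/* Lock the file exclusively, then unlock it.
 * Returns 0 on success, -1 on failure. */
int lock_then_unlock(const char *path, int blocking)
{
    int cmd = blocking ? F_SETLKW : F_SETLK;
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd < 0) {
        perror("open");
        return -1;
    }
    fprintf(stderr, "requesting lock on %s\n", path);
    if (whole_file_lock(fd, cmd, F_WRLCK) < 0) {
        perror("lock");
        close(fd);
        return -1;
    }
    fprintf(stderr, "lock granted; requesting unlock\n");
    if (whole_file_lock(fd, cmd, F_UNLCK) < 0) {
        perror("unlock");
        close(fd);
        return -1;
    }
    fprintf(stderr, "unlocked\n");
    return close(fd);
}

#ifdef WITH_MAIN
/* Usage: locktest file [nonblock]   (any second argument selects F_SETLK) */
int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s file [nonblock]\n", argv[0]);
        return 2;
    }
    return lock_then_unlock(argv[1], argc < 3) == 0 ? 0 : 1;
}
#endif
```

Built with something like "cc -DWITH_MAIN -o locktest locktest.c" and kept on a local disk, the binary can be pointed at a file on the NFS mount. If the last line printed is "lock granted; requesting unlock", the unlock request is the one that never returns.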

> Does the failure occur with the same values in all cases in the
> F_RSETLKW?  If so, I suggest you capture *all* locking packets on
> your wire, and then find who is conflicting.  This may be a simple
> lock order reversal (deadly embrace deadlock) due to poor application
> performance.  You may also find that you have multiple process IDs,
> when it should be a single process ID, for the proxy PID for the
> conflicting request.  At worst, it would be nice to know the system
> that caused it.
>
> Actually, for a lock you know is there, you *can* diagnose the
> problem (somewhat) by writing a program on the server, and using
> F_GETLK on the range for the hanging lock on the server -- this
> will return a struct flock, which will give you range and PID
> information.  Do it on the Solaris box, though.
> 
> The reason you want to do this on the Solaris box is that the
> struct flock on FreeBSD fails to include the l_rsysid -- the
> remote system ID.

Sorry, but I don't understand any of that.
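
For readers in the same position, the F_GETLK suggestion quoted above can be sketched roughly as follows. This is only an illustration: find_conflicting_lock and WITH_MAIN are hypothetical names, and only the portable struct flock fields (l_type, l_start, l_len, l_pid) are used, so the remote system ID mentioned above is not shown. F_GETLK asks whether a given lock could be placed; if not, it fills in a description of one lock that is in the way.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Ask whether a write lock on [start, start+len) would be blocked.
 * len == 0 means "through end of file".
 * Returns 1 if a conflicting lock exists (described in *out),
 * 0 if none exists, -1 on error. */
int find_conflicting_lock(const char *path, off_t start, off_t len,
                          struct flock *out)
{
    int rc, saved;
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;
    memset(out, 0, sizeof *out);
    out->l_type = F_WRLCK;      /* the lock we pretend to want */
    out->l_whence = SEEK_SET;
    out->l_start = start;
    out->l_len = len;
    rc = fcntl(fd, F_GETLK, out);
    saved = errno;              /* keep fcntl's errno across close() */
    close(fd);
    errno = saved;
    if (rc < 0)
        return -1;
    /* The kernel sets l_type to F_UNLCK when nothing conflicts. */
    return out->l_type == F_UNLCK ? 0 : 1;
}

#ifdef WITH_MAIN
/* Usage: getlk file [start [len]] */
int main(int argc, char **argv)
{
    struct flock fl;
    off_t start, len;
    int rc;

    if (argc < 2) {
        fprintf(stderr, "usage: %s file [start [len]]\n", argv[0]);
        return 2;
    }
    start = argc > 2 ? (off_t)strtoll(argv[2], NULL, 10) : 0;
    len = argc > 3 ? (off_t)strtoll(argv[3], NULL, 10) : 0;
    rc = find_conflicting_lock(argv[1], start, len, &fl);
    if (rc < 0) {
        perror(argv[1]);
        return 1;
    }
    if (rc == 0) {
        printf("no conflicting lock\n");
        return 0;
    }
    printf("%s lock held by pid %ld: start %lld, len %lld\n",
           fl.l_type == F_WRLCK ? "write" : "read",
           (long)fl.l_pid, (long long)fl.l_start, (long long)fl.l_len);
    return 0;
}
#endif
```

Run on the server against the underlying file while a client is hung, this would be expected to report the PID holding the lock that is in the way. A process never conflicts with its own locks, so the query has to come from a different process than the lock holder.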
 
> Actually, given this, I don't understand how FreeBSD server side
> proxy locking can actually work at all; it would incorrectly
> coalesce locks with local locks when the l_pid matched, which
> would be *all* locks in the lockd, and then incorrectly release
> them when a local process exited, or any process on any remote
> system unlocked an overlapping range (possibly in error).

So you're suggesting that when it works, it's just lucky? But others
have said that it works for them, and it seems to work OK between
FreeBSD systems.


> You are using FreeBSD as the NFS client in this case, right?  If
> so, that's probably not an issue for you...

No.

I think that you may be trying to solve a problem I don't have.
First - I'm not a programmer. I'm not trying to write any program
at all, except as necessary to diagnose this problem. I'll summarize
the situation briefly. The issue cropped up in a commercial program
(Xinet) which was working on Solaris 2.6 client and server. I'm
replacing the server with a FreeBSD box (RELENG_5_0) and the program
stopped working. Xinet tech support diagnosed it as an NFS locking
problem, which I've confirmed by my simple perl program.

	Client		Server		Result
	======		=======		======
	Solaris		Solaris		Works
	FreeBSD		Solaris		Works
	FreeBSD		FreeBSD		Works
	Solaris		FreeBSD		Problems

Actually, when I say "works", all I know is that it doesn't hang.
Whether or not the lock is actually effective, I haven't tested.
Oh, and the nonblocking flock also hangs, just like the blocking one.
The lock call returns; the unlock call doesn't.

Thanks.
Steve
-- 
Steve Sizemore <steve@ls.berkeley.edu>, (510) 642-8570
Unix System Manager
    Dept. of Mathematics and College of Letters and Science
    University of California, Berkeley

To Unsubscribe: send mail to majordomo@FreeBSD.org
with "unsubscribe freebsd-current" in the body of the message



