Date: Mon, 28 Aug 2006 09:48:48 +0000 (GMT)
From: Michael Abbott <michael@araneidae.co.uk>
To: Greg Byshenk <gbyshenk@byshenk.net>
Cc: Kostik Belousov <kostikbel@gmail.com>, freebsd-stable@freebsd.org
Subject: Re: NFS locking: lockf freezes (rpc.lockd problem?)
Message-ID: <20060828082047.J61842@saturn.araneidae.co.uk>
In-Reply-To: <20060827202803.GP633@core.byshenk.net>
References: <20060827102135.B49194@saturn.araneidae.co.uk> <20060827135434.GH79046@deviant.kiev.zoral.com.ua> <20060827183903.G52383@saturn.araneidae.co.uk> <20060827202803.GP633@core.byshenk.net>
Well, the result I have to report is so interesting, I'll give the
executive summary right away:

    If rpc.lockd is run with -d255 it works perfectly.
    If rpc.lockd is run with -d1 (or no -d option) it locks up.

Sweet. Does anyone out there who understands rpc.lockd fancy a deeper
look?

On Sun, 27 Aug 2006, Greg Byshenk wrote:

> The problem here is that the process is waiting for something, and
> thus not listening to signals (including your 'kill').
>
> I'm not an expert on this, but my first guess would be that saturn (your
> server) is offering something that it can't deliver.  That is, the client
> asks the server "can you do X?", and the server says "yes I can", so the
> client says "do X" and waits -- and the server never does it.

I buy that analysis. However, I think that the client's (venus)
behaviour is unreasonable, and there must be some problem at the
driver level: unkillable processes? (Tries to bang drum about this!)

Interesting: it looks like `nfslocking stop` releases the processes.

> Or alternatively (based on your rpc.statd dying), rpc.lockd on your
> client is trying to use rpc.statd to communicate with your server.  And
> it starts successfully, but then rpc.statd dies (for some reason) and
> your lock ends up waiting forever for it to answer.

Not quite: it was the first instance of rpc.lockd that went away,
almost as if it was just waiting for something to happen! However, it
doesn't do this on the simple one-line test, so yes, I think there's
something here to investigate. Definitely: after running the test with
no failures (see below), both lockd instances are still up and about.

> I would recommend starting both rpc.lockd and rpc.statd with the '-d'
> flag, to see if this provides any information as to what is going on.
> There may well be a bug somewhere, but you need to find where it is.
> I suspect that it is not actually in rpc.statd, as nothing in the
> source has changed since January 2005.

Ok, I'll try that. I'll try -d1 and see what I get.

    venus# /etc/rc.d/nfslocking stop
    venus# /etc/rc.d/nfslocking start

Oddly, running restart only restarts statd, not lockd. Strange naming
convention for the variables, too -- I have to put the following in
rc.conf:

    rpc_lockd_enable=YES
    lockd_flags=-d1
    rpc_statd_enable=YES
    statd_flags=-d

Hmm. Not terribly consistent.

Ok, let's see what a single lockf exchange looks like:

    venus# mount saturn:$EXPORT /mnt
    venus# lockf /mnt/test ls /mnt
    test
    venus# tail -n3 /var/log/debug.log
    Aug 28 08:52:44 venus rpc.statd: unmon_all for host: NFS NLM prog: 0 ver: 0 proc: 0
    Aug 28 08:54:19 venus rpc.lockd: nlm_lock_res from saturn.araneidae.co.uk
    Aug 28 08:54:19 venus rpc.lockd: nlm_unlock_res from saturn.araneidae.co.uk
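(A side thought, not something I've actually run yet: it might be worth
checking what saturn registers with its portmapper, to test Greg's
"offers something it can't deliver" theory. A sketch only -- I'm
assuming rpcinfo behaves as its man page says, and the service names
nlockmgr/status are the ones from /etc/rpc:

    venus# rpcinfo -p saturn            # list every RPC service saturn advertises
    venus# rpcinfo -u saturn nlockmgr   # null-call the NFS lock manager over UDP
    venus# rpcinfo -u saturn status     # null-call rpc.statd

If nlockmgr or status were missing from the listing, or the null calls
timed out, that would point the finger firmly at the server side.)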
Good -- a clean lock/unlock pair. Now let's run the test:

    venus# cd /usr/src; make installworld DESTDIR=/mnt

and, at the same time, in another shell:

    venus# tail -f /var/log/debug.log

Well, that's odd. I get five nlm_lock_res/nlm_unlock_res pairs, with
the last three in less than a second... and then nothing: the last,
blocking, lockf doesn't generate any messages at all!

Interesting: stopping the lock daemon, by running
`/etc/rc.d/nfslocking stop`, releases the lock!

Good. Now I can run the test again with more logging (I'll set
lockd_flags=-d255, though a quick grep of the source suggests that 6
would suffice).

Hmmm. It's working perfectly now! Well well well. What are the
differences?

1. I'm rerunning the test after restarting the lock and stat daemons
   without an intervening reboot.
2. I'm running lockd with maximum debugging.
3. I'm running the test in a different virtual console (I think we can
   ignore that difference!)

Fantastic: it ran to completion without a fault! Wow. I'll reboot
(keeping the same debug level) and try again...

Astounding! Watch carefully:

1. Reboot.
2. Log in on virtual consoles 1 & 2.
3. On console 2 run

       venus# tail -f /var/log/debug.log

4. On console 1 run

       venus# mount saturn:$EXPORT /mnt
       venus# rm -rf /mnt/*
       venus# cd /usr/src; make installworld DESTDIR=/mnt

5. Switch to console 2 and watch the console and the PC's activity
   lights. Lots and lots of network activity (well, there's a
   surprise), and the occasional flurry of rpc.lockd messages.
6. When the lights stop flashing (five minutes or so), switch back to
   console 1 and see that everything ran perfectly.

Runs ok. Well. I can think of two possibilities:

1. High levels of logging do more than just logging: there's an
   inadvertent side effect.
2. There's a tight timing issue that is changed by the extra delays
   introduced by logging.

On the whole I buy 2, and it's not altogether clear whether the issue
is on the client or the server. Hmm. Maybe I need to try Kostik
Belousov's suggestion of running tcpdump. Another message for that...

> An alternative would be to update to RELENG_6 (or at least RELENG_6_1)
> and then try again.

This machine is so painfully slow that I'll probably have to do that
overnight, and then I'm out of time until next weekend. I've just
installed cvsup (from ports; oops, my mistake, forgot about having to
build ezm3!).
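(For the record, the supfile I intend to feed cvsup is roughly the
standard stable-supfile -- a sketch only, and the mirror host and the
base/prefix paths are assumptions to be adjusted locally:

    # Track RELENG_6 sources with cvsup
    *default host=cvsup.uk.FreeBSD.org
    *default base=/var/db
    *default prefix=/usr
    *default release=cvs tag=RELENG_6
    *default delete use-rel-suffix
    *default compress
    src-all

then `cvsup -g -L 2 <supfile>` and the usual buildworld/installworld
cycle overnight.)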
