Date: Thu, 9 Mar 2006 00:26:44 GMT
From: Miguel Lopes Santos Ramos <miguel@anjos.strangled.net>
To: kris@obsecurity.org
Cc: kuriyama@imgsrc.co.jp, freebsd-stable@freebsd.org
Subject: Re: rpc.lockd brokenness (2)
Message-ID: <200603090026.k290Qihj002701@compaq.anjos.strangled.net>
In-Reply-To: <20060308224531.GA53611@xor.obsecurity.org>
> From: Kris Kennaway <kris@obsecurity.org>
> Subject: Re: rpc.lockd brokenness (2)
>
> This is intentional. It's how pidfile_*() tests whether the process
> is still running. The intention is that if someone tries to open the
> pidfile again while the first process is still running, the lock
> acquisition will fail and we'll know the other process is still alive,
> and therefore avoid starting a second instance.

No, no, you got me wrong. The pidfile is left locked after cron has stopped
running (with /etc/rc.d/cron stop). This behaviour must be wrong.

> Your main problem seems to be that you're mounting the same /var via
> NFS from multiple client machines. This is basically a bad idea to
> begin with because /var expects to be private to each machine (even if
> locking worked as expected, you'd not be able to start cron on more
> than one machine because it would fail as above). Even if you solved
> this there would be other similar problems.

No, it's the whole filesystem tree for a single client; no one else uses
those files. The fact that I hung a third machine was an accident: I was
testing whether cron.pid was still locked and I thought I had a window on
the server...

My single problem is locking. Actually, it worked well before I upgraded
this system to 6-STABLE. It's just for one laptop whose disk I don't want
to partition.

> In fact the diskless boot infrastructure in /etc will set up and use a
> md /var for this purpose.

Actually, they don't advise using an md /var, only /etc. Anyway, I don't
use that, because it's my only diskless machine. I have a single NFS
mounted / and an md /tmp. Nothing is shared with anyone else, not even
/usr, because it's my only amd64.

> There is a (known) lockd bug here though, which you isolated:
>

So, this really is bin/80389? If so, I can tell Jun Kuriyama that his
patch didn't change it.

> > With /var/run/cron.pid still locked, on the first client, single-user, same
> > initialization sequence
> > # lockf -k -t 1 /var/run/cron.pid echo ok
> > Hangs... always.
>
> which is that lock requests through rpc.lockd cannot be cancelled, so
> they'll hang until the operation succeeds or fails. In this case
> lockf does a blocking lock request and expects to cancel it with a
> signal after the timer expires, but rpc.lockd doesn't know how to back
> out lock requests so it just hangs forever or until something else
> unlocks the file on the server.
>
> Kris

I am a bit disappointed. First, this problem didn't cause me trouble
before I went to 6-STABLE; now I must either disable cron or disable
locking (which I can't). And I'm still not completely convinced. That
problem, if I understand correctly, existed before January...

There are three things:

- cron.pid shouldn't be locked after cron has terminated.
  (this interaction was fully saved as
  http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin)
- cron shouldn't hang on startup just because the file is locked, since
  pidfile_open opens it with O_NONBLOCK (unlike lockf).
- cron shouldn't hang in such a way that it is not killable...
  (and shouldn't the open system call in lockf also be interruptible?)

So, I'm led to believe that beyond that issue with rpc.lockd, which, I
understand, is an unresolved problem, there is now another problem,
perhaps with pidfile.c...

Thank you for all your time on this issue. I'm still going to try to chase
it, although I only have the knowledge to find it if it is in pidfile.c or
in cron. I understand too little of the interaction between the kernel and
the rest of NFS to chase it if it is somewhere else.

Miguel
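
For readers following the thread, the difference between the two kinds of lock request discussed above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses flock(2) through Python's fcntl module on a local file, so it shows the behaviour one would expect, not what happens over NFS with rpc.lockd. The function names try_lock and lock_with_timeout are hypothetical and are not taken from pidfile.c or lockf.

```python
import fcntl
import os
import signal
import tempfile


def try_lock(path):
    """Non-blocking attempt, in the spirit of an O_NONBLOCK lock request.

    Returns an fd that holds the lock, or None if someone else holds it.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def lock_with_timeout(path, seconds):
    """Blocking request that a timer signal is expected to interrupt.

    Returns an fd that holds the lock, or None if the timer fired first.
    Must be called from the main thread (signal handlers require it).
    """
    def on_alarm(signum, frame):
        raise TimeoutError

    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until granted or interrupted
    except TimeoutError:
        os.close(fd)
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
    return fd


if __name__ == "__main__":
    # Hypothetical demo file; stands in for something like a pidfile.
    path = os.path.join(tempfile.mkdtemp(), "demo.pid")
    holder = try_lock(path)
    print("first attempt got the lock:", holder is not None)
    print("second non-blocking attempt:", try_lock(path))
    print("blocking attempt with timer:", lock_with_timeout(path, 0.2))
    os.close(holder)  # closing the only descriptor releases the lock
    print("after release:", try_lock(path) is not None)
```

On a local filesystem the non-blocking attempt returns at once, the blocking attempt gives up when the timer fires, and the lock disappears as soon as the holder closes its descriptor. Those are the three expectations listed in the message above; the report is that none of them held on the NFS-mounted /var.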
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200603090026.k290Qihj002701>