From owner-freebsd-stable@FreeBSD.ORG Wed Mar 8 22:45:33 2006
Date: Wed, 8 Mar 2006 17:45:31 -0500
From: Kris Kennaway
To: Miguel Lopes Santos Ramos
Message-ID: <20060308224531.GA53611@xor.obsecurity.org>
References: <20060308005138.GA49684@xor.obsecurity.org> <200603081401.k28E1Obv006775@compaq.anjos.strangled.net>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="VS++wcV0S1rZb1Fb"
Content-Disposition: inline
In-Reply-To: <200603081401.k28E1Obv006775@compaq.anjos.strangled.net>
User-Agent: Mutt/1.4.2.1i
Cc: kuriyama@imgsrc.co.jp, freebsd-stable@freebsd.org, kris@obsecurity.org
Subject: Re: rpc.lockd brokenness (2)
List-Id: Production branch of FreeBSD source code

--VS++wcV0S1rZb1Fb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Wed, Mar 08, 2006 at 02:01:24PM +0000, Miguel Lopes Santos Ramos wrote:
> > I wonder if something else is going wrong and it's not rpc.lockd at
> > all.
>
> Oh, it's a locking problem alright. But perhaps not in rpc.lockd...

OK, I think I understand what is going on now...sort of.

> > It looks like this wasn't made using -s 0 - sorry if I wasn't
> > explicit.
>
> You must give all details to rookies...

Sorry.

> I've changed things a bit, but perhaps there's a test now which is more easily
> reproducible on other systems.
>
> The following tcpdumps were obtained by booting in single-user mode on the
> diskless machine and doing the following sequence for initialization:
> # mount -u /
> # /etc/rc.d/netif start
> # /etc/rc.d/rpcbind start
> # /etc/rc.d/nfsclient start
> # /etc/rc.d/nfslocking start
>
> And then, with /var/run/cron.pid removed,
> # /etc/rc.d/cron start
> Starting cron.
> # /etc/rc.d/cron stop
> # /etc/rc.d/nfslocking stop
> # /etc/rc.d/nfsclient stop
> # /etc/rc.d/rpcbind stop
> # reboot
> see http://mega.ist.utl.pt/~mlsr/nfs-nofile.bin
> Everything seemed to be ok, but /var/run/cron.pid was left locked on
> the server.

This is intentional.  It's how pidfile_*() tests whether the process
is still running.  The intention is that if someone tries to open the
pidfile again while the first process is still running, the lock
acquisition will fail and we'll know the other process is still alive,
and therefore avoid starting a second instance.

Your main problem seems to be that you're mounting the same /var via
NFS from multiple client machines.  This is basically a bad idea to
begin with because /var expects to be private to each machine (even if
locking worked as expected, you'd not be able to start cron on more
than one machine because it would fail as above).  Even if you solved
this there would be other similar problems.  In fact the diskless boot
infrastructure in /etc will set up and use an md (memory disk) /var
for this purpose.
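
As a rough illustration of the pidfile test described above: the sketch
below is not the pidfile_*() code itself, only the same idea written in
Python with flock(2). The helper name try_lock_pidfile is made up for
this example, and it assumes locks on the underlying filesystem behave
like local ones.

```python
import fcntl
import os


def try_lock_pidfile(path):
    """Open path and try to take an exclusive, non-blocking lock on it.

    Returns the open file descriptor on success, or None if the lock is
    already held through another open of the same file (that is, the
    first instance is presumably still running).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        # Someone else holds the lock: do not start a second instance.
        os.close(fd)
        return None
    # We own the lock. Record our pid and keep fd open for as long as
    # the process runs, since closing it releases the lock.
    os.ftruncate(fd, 0)
    os.write(fd, b"%d\n" % os.getpid())
    return fd
```

A daemon would call this once at startup and exit if it gets None back.
With a /var shared between machines, the second machine's cron would hit
exactly that case.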

There is a (known) lockd bug here though, which you isolated:

> With /var/run/cron.pid still locked, on the first client, single-user, same
> initialization sequence
> # lockf -k -t 1 /var/run/cron.pid echo ok
> Hangs... always.

which is that lock requests through rpc.lockd cannot be cancelled, so
they'll hang until the operation succeeds or fails.  In this case
lockf does a blocking lock request and expects to cancel it with a
signal after the timer expires, but rpc.lockd doesn't know how to back
out lock requests so it just hangs forever or until something else
unlocks the file on the server.

Kris

--VS++wcV0S1rZb1Fb
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.1 (FreeBSD)

iD8DBQFED16KWry0BWjoQKURAtdkAKDOZ/hNxMPgL500so0t8Mtl0Oi01QCfXouN
huuWeT9TL2A9EkS3oIOWwlo=
=uOe4
-----END PGP SIGNATURE-----

--VS++wcV0S1rZb1Fb--
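
The pattern the message attributes to lockf (a blocking lock request
that is abandoned when a timer signal arrives) can be sketched as below.
This is a Python illustration, not the lockf source; the names
lock_with_timeout and LockTimeout are hypothetical. It assumes the
blocking request can be interrupted by a signal, which is the part that
does not work through rpc.lockd according to the message. It must run in
the main thread, because Python only delivers signals there.

```python
import fcntl
import os
import signal


class LockTimeout(Exception):
    """Raised from the SIGALRM handler to abandon a blocking lock request."""


def _on_alarm(signum, frame):
    raise LockTimeout()


def lock_with_timeout(path, seconds):
    """Block waiting for an exclusive lock on path, giving up after `seconds`.

    Returns the locked file descriptor, or None if the timer expired first.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer
    try:
        # Blocking request. It only returns early if the signal can
        # interrupt it and the request can be backed out.
        fcntl.flock(fd, fcntl.LOCK_EX)
        return fd
    except LockTimeout:
        os.close(fd)
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
```

On a local filesystem the second caller gets None back once the timer
fires. The hang reported above corresponds to the blocking call never
being interrupted.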