Date: Wed, 5 Jul 2006 15:20:40 +0300
From: Kostik Belousov <kostikbel@gmail.com>
To: Robert Watson <rwatson@freebsd.org>
Cc: freebsd-stable@freebsd.org, Michel Talon <talon@lpthe.jussieu.fr>
Subject: Re: NFS Locking Issue
Message-ID: <20060705122040.GN37822@deviant.kiev.zoral.com.ua>
In-Reply-To: <20060705113822.GM37822@deviant.kiev.zoral.com.ua>
References: <E1FxzUU-000MMw-5m@cs1.cs.huji.ac.il> <20060705100403.Y80381@fledge.watson.org> <20060705113822.GM37822@deviant.kiev.zoral.com.ua>
On Wed, Jul 05, 2006 at 02:38:22PM +0300, Kostik Belousov wrote:
> On Wed, Jul 05, 2006 at 10:09:24AM +0100, Robert Watson wrote:
> > The most significant problem working with rpc.lockd is creating easy to
> > reproduce test cases.  Not least because they can potentially involve
> > multiple clients.  If you can help to produce simple test cases to
> > reproduce the bugs you're seeing, that would be invaluable.
> >
> ........
> >
> > Reducing complex failure modes to easily reproduced test cases is tricky
> > also, though.  It requires careful analysis, often with ktrace and
> > tcpdump/ethereal to work out what's going on, and not a little luck to
> > perform the reduction of a large trace down to a simple test scenario.  The
> > first step is to try and figure out what, if any, specific workload results
> > in a problem.  For example, can you trigger it using work on just one
> > client against a server, without client<->client interactions?  This makes
> > tracking and reproduction a lot easier, as multi-client test cases are
> > really tricky!  Once you've established whether it can be reproduced with a
> > single client, you have to track down the behavior that triggers it --
> > normally, this is done by attempting to narrow down the specific program or
> > sequence of events that causes the bug to trigger, removing things one at a
> > time to see what causes the problem to disappear.  This is made more
> > difficult as lock managers are sensitive to timing, so removing a high load
> > item from the list, even if it isn't the source of the problem, might cause
> > it to trigger less frequently.
>
> I made the patch for rpc.lockd that could somewhat ease obtaining
> debug information. Patch is available at
> http://people.freebsd.org/~kib/rpc.lockd-debug.patch
>
> No functional changes. Patch only adds dumping of currently held locks
> (as perceived by lockd) on receiving of SIGUSR1. You need to specify
> debug level 2 or 3 to obtain the dump.
>
> Also, the both lockd processes now put identification information
> in the proctitle (srv and kern). SIGUSR1 shall be sent to srv process.

Hmm, after looking at the dump there and reading some code, I have noted
the following:

1. The NLM lock request contains the field caller_name. It is filled in
   by (let's call it) the kernel rpc.lockd from the result of
   hostname(3).
2. This caller_name is used by the server rpc.lockd to send a host
   monitoring request to rpc.statd (see send_granted). The request is
   made by clnt_call, which is a blocking RPC call.
3. rpc.statd calls getaddrinfo on caller_name to determine the address
   of the host to monitor.

If the getaddrinfo in step 3 waits for the resolver, then the locking
process on your client machine will get stuck in the "lockd" state.

Could people experiencing the rpc.lockd mystery at least report whether
the _server_ machine successfully resolves the hostname of the clients,
as reported by hostname? And, if yes, to what family of IP protocols?
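For anyone who wants to run that check on the server, here is a rough sketch. It is only an illustration: it is written in Python rather than the C used by rpc.statd, resolve_families is a made-up helper name, and it goes through the system resolver via socket.getaddrinfo, which may not behave exactly like the call made inside rpc.statd. It reports which address families a name resolves to and how long the lookup took, since a slow lookup is the suspected cause of the hang.

```python
import socket
import sys
import time


def resolve_families(hostname):
    """Resolve hostname and return (sorted family names, seconds spent).

    Hypothetical helper, not part of rpc.lockd or rpc.statd.
    Raises socket.gaierror if the name cannot be resolved.
    """
    start = time.monotonic()
    infos = socket.getaddrinfo(hostname, None)
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # collapse the duplicates that differ only by socket type.
    families = sorted({socket.AddressFamily(info[0]).name for info in infos})
    return families, elapsed


if __name__ == "__main__":
    # Pass the names that the clients print for `hostname`.
    for name in sys.argv[1:]:
        try:
            families, elapsed = resolve_families(name)
            print(f"{name}: {', '.join(families)} ({elapsed:.2f}s)")
        except socket.gaierror as exc:
            print(f"{name}: resolution failed: {exc}")
```

Run it on the server with the client hostnames as arguments. A failure, or a lookup that takes many seconds, would be worth reporting together with the address families listed.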