From: Robert Watson
Date: Tue, 13 May 2003 14:51:17 -0400 (EDT)
To: "Andrew P. Lentvorski, Jr."
cc: Don Lewis
cc: alfred@FreeBSD.org
cc: current@FreeBSD.org
Subject: Re: rpc.lockd spinning; much breakage

On Tue, 13 May 2003, Robert Watson wrote:

> So the client isn't retrying, or mapping errors right after this patch,
> but the failure modes are more consistent and I seem not to be getting
> any interminable hangs anymore on the client.

I should clarify this statement: I no longer get the odd hangs in client and server interactions when contending for a lock established on the server and now tested by the client. I still bump into the "client isn't woken up in a timely manner after a lock is released by the same or another client" problem.
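
For readers who want to probe this kind of wake-up delay themselves, here is a minimal sketch of a lock-and-sleep client. It is a hypothetical stand-in, not the locktest tool used in this thread: the function name `lock_and_sleep` and its log format are assumptions. It takes the lock at open time via `O_EXLOCK` where the platform defines it and otherwise falls back to a separate blocking `flock()` call, which may not exercise the same code path.

```python
import fcntl
import os
import time


def lock_and_sleep(path, hold_seconds, log=print):
    """Take an exclusive lock on path, hold it, and log timestamped steps.

    Hypothetical helper for illustration; not the original locktest program.
    Returns the list of step names in the order they happened.
    """
    pid = os.getpid()
    events = []

    def note(step):
        events.append(step)
        log(f"{pid} {step} {time.ctime()}")

    # O_EXLOCK (BSD, macOS) acquires the lock as part of open() and blocks
    # until it is available. Where it is missing, exlock is 0 and we lock
    # with a separate blocking flock() call instead.
    exlock = getattr(os, "O_EXLOCK", 0)

    note("open()")
    fd = os.open(path, os.O_RDONLY | exlock)
    try:
        if not exlock:
            fcntl.flock(fd, fcntl.LOCK_EX)
        note("open() returns")
        note(f"sleep({hold_seconds})")
        time.sleep(hold_seconds)
        note("sleep() returns")
    finally:
        os.close(fd)  # closing the descriptor releases the lock
    return events
```

Running one instance with a 10-second hold and a second instance with a 0-second hold against the same NFS-mounted file, from two terminals, gives a pair of timestamped logs that can be compared against the rpc.lockd messages.
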
Here's the demonstration case with a bit more detail than what I presented earlier. The server runs on host cboss; the client runs twice on host crash1, on different ptys. In this scenario, each client attempts to grab an exclusive lock, potentially blocking, and then sleep for 10 seconds (this is with one of the earlier posted patches):

crash1:/tmp> ./locktest nocreate openlock block noflock test 10
933 open(test, 32, 0666) Tue May 13 14:31:31 2003
933 open() returns Tue May 13 14:31:31 2003
933 sleep(10) Tue May 13 14:31:31 2003
933 sleep() returns Tue May 13 14:31:41 2003

crash1:/tmp> ./locktest nocreate openlock block noflock test 0
934 open(test, 32, 0666) Tue May 13 14:31:33 2003
934 open() returns Tue May 13 14:31:53 2003

rpc.lockd results on crash1:

May 13 14:31:31 crash1 rpc.lockd: nlm_lock_res from 192.168.50.1
May 13 14:31:33 crash1 rpc.lockd: nlm_lock_res from 192.168.50.1
May 13 14:31:42 crash1 rpc.lockd: nlm_granted_msg from 192.168.50.1
May 13 14:31:42 crash1 rpc.lockd: nlm_unlock_res from 192.168.50.1
May 13 14:31:42 crash1 rpc.lockd: process 933: No such process
May 13 14:31:53 crash1 rpc.lockd: nlm_lock_res from 192.168.50.1

In this example, pid 934 requests the lock on the object at 14:31:33 -- pid 933 released that lock at 14:31:41, but pid 934 isn't notified until 14:31:53. It looks like it should have been notified at 14:31:42, when the granted message is received, but instead it is notified when the client rpc.lockd polls again 10 seconds from lock inception. I almost wonder if that ESRCH shouldn't have been the notification for 934, and it was using the wrong pid.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert@fledge.watson.org      Network Associates Laboratories
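
The delays in the trace above can be worked out directly from the logged times. The sketch below only does that arithmetic; the variable names and the `seconds_between` helper are illustrative and not part of any tool mentioned in this thread.

```python
from datetime import datetime


def at(clock):
    """Parse an HH:MM:SS time of day taken from the logs above."""
    return datetime.strptime(clock, "%H:%M:%S")


def seconds_between(start, end):
    """Whole seconds elapsed from start to end."""
    return int((end - start).total_seconds())


requested = at("14:31:33")  # pid 934 calls open()
released = at("14:31:41")   # pid 933's sleep() returns
granted = at("14:31:42")    # rpc.lockd logs nlm_granted_msg
woken = at("14:31:53")      # pid 934's open() returns

total_wait = seconds_between(requested, woken)
wait_after_release = seconds_between(released, woken)
wait_after_grant = seconds_between(granted, woken)

print(total_wait, wait_after_release, wait_after_grant)  # prints: 20 12 11
```

So pid 934 stays blocked for 11 seconds after the granted message shows up in the log, and for 20 seconds in total. That fits a wake-up driven by a later poll rather than by the granted message, though the log alone does not establish the polling schedule.
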