From owner-freebsd-stable@FreeBSD.ORG Wed Jul 5 14:51:46 2006
X-Original-To: freebsd-stable@freebsd.org
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 946F616A4DA;
	Wed, 5 Jul 2006 14:51:46 +0000 (UTC)
	(envelope-from freebsd@hub.org)
Received: from hub.org (hub.org [200.46.204.220])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 1D50743D49;
	Wed, 5 Jul 2006 14:51:45 +0000 (GMT)
	(envelope-from freebsd@hub.org)
Received: from localhost (wm.hub.org [200.46.204.128])
	by hub.org (Postfix) with ESMTP id BF842290C1E;
	Wed, 5 Jul 2006 11:51:38 -0300 (ADT)
Received: from hub.org ([200.46.204.220])
	by localhost (mx1.hub.org [200.46.204.128]) (amavisd-new, port 10024)
	with ESMTP id 76859-02; Wed, 5 Jul 2006 14:51:44 +0000 (UTC)
Received: from ganymede.hub.org (blk-7-151-244.eastlink.ca [71.7.151.244])
	by hub.org (Postfix) with ESMTP id 9FFA4290C20;
	Wed, 5 Jul 2006 11:51:37 -0300 (ADT)
Received: by ganymede.hub.org (Postfix, from userid 1027)
	id 7938849A13; Wed, 5 Jul 2006 11:51:49 -0300 (ADT)
Received: from localhost (localhost [127.0.0.1])
	by ganymede.hub.org (Postfix) with ESMTP id 77BCB4825E;
	Wed, 5 Jul 2006 11:51:49 -0300 (ADT)
Date: Wed, 5 Jul 2006 11:51:49 -0300 (ADT)
From: User Freebsd
To: Robert Watson
In-Reply-To: <20060705100403.Y80381@fledge.watson.org>
Message-ID: <20060705114848.F1103@ganymede.hub.org>
References: <20060705100403.Y80381@fledge.watson.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-stable@freebsd.org, Michel Talon
Subject: Re: NFS Locking Issue
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code
X-List-Received-Date: Wed, 05 Jul 2006 14:51:46 -0000

On Wed, 5 Jul 2006, Robert Watson wrote:

> On Wed, 5 Jul 2006, Danny Braniss wrote:
>
>> In my
>> case our main servers are NetApp, and the problems are more related
>> to am-utils running into some race condition (need more time to debug this
>> :-) the other problem is related to throughput, freebsd is slower than
>> linux, and while freebsd/nfs/tcp is faster on Freebsd than udp, on linux
>> it's the same. So it seems some tuning is needed.
>>
>> our main problem now is samba/rpc.lockd, we are stuck with a server running
>> FreeBSD 5.4 which crashes, and we can't upgrade to 6.1 because lockd
>> doesn't work.
>>
>> So, if someone is willing to look into the lockd issue, we would like to
>> help.
>
> The most significant problem working with rpc.lockd is creating easy to
> reproduce test cases. Not least because they can potentially involve
> multiple clients. If you can help to produce simple test cases to reproduce
> the bugs you're seeing, that would be invaluable.
>
> I'm aware of two general classes of problems with rpc.lockd. First,
> architectural issues, some derived from architectural problems in the NLM
> protocol: for example, assumptions that there can be a clean mapping of
> process lock owners to locks, which fall down as locks are properties of file
> descriptors that can be inherited. Second, implementation bugs/misfeatures,
> such as the kernel not knowing how to cancel lock requests, so being unable
> to implement interruptible waits on locks in the distributed case.
>
> Reducing complex failure modes to easily reproduced test cases is tricky
> also, though. It requires careful analysis, often with ktrace and
> tcpdump/ethereal to work out what's going on, and not a little luck to
> perform the reduction of a large trace down to a simple test scenario. The
> first step is to try and figure out what, if any, specific workload results
> in a problem. For example, can you trigger it using work on just one client
> against a server, without client<->client interactions?
> This makes tracking
> and reproduction a lot easier, as multi-client test cases are really tricky!
> Once you've established whether it can be reproduced with a single client,
> you have to track down the behavior that triggers it -- normally, this is
> done by attempting to narrow down the specific program or sequence of events
> that causes the bug to trigger, removing things one at a time to see what
> causes the problem to disappear. This is made more difficult as lock
> managers are sensitive to timing, so removing a high load item from the list,
> even if it isn't the source of the problem, might cause it to trigger less
> frequently.

I'm not sure if this is an option for anyone, either developer or user,
but in the past, on particularly tricky bugs where I seemed to be the only
one able to reproduce them, I've given a 'trusted developer' access to the
machine itself, to minimize the time lag that emails create ... but, also,
to let the developer at a machine that has the load required to easily
reproduce the problem ...

Not sure if there is anyone out there, on either side of the proverbial
fence, that feels comfortable doing this, but figured I'd throw the idea
out ...

I believe, in Francisco's case, they are willing to pay someone to fix the
NFS issues they are having, which, I'd assume, means easy access to the
problematic server(s) to do proper testing in a "real life scenario" ...

----
Marc G. Fournier    Hub.Org Networking Services (http://www.hub.org)
Email . scrappy@hub.org    MSN . scrappy@hub.org
Yahoo . yscrappy    Skype: hub.org    ICQ . 7615664