Date: Wed, 10 Dec 2008 11:30:26 -0500 (EST)
From: Rick Macklem
To: David Wolfskill
Cc: hackers@freebsd.org, current@freebsd.org
Subject: Re: NFS (& amd?) dysfunction descending a hierarchy
In-Reply-To: <20081209190110.GW60731@albert.catwhisker.org>
List-Id: Discussions about the use of FreeBSD-current

On Tue, 9 Dec 2008, David Wolfskill wrote:

> On Tue, Dec 02, 2008 at 04:15:38PM -0800, David Wolfskill wrote:
>> I seem to have a fairly- (though not deterministically so) reproducible
>> mode of failure with an NFS-mounted directory hierarchy: An attempt to
>> traverse a "sufficiently large" hierarchy (e.g., via "tar zcpf" or "rm
>> -fr") will fail to "visit" some subdirectories, typically apparently
>> acting as if the subdirectories in question do not actually exist
>> (despite the names having been returned in the output of a previous
>> readdir()).
>> ...
>
> I was able to reproduce the external symptoms of the failure running
> CURRENT as of yesterday, using "rm -fr" of a copy of a recent
> /usr/ports hierarchy on an NFS-mounted file system as a test case.
> However, I believe the mechanism may be a bit different -- while
> still being other than what I would expect.
>
> One aspect in which the externally-observable symptoms were different
> (under CURRENT, vs. RELENG_7) is that under CURRENT, once the error
> condition occurred, the NFS client machine was in a state where it
> merely kept repeating
>
> nfs server pid848@fbsd-build:/volume: not responding
>
> until I logged in as root & rebooted it.
>
The different behaviour on -CURRENT could be due to the new RPC layer
that was recently introduced, but that doesn't explain the basic problem.

All I can think of is to ask the obvious question: "Are you using
interruptible or soft mounts?" If so, switch to hard mounts and see if
the problem goes away. (IMHO, neither interruptible nor soft mounts are
a good idea. You can use a forced dismount if there is a crashed NFS
server that isn't coming back anytime soon.)
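A quick way to answer the "soft or interruptible?" question for every NFS entry on a client is to scan its fstab. The sketch below is a hypothetical helper, not a FreeBSD tool: it assumes the usual whitespace-separated fstab layout (device, mount point, type, options, dump, pass), assumes NFS entries use the type "nfs", and treats "soft" and "intr" as the options to flag, since a mount without "soft" is hard by default.

```python
from typing import Iterable, List, Tuple

# Options that make an NFS mount soft or interruptible (assumed option names,
# as spelled in an fstab options field); hard mounts simply omit "soft".
RISKY_NFS_OPTIONS = frozenset({"soft", "intr"})


def risky_nfs_mounts(fstab_lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Return (mount point, flagged options) for each NFS entry using soft/intr.

    Assumes fstab-style lines: device mountpoint fstype options [dump pass].
    Comment text after '#' and blank lines are ignored.
    """
    findings = []
    for raw in fstab_lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding space
        if not line:
            continue
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS mount.
        if len(fields) < 4 or fields[2] != "nfs":
            continue
        options = fields[3].split(",")
        flagged = sorted(opt for opt in options if opt in RISKY_NFS_OPTIONS)
        if flagged:
            findings.append((fields[1], flagged))
    return findings


if __name__ == "__main__":
    # Hypothetical sample entries, for illustration only.
    sample = [
        "# Device             Mountpoint  FStype  Options       Dump  Pass",
        "fbsd-build:/volume   /volume     nfs     rw,soft,intr  0     0",
        "fbsd-build:/other    /other      nfs     rw            0     0",
    ]
    print(risky_nfs_mounts(sample))  # [('/volume', ['intr', 'soft'])]
```

Any mount point it reports would be a candidate for remounting hard (dropping "soft" and "intr") before re-running the "tar zcpf" or "rm -fr" test.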

If you are getting this with hard mounts, I'm afraid I have no idea what
the problem is.

rick
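If the problem persists with hard mounts, one way to gather more data on the symptom David reports (names returned by readdir() that then behave as if they do not exist) is a read-only walk that lists each directory first and only afterwards looks up every name it got back. The sketch below is a hypothetical diagnostic, not part of FreeBSD: it assumes Python 3 on the NFS client, uses os.scandir() for the listing and a separate os.lstat() per name for the lookup, and reports every path where that lookup fails with ENOENT. Other errors, such as permission failures, are not caught and will stop the walk.

```python
import os
import stat
from typing import List


def find_vanishing_entries(top: str) -> List[str]:
    """Walk `top` without modifying it; return paths readdir() listed but
    a subsequent lstat() reported as nonexistent (ENOENT).

    If `top` itself cannot be opened because it is missing, it is reported too.
    """
    vanished = []
    stack = [top]
    while stack:
        directory = stack.pop()
        try:
            # Collect the names first: this is the readdir() output.
            with os.scandir(directory) as it:
                names = [entry.name for entry in it]
        except FileNotFoundError:
            vanished.append(directory)
            continue
        for name in names:
            path = os.path.join(directory, name)
            try:
                # A fresh lookup of a name readdir() just returned.
                st = os.lstat(path)
            except FileNotFoundError:
                vanished.append(path)
                continue
            if stat.S_ISDIR(st.st_mode):  # descend into real directories only
                stack.append(path)
    return sorted(vanished)
```

On a healthy, unchanging tree it should return an empty list. A non-empty result on an NFS mount that nothing else is modifying would point at the client-side lookup path rather than at tar or rm.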