Date: Thu, 16 Apr 2015 17:47:08 -0400 (EDT)
From: Rick Macklem
To: J David
Cc: freebsd-fs@freebsd.org
Message-ID: <379019615.20563078.1429220828421.JavaMail.root@uoguelph.ca>
Subject: Re: FreeBSD 10.1 can't "make -j5 buildworld" over NFS?

J David wrote:
> On Wed, Apr 15, 2015 at 10:18 AM, Rick Macklem wrote:
> > Well, the NFS client is almost identical in the two systems.
> > (A couple of NFSv4 specific changes and a removal of a redundant check
> > for creation of a hard link across mount points are the only ones I
> > can see.)
> >
> > As such, I'd suspect userland differences. There is a different "make"
> > in 10 (which I don't think is in 9.3?), so this would be a good
> > starting point.
>
> That may be, but this problem only occurs over NFS. It does not
> happen with local UFS or ZFS. So perhaps the new make is exercising
> the NFS client differently than the old one, revealing the problem.
>
> > Btw, "stale NFS file handle" means that the file has been deleted
> > on the server.
>
> Yes it does. And the make always dies during cleandir, during which
> things are being aggressively deleted.
>
> It does seem like that's the *only* stage that has problems. I.e. if
> "make cleanworld" is run before "make -j5 buildworld" then the
> parallel build will succeed. Hopefully that means it will be
> relatively easy to narrow down / reproduce the problem behavior.
>
> However, in my experience, stale NFS file handles usually occur when
> one client deletes things out from under another client (and/or after
> a server reboot, which is not the case here). In this case, this is
> the only client that can even mount the relevant partition as
> read-write, much less writing to it. It's like the 10.1 client is
> caching that stuff exists even after it removes it, leading to errors
> from the server when it tries to access them again. It's pretty
> unusual (again, in my experience) for a single client to trip over
> *itself* when deleting things.
>
> Thanks!
>
First, I will point out that the NFS protocol is not POSIX compliant and,
as such, there will always be cases where applications that work on POSIX
compliant file systems don't work on NFS.

When the NFS Remove RPC is done, the file is removed. (NFS does not know
whether the file is open and does not maintain POSIX opens on files.)
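
To make the POSIX behavior in question concrete, here is a minimal Python sketch (not from this thread; the helper name and the scratch file are assumptions made up for illustration). It unlinks a file while it is still open and then reads it back through the open descriptor. On a local POSIX file system the data stays readable until the last close; a bare NFS Remove RPC offers no such guarantee, which is the gap the client has to paper over.

```python
import os
import tempfile

def read_after_unlink(data: bytes) -> bytes:
    """Write data to a scratch file, unlink it while open, then read it back."""
    fd, path = tempfile.mkstemp()          # hypothetical scratch file
    try:
        os.write(fd, data)
        os.unlink(path)                    # the name is gone from the directory...
        assert not os.path.exists(path)
        os.lseek(fd, 0, os.SEEK_SET)       # ...but the open descriptor is still valid
        return os.read(fd, len(data))
    finally:
        os.close(fd)                       # storage is released on the last close
```

Calling `read_after_unlink(b"hello")` on a POSIX system should return the bytes that were written, even though the path no longer exists at the time of the read.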
A "trick" used by the NFS client to approximate POSIX is called "silly
rename". When the client sees that a file is open by another process on
the machine, an unlink(2) becomes "rename the file to .nfsXXX and then do
a Remove RPC on it when the open count goes to 0". This normally avoids
"stale NFS file handle" within a single client. This "trick" is not
"race free" when done between multiple clients, but for a single client
I am not aware of a problem with it.

However, the FreeBSD client does this, so I doubt this is the problem.

It may be something as simple as make expecting ENOENT for a remove and
instead getting ESTALE from NFS when the file has already been deleted.

rick
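
The ENOENT versus ESTALE guess above can be sketched as follows. This is only an illustration in Python of how a removal routine could tolerate both errors; `remove_tolerant` is a hypothetical helper, not something make actually contains, and whether treating ESTALE as "already gone" is appropriate is an assumption that depends on the caller.

```python
import errno
import os

def remove_tolerant(path: str) -> bool:
    """Remove path. Return True if it was removed, False if it was already gone."""
    try:
        os.unlink(path)
        return True
    except OSError as e:
        # Assumption: over NFS, a file that was already deleted may surface
        # as ESTALE rather than ENOENT, so both are treated as "already gone".
        if e.errno in (errno.ENOENT, errno.ESTALE):
            return False
        raise  # anything else (EACCES, EISDIR, ...) is still a real error
```

A caller that only handles ENOENT would abort on the ESTALE case, which would match a cleandir failure that shows up only over NFS.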