From: Rick Macklem
To: J David
Cc: FreeBSD Filesystems
Date: Sun, 19 Apr 2015 08:29:26 -0400 (EDT)
Message-ID: <1287096585.21725198.1429446566451.JavaMail.root@uoguelph.ca>
Subject: FreeBSD 10.1 can't "make -j5 buildworld" over NFS?

J David wrote:
> On identical hardware, against the exact same NFS server, FreeBSD 9.3
> can do a parallel buildworld, but FreeBSD 10.1 dies in cleandir with a
> bunch of "stale NFS file handle" errors.
>
> The mount options are the same on both clients:
>
> 192.168.20.161:/data/software/freebsd/releng-9.3/src /usr/src nfs
> rw,tcp,nfsv3,noauto 0 0
> 192.168.20.161:/data/software/freebsd/releng-9.3/amd64/obj /usr/obj
> nfs rw,tcp,nfsv3,noauto 0 0
>
>
> 192.168.20.161:/data/software/freebsd/releng-10.1/src /usr/src nfs
> rw,tcp,nfsv3,noauto 0 0
> 192.168.20.161:/data/software/freebsd/releng-10.1/amd64/obj /usr/obj
> nfs rw,tcp,nfsv3,noauto 0 0
[rest clipped for brevity]

I checked and I was incorrect w.r.t. "make" changing.

One thing you could try (although you said in your last post that you
weren't going to do anything) is disabling lookups that use shared
vnode locks:
# sysctl vfs.lookup_shared=0
and see if that stops it from failing with ESTALE.

Here's a comment from the NFS client code nfs_remove() (been there for
quite a while):
	/*
	 * Purge the name cache so that the chance of a lookup for
	 * the name succeeding while the remove is in progress is
	 * minimized. Without node locking it can still happen, such
	 * that an I/O op returns ESTALE, but since you get this if
	 * another host removes the file..
	 */

I don't believe I wrote this comment, but my understanding is that a
second thread may succeed in looking up the file (hit on the name
cache) while the remove is in progress and then attempt the remove
again.
Disabling shared vnode locking (forcing the lookup that precedes the
remove to acquire an exclusive lock on the directory) might avoid the
race.

My comment w.r.t. NFS not being POSIX compliant wasn't meant to say
that this problem wasn't fixable or shouldn't be fixed; it was meant to
imply that working on a POSIX file system doesn't imply working over
NFS.

Since FreeBSD 9.3 also has shared vnode locking enabled for lookups
(unless you disabled it), I don't know why 10.1 would break and 9.3
doesn't.

rick
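
For illustration only, the interleaving described above can be replayed
in a toy Python model. Nothing here is the kernel's code: ToyServer,
client_lookup, run and the file name "foo.o" are made up for the
sketch, and the model assumes that a remove using a file handle the
server no longer knows fails with ESTALE. It replays two fixed
orderings of the same two removes rather than using real threads.

```python
import errno


class ToyServer:
    """Hypothetical stand-in for the NFS server (not real NFS code)."""

    def __init__(self):
        self.handles = {}   # file handle -> name
        self.names = {}     # name -> file handle
        self.next_fh = 1

    def create(self, name):
        fh = self.next_fh
        self.next_fh += 1
        self.handles[fh] = name
        self.names[name] = fh
        return fh

    def lookup(self, name):
        # Returns None when the name no longer exists on the server.
        return self.names.get(name)

    def remove(self, fh):
        # Assumption of this model: an unknown handle is reported as ESTALE.
        name = self.handles.pop(fh, None)
        if name is None:
            return errno.ESTALE
        del self.names[name]
        return 0


def client_lookup(name_cache, server, name):
    """Name cache hit returns the cached handle; a miss asks the server."""
    if name in name_cache:
        return name_cache[name]
    return server.lookup(name)


def run(serialize_lookup):
    """Replay two removes of the same name, issued by threads A and B."""
    server = ToyServer()
    name_cache = {"foo.o": server.create("foo.o")}

    fh_a = client_lookup(name_cache, server, "foo.o")      # A: lookup
    fh_b = None
    if not serialize_lookup:
        # B's lookup slips in before A purges the cache: name cache hit.
        fh_b = client_lookup(name_cache, server, "foo.o")
    name_cache.pop("foo.o", None)                          # A: purge name cache
    result_a = server.remove(fh_a)                         # A: remove succeeds
    if serialize_lookup:
        # B's lookup runs only after A's remove has finished: cache miss.
        fh_b = client_lookup(name_cache, server, "foo.o")
    result_b = errno.ENOENT if fh_b is None else server.remove(fh_b)
    return result_a, result_b


if __name__ == "__main__":
    for serialize in (False, True):
        a, b = run(serialize)
        print(serialize, errno.errorcode.get(a, "ok"), errno.errorcode.get(b, "ok"))
```

In this model the racy ordering leaves the second remove with ESTALE,
while the serialized ordering leaves it with ENOENT, which is the error
"rm -f" ignores. Whether vfs.lookup_shared=0 actually produces the
serialized ordering on a real client is exactly what the test suggested
above would show.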