From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 206855] NFS errors from ZFS backed file system when server under load
Date: Tue, 02 Feb 2016 17:22:16 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206855

            Bug ID: 206855
           Summary: NFS errors from ZFS backed file system when server
                    under load
           Product: Base System
           Version: 10.2-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: vivek@khera.org

I posted a question about NFS errors (unable to read directories or files)
when the NFS server comes under high load, and at least two other people
reported that they observe the same types of failures. The thread is at
https://lists.freebsd.org/pipermail/freebsd-questions/2016-February/270292.html

This might be related to bug #132068.

It seems that sharing a ZFS dataset over NFS is not stable when the server
is under high load. Here's my original question/bug report:

I have a handful of servers at my data center, all running FreeBSD 10.2. On
one of them I have a copy of the FreeBSD sources shared via NFS. When this
server is running a large poudriere run rebuilding all the ports I need, the
clients' NFS mounts become unstable. That is, the clients keep getting read
failures. The interactive performance of the NFS server is just fine,
however. The local file system is a ZFS mirror.

What could be causing NFS to be unstable in this situation?

Specifics:

Server "lorax": FreeBSD 10.2-RELEASE-p7, kernel locally compiled, with the
NFS server and ZFS as dynamic kernel modules. 16GB RAM, Xeon 3.1GHz quad
processor. The directory /u/lorax1 is a ZFS dataset on a mirrored pool and
is NFS exported via the ZFS exports file. I put the FreeBSD sources on this
dataset and symlinked them to /usr/src.
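For anyone trying to reproduce the server side, a setup along these lines
should be close (a sketch only, not the reporter's actual commands; the pool
name "tank", the disk names, and the export network are placeholders):

  # /etc/rc.conf fragment: enable ZFS and the NFS server
  zfs_enable="YES"
  rpcbind_enable="YES"
  mountd_enable="YES"
  nfs_server_enable="YES"

  # create the mirrored pool and the dataset, then export it through the
  # ZFS exports file (/etc/zfs/exports) via the sharenfs property
  zpool create tank mirror da0 da1
  zfs create -o mountpoint=/u/lorax1 tank/lorax1
  zfs set sharenfs="-maproot=root -network=192.168.1.0/24" tank/lorax1

  # put the sources on the dataset and symlink them into place
  ln -s /u/lorax1/usr10/src /usr/src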
Client "bluefish": FreeBSD 10.2-RELEASE-p5, kernel locally compiled, NFS
client built into the kernel. 32GB RAM, Xeon 3.1GHz quad processor
(basically the same hardware, but with more RAM). The directory /n/lorax1 is
NFS mounted from lorax via autofs, with the NFS options "intr,nolockd" (an
equivalent autofs configuration is sketched at the end of this report).
/usr/src is symlinked to the sources in that NFS mount.

What I observe:

[lorax]~% cd /usr/src
[lorax]src% svn status
[lorax]src% w
 9:12AM  up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
USER     TTY   FROM                   LOGIN@  IDLE WHAT
vivek    pts/0 vick.int.kcilink.com   8:44AM     - tmux: client (/tmp/
vivek    pts/1 tmux(19747).%0         8:44AM    19 sed y%*+%pp%;s%[^_a
vivek    pts/2 tmux(19747).%1         8:56AM     - w
vivek    pts/3 tmux(19747).%2         8:56AM     - slogin bluefish-prv
[lorax]src% pwd
/u/lorax1/usr10/src

So right now the load average is more than 1 per processor on lorax. I can
quite easily run "svn status" on the source directory, and the interactive
performance is pretty snappy for editing local files and navigating around
the file system.

On the client:

[bluefish]~% cd /usr/src
[bluefish]src% pwd
/n/lorax1/usr10/src
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory
'/n/lorax1/usr10/src/release/picobsd/tinyware/msg': Partial results are
valid but processing is incomplete
[bluefish]src% w
 9:14AM  up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
USER     TTY   FROM                    LOGIN@  IDLE WHAT
vivek    pts/0 lorax-prv.kcilink.com   8:56AM     - w
[bluefish]src% df .
Filesystem          1K-blocks    Used     Avail Capacity  Mounted on
lorax-prv:/u/lorax1 932845181 6090910 926754271     1%    /n/lorax1

What I see is more or less random failures to read the NFS volume. When the
server is not busy running poudriere builds, the client never has any
failures. I also see this kind of failure when doing buildworld or
installworld on the client while the server is busy: I get strange random
failures reading files, which cause the build or install to fail.

My workaround is to avoid builds and installs on client machines while the
NFS server is busy with large jobs such as building all packages, but
something is definitely wrong here that I'd like to fix. I observe this on
all the local NFS clients. I rebooted the server earlier to try to clear
this up, but that did not fix it.

My intuition points to some sort of race condition between ZFS and NFS, but
digging deeper into that is well beyond my pay grade.
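For the client side, an autofs configuration equivalent to the mount
described above might look like the following (a sketch; the map file name
/etc/auto_n and the exact layout are assumptions, while the mount options
and server path are taken from the report):

  # /etc/rc.conf fragment: enable autofs on the client
  autofs_enable="YES"

  # /etc/auto_master: serve the map /etc/auto_n under /n
  /n    /etc/auto_n

  # /etc/auto_n: NFS-mount the dataset from lorax with the reported options
  lorax1    -intr,nolockd    lorax-prv:/u/lorax1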
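A few server-side checks might help narrow this down while a poudriere build
is running: whether all nfsd service threads are busy, and whether the
network stack is running out of buffers (a sketch; the thread count below is
illustrative, not from the report):

  # extended statistics for the new NFS server
  nfsstat -e -s

  # look for mbuf denials or delays, which would indicate buffer exhaustion
  netstat -m

  # count the nfsd service threads; if they are all busy, more can be
  # configured in /etc/rc.conf, e.g.:
  #   nfs_server_flags="-u -t -n 64"
  ps ax | grep nfsd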