From: bugzilla-noreply@freebsd.org
To: freebsd-bugs@FreeBSD.org
Subject: [Bug 206855] NFS errors from ZFS backed file system when server under load
Date: Tue, 02 Feb 2016 17:22:16 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=206855

            Bug ID: 206855
           Summary: NFS errors from ZFS backed file system when server
                    under load
           Product: Base System
           Version: 10.2-RELEASE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: freebsd-bugs@FreeBSD.org
          Reporter: vivek@khera.org

I posted a question about NFS errors (unable to read directories or files)
when the NFS server comes under high load, and at least two other people
reported that they observe the same types of failures. The thread is at
https://lists.freebsd.org/pipermail/freebsd-questions/2016-February/270292.html

This might be related to bug #132068.

It seems that sharing a ZFS dataset over NFS is not stable when the server
is under high load. Here's my original question/bug report:

I have a handful of servers at my data center, all running FreeBSD 10.2. On
one of them I have a copy of the FreeBSD sources shared via NFS. When this
server is running a large poudriere run rebuilding all the ports I need, the
clients' NFS mounts become unstable. That is, the clients keep getting read
failures. The interactive performance of the NFS server is just fine,
however. The local file system is a ZFS mirror.

What could be causing NFS to be unstable in this situation?

Specifics:

Server "lorax": FreeBSD 10.2-RELEASE-p7, kernel locally compiled, with the
NFS server and ZFS as dynamic kernel modules. 16GB RAM, Xeon 3.1GHz quad
processor. The directory /u/lorax1 is a ZFS dataset on a mirrored pool and
is NFS exported via the ZFS exports file. I put the FreeBSD sources on this
dataset and symlinked them to /usr/src.
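For anyone trying to reproduce the server side, a setup along these lines
should be close (a sketch only, not the reporter's actual commands; the pool
name "tank", the disk names, and the export network are placeholders):

  # /etc/rc.conf fragment: enable ZFS and the NFS server
  zfs_enable="YES"
  rpcbind_enable="YES"
  mountd_enable="YES"
  nfs_server_enable="YES"

  # create the mirrored pool and the dataset, then export it through the
  # ZFS exports file (/etc/zfs/exports) via the sharenfs property
  zpool create tank mirror da0 da1
  zfs create -o mountpoint=/u/lorax1 tank/lorax1
  zfs set sharenfs="-maproot=root -network=192.168.1.0/24" tank/lorax1

  # put the sources on the dataset and symlink them into place
  ln -s /u/lorax1/usr10/src /usr/src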
Client "bluefish": FreeBSD 10.2-RELEASE-p5, kernel locally compiled, NFS
client built into the kernel. 32GB RAM, Xeon 3.1GHz quad processor
(basically the same hardware, but with more RAM). The directory /n/lorax1 is
NFS mounted from lorax via autofs, with the NFS options "intr,nolockd" (an
equivalent autofs configuration is sketched at the end of this report).
/usr/src is symlinked to the sources in that NFS mount.

What I observe:

[lorax]~% cd /usr/src
[lorax]src% svn status
[lorax]src% w
 9:12AM  up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
USER     TTY   FROM                   LOGIN@  IDLE WHAT
vivek    pts/0 vick.int.kcilink.com   8:44AM     - tmux: client (/tmp/
vivek    pts/1 tmux(19747).%0         8:44AM    19 sed y%*+%pp%;s%[^_a
vivek    pts/2 tmux(19747).%1         8:56AM     - w
vivek    pts/3 tmux(19747).%2         8:56AM     - slogin bluefish-prv
[lorax]src% pwd
/u/lorax1/usr10/src

So right now the load average is more than 1 per processor on lorax. I can
quite easily run "svn status" on the source directory, and the interactive
performance is pretty snappy for editing local files and navigating around
the file system.

On the client:

[bluefish]~% cd /usr/src
[bluefish]src% pwd
/n/lorax1/usr10/src
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory
'/n/lorax1/usr10/src/release/picobsd/tinyware/msg': Partial results are
valid but processing is incomplete
[bluefish]src% w
 9:14AM  up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
USER     TTY   FROM                    LOGIN@  IDLE WHAT
vivek    pts/0 lorax-prv.kcilink.com   8:56AM     - w
[bluefish]src% df .
Filesystem          1K-blocks    Used     Avail Capacity  Mounted on
lorax-prv:/u/lorax1 932845181 6090910 926754271     1%    /n/lorax1

What I see is more or less random failures to read the NFS volume. When the
server is not busy running poudriere builds, the client never has any
failures. I also see this kind of failure when doing buildworld or
installworld on the client while the server is busy: I get strange random
failures reading files, which cause the build or install to fail.

My workaround is to avoid builds and installs on client machines while the
NFS server is busy with large jobs such as building all packages, but
something is definitely wrong here that I'd like to fix. I observe this on
all the local NFS clients. I rebooted the server earlier to try to clear
this up, but that did not fix it.

My intuition points to some sort of race condition between ZFS and NFS, but
digging deeper into that is well beyond my pay grade.
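For the client side, an autofs configuration equivalent to the mount
described above might look like the following (a sketch; the map file name
/etc/auto_n and the exact layout are assumptions, while the mount options
and server path are taken from the report):

  # /etc/rc.conf fragment: enable autofs on the client
  autofs_enable="YES"

  # /etc/auto_master: serve the map /etc/auto_n under /n
  /n    /etc/auto_n

  # /etc/auto_n: NFS-mount the dataset from lorax with the reported options
  lorax1    -intr,nolockd    lorax-prv:/u/lorax1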
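A few server-side checks might help narrow this down while a poudriere build
is running: whether all nfsd service threads are busy, and whether the
network stack is running out of buffers (a sketch; the thread count below is
illustrative, not from the report):

  # extended statistics for the new NFS server
  nfsstat -e -s

  # look for mbuf denials or delays, which would indicate buffer exhaustion
  netstat -m

  # count the nfsd service threads; if they are all busy, more can be
  # configured in /etc/rc.conf, e.g.:
  #   nfs_server_flags="-u -t -n 64"
  ps ax | grep nfsd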