Date:      Mon, 1 Feb 2016 09:25:50 -0500
From:      Vick Khera <vivek@khera.org>
To:        freebsd-questions@freebsd.org
Subject:   NFS unstable with high load on server
Message-ID:  <CALd%2BdcfzPU=nMGo41BBZzt3jQnsQJaANVyA222TDM_is2Ueo0A@mail.gmail.com>


I have a handful of servers at my data center all running FreeBSD 10.2. On
one of them I have a copy of the FreeBSD sources shared via NFS. When this
server is running a large poudriere run re-building all the ports I need,
the clients' NFS mounts become unstable. That is, the clients keep getting
read failures. The interactive performance of the NFS server is just fine,
however. The local file system is a ZFS mirror.

What could be causing NFS to be unstable in this situation?

Specifics:

Server "lorax" FreeBSD 10.2-RELEASE-p7 kernel locally compiled, with NFS
server and ZFS as dynamic kernel modules. 16GB RAM, Xeon 3.1GHz quad
processor.

The directory /u/lorax1 is a ZFS dataset on a mirrored pool, and is NFS
exported via the ZFS exports file. I put the FreeBSD sources on this
dataset and symlink /usr/src to them.


Client "bluefish" FreeBSD 10.2-RELEASE-p5 kernel locally compiled, NFS
client built in to kernel. 32GB RAM, Xeon 3.1GHz quad processor (basically
same hardware but more RAM).

The directory /n/lorax1 is NFS mounted from lorax via autofs. The NFS
options are "intr,nolockd". /usr/src is symlinked to the sources in that
NFS mount.


What I observe:

[lorax]~% cd /usr/src
[lorax]src% svn status
[lorax]src% w
 9:12AM  up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
USER       TTY      FROM                      LOGIN@  IDLE WHAT
vivek      pts/0    vick.int.kcilink.com      8:44AM     - tmux: client (/tmp/
vivek      pts/1    tmux(19747).%0            8:44AM    19 sed y%*+%pp%;s%[^_a
vivek      pts/2    tmux(19747).%1            8:56AM     - w
vivek      pts/3    tmux(19747).%2            8:56AM     - slogin bluefish-prv
[lorax]src% pwd
/u/lorax1/usr10/src

So right now the load average is more than 1 per processor on lorax. I can
quite easily run "svn status" on the source directory, and the interactive
performance is pretty snappy for editing local files and navigating around
the file system.


On the client:

[bluefish]~% cd /usr/src
[bluefish]src% pwd
/n/lorax1/usr10/src
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch':
Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/release/picobsd/tinyware/msg':
Partial results are valid but processing is incomplete
[bluefish]src% w
 9:14AM  up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
USER       TTY      FROM                      LOGIN@  IDLE WHAT
vivek      pts/0    lorax-prv.kcilink.com     8:56AM     - w
[bluefish]src% df .
Filesystem          1K-blocks    Used     Avail Capacity  Mounted on
lorax-prv:/u/lorax1 932845181 6090910 926754271     1%    /n/lorax1


What I see is more or less random failures to read the NFS volume. When the
server is not so busy running poudriere builds, the client never has any
failures.
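
To put a number on "more or less random", the failure rate could be measured
by walking the mounted tree repeatedly and counting how many walks hit a read
error. The following is only a minimal sketch: the function name
count_read_failures is hypothetical, and it assumes that "ls -R" exits
non-zero whenever a directory cannot be read, which is how the svn failures
above would presumably show up.

```shell
#!/bin/sh
# Hypothetical helper: count how many of ROUNDS recursive listings of DIR fail.
# Usage: count_read_failures DIR ROUNDS
count_read_failures() {
    dir=$1
    rounds=$2
    failures=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # ls -R returns a non-zero status if any directory could not be read
        if ! ls -R "$dir" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    # Print the number of failed walks
    echo "$failures"
}
```

For example, running "count_read_failures /n/lorax1/usr10/src 20" on the
client once while the server is idle and once during a poudriere run would
give two counts to compare.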

I also observe this kind of failure doing buildworld or installworld on
the client when the server is busy -- I get strange random failures reading
the files causing the build or install to fail.

My workaround is to not do build/installs on client machines when the NFS
server is busy doing large jobs like building all packages, but there is
definitely something wrong here I'd like to fix. I observe this on all the
local NFS clients. I have already tried rebooting the server, but that did
not fix it.

Any help would be appreciated.

