From owner-freebsd-fs@freebsd.org  Tue Feb  2 06:26:56 2016
Subject: Re: NFS unstable with high load on server
From: Charles Sprickman <spork@bway.net>
Date: Tue, 2 Feb 2016 01:26:43 -0500
To: Ben Woods
Cc: Vick Khera, freebsd-fs@freebsd.org, freebsd-questions@freebsd.org

On Feb 2, 2016, at 1:10 AM, Ben Woods wrote:

> On Monday, 1 February 2016, Vick Khera wrote:
>
>> I have a handful of servers at my data center, all running FreeBSD 10.2.
>> On one of them I have a copy of the FreeBSD sources shared via NFS. When
>> this server is running a large poudriere run rebuilding all the ports I
>> need, the clients' NFS mounts become unstable. That is, the clients keep
>> getting read failures. The interactive performance of the NFS server is
>> just fine, however. The local file system is a ZFS mirror.
>>
>> What could be causing NFS to be unstable in this situation?
>>
>> Specifics:
>>
>> Server "lorax": FreeBSD 10.2-RELEASE-p7, locally compiled kernel, with
>> the NFS server and ZFS as dynamic kernel modules. 16GB RAM, 3.1GHz
>> quad-core Xeon.
>>
>> The directory /u/lorax1 is a ZFS dataset on a mirrored pool, and is NFS
>> exported via the ZFS exports file. I put the FreeBSD sources on this
>> dataset and symlink them to /usr/src.
>>
>> Client "bluefish": FreeBSD 10.2-RELEASE-p5, locally compiled kernel, NFS
>> client built into the kernel. 32GB RAM, 3.1GHz quad-core Xeon (basically
>> the same hardware, but more RAM).
>>
>> The directory /n/lorax1 is NFS mounted from lorax via autofs. The NFS
>> options are "intr,nolockd". /usr/src is symlinked to the sources in that
>> NFS mount.
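(Interjecting for anyone trying to reproduce this: the autofs map itself
isn't shown above, but the mount should be equivalent to a one-liner like
the following. The server hostname is taken from the df output further
down; the rest is just the options Vick lists, so treat it as a sketch of
the setup, not his actual config.)

    # one-off equivalent of the autofs-managed mount described above
    mount_nfs -o intr,nolockd lorax-prv:/u/lorax1 /n/lorax1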
>>
>> What I observe:
>>
>> [lorax]~% cd /usr/src
>> [lorax]src% svn status
>> [lorax]src% w
>> 9:12AM  up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
>> USER   TTY    FROM                   LOGIN@  IDLE WHAT
>> vivek  pts/0  vick.int.kcilink.com   8:44AM     - tmux: client (/tmp/
>> vivek  pts/1  tmux(19747).%0         8:44AM    19 sed y%*+%pp%;s%[^_a
>> vivek  pts/2  tmux(19747).%1         8:56AM     - w
>> vivek  pts/3  tmux(19747).%2         8:56AM     - slogin bluefish-prv
>> [lorax]src% pwd
>> /u/lorax1/usr10/src
>>
>> So right now the load average is more than 1 per processor on lorax. I
>> can quite easily run "svn status" on the source directory, and the
>> interactive performance is pretty snappy for editing local files and
>> navigating around the file system.
>>
>> On the client:
>>
>> [bluefish]~% cd /usr/src
>> [bluefish]src% pwd
>> /n/lorax1/usr10/src
>> [bluefish]src% svn status
>> svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3':
>> Partial results are valid but processing is incomplete
>> [bluefish]src% svn status
>> svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch':
>> Partial results are valid but processing is incomplete
>> [bluefish]src% svn status
>> svn: E070008: Can't read directory
>> '/n/lorax1/usr10/src/release/picobsd/tinyware/msg': Partial results are
>> valid but processing is incomplete
>> [bluefish]src% w
>> 9:14AM  up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
>> USER   TTY    FROM                    LOGIN@  IDLE WHAT
>> vivek  pts/0  lorax-prv.kcilink.com   8:56AM     - w
>> [bluefish]src% df .
>> Filesystem          1K-blocks    Used     Avail Capacity  Mounted on
>> lorax-prv:/u/lorax1 932845181 6090910 926754271     1%    /n/lorax1
>>
>> What I see is more or less random failures to read the NFS volume. When
>> the server is not so busy running poudriere builds, the client never has
>> any failures.
>>
>> I also observe this kind of failure when doing a buildworld or
>> installworld on the client while the server is busy; I get strange
>> random failures reading the files, causing the build or install to fail.
>>
>> My workaround is to not do builds or installs on client machines when
>> the NFS server is busy doing large jobs like building all packages, but
>> there is definitely something wrong here that I'd like to fix. I observe
>> this on all the local NFS clients. I rebooted the server earlier to try
>> to clear this up, but it did not fix it.
>>
>> Any help would be appreciated.
>
> I just wanted to point out that I am experiencing this exact same issue
> in my home setup.
>
> Performing an installworld from an NFS mount works perfectly, until I
> start running poudriere on the NFS server. Then I start getting NFS
> timeouts and the installworld fails.
>
> The NFS server is also using ZFS, but the NFS export in my case is being
> done via the ZFS property "sharenfs" (I am not using the /etc/exports
> file).

Me three. I'm actually updating a small group of servers now and started
blowing up my installworlds by trying to do some poudriere builds at the
same time. Very repeatable. Of note, I'm on 9.3, and I saw this on 8.4 as
well. If I track down the client-side failures, it's always "permission
denied".

Thanks,

Charles
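Side note, since the export mechanism differs between the two setups: the
two styles look roughly like the lines below. The dataset name, network,
and options are invented for illustration; see zfs(8) and exports(5) for
the real syntax.

    # Ben's style: let ZFS manage the export
    # (FreeBSD writes this out to /etc/zfs/exports)
    zfs set sharenfs="-network 192.168.1.0 -mask 255.255.255.0" tank/lorax1

    # traditional style: a hand-maintained line in /etc/exports
    /u/lorax1 -network 192.168.1.0 -mask 255.255.255.0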
> I suspect this will boil down to a ZFS tuning issue, where poudriere and
> installworld are both stress testing the server. Both of these would
> obviously cause significant memory and CPU usage, and cause the
> "recently used" portion of the ARC to be flushed constantly as they
> access a large number of different files.
>
> It might be interesting if you could report the output of the heading
> lines (including memory and ARC details) from the "top" command before
> and after running poudriere and attempting the installworld.
>
> Regards,
> Ben
>
> --
> From: Benjamin Woods
> woodsb02@gmail.com
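For what it's worth, the heading lines Ben asks about can be captured
non-interactively, which makes a before/after comparison easier to script.
A minimal sketch, using top(1) batch mode and the stock ZFS arcstats
sysctls (the number of header lines to keep is approximate):

    # snapshot the load, memory, and ARC summary lines
    top -b | head -n 8

    # raw ARC sizing from the ZFS kstats
    sysctl kstat.zfs.misc.arcstats.size \
           kstat.zfs.misc.arcstats.c_max \
           kstat.zfs.misc.arcstats.c_min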