From owner-freebsd-questions@freebsd.org Mon Feb 1 14:25:54 2016
Date: Mon, 1 Feb 2016 09:25:50 -0500
From: Vick Khera <vivek@khera.org>
To: freebsd-questions@freebsd.org
Subject: NFS unstable with high load on server

I have a handful of servers at my data center, all running FreeBSD 10.2. On
one of them I have a copy of the FreeBSD sources shared via NFS. When this
server is running a large poudriere run rebuilding all the ports I need, the
clients' NFS mounts become unstable: the clients keep getting read failures.
The interactive performance of the NFS server itself is just fine, however.
The local file system is a ZFS mirror. What could be causing NFS to be
unstable in this situation?

Specifics:

Server "lorax": FreeBSD 10.2-RELEASE-p7, kernel locally compiled, with the
NFS server and ZFS loaded as dynamic kernel modules. 16 GB RAM, quad-core
3.1 GHz Xeon. The directory /u/lorax1 is a ZFS dataset on a mirrored pool
and is NFS exported via the ZFS exports file. I put the FreeBSD sources on
this dataset and symlink them to /usr/src.
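For reference, the export goes through the ZFS sharenfs property rather than
a hand-edited /etc/exports, roughly as below. This is only a sketch: the
pool name "tank" and the export options are placeholders, not the exact
values on lorax.

    # Export the dataset via the ZFS exports file. The pool name and
    # options here are placeholders, not the exact values in use.
    zfs set sharenfs="-maproot=root -network 192.168.1.0/24" tank/lorax1

    # FreeBSD writes the generated export entry to /etc/zfs/exports,
    # which mountd reads alongside /etc/exports.
    cat /etc/zfs/exports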
Client "bluefish": FreeBSD 10.2-RELEASE-p5, kernel locally compiled, NFS
client built into the kernel. 32 GB RAM, quad-core 3.1 GHz Xeon (basically
the same hardware, but more RAM). The directory /n/lorax1 is NFS mounted
from lorax via autofs; the NFS options are "intr,nolockd". /usr/src is
symlinked to the sources in that NFS mount.

What I observe on the server:

[lorax]~% cd /usr/src
[lorax]src% svn status
[lorax]src% w
 9:12AM  up 12 days, 19:19, 4 users, load averages: 4.43, 4.45, 3.61
USER     TTY    FROM                   LOGIN@  IDLE WHAT
vivek    pts/0  vick.int.kcilink.com   8:44AM     - tmux: client (/tmp/
vivek    pts/1  tmux(19747).%0         8:44AM    19 sed y%*+%pp%;s%[^_a
vivek    pts/2  tmux(19747).%1         8:56AM     - w
vivek    pts/3  tmux(19747).%2         8:56AM     - slogin bluefish-prv
[lorax]src% pwd
/u/lorax1/usr10/src

So right now the load average is more than 1 per processor on lorax. I can
quite easily run "svn status" on the source directory, and interactive
performance is pretty snappy for editing local files and navigating around
the file system.

On the client:

[bluefish]~% cd /usr/src
[bluefish]src% pwd
/n/lorax1/usr10/src
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/contrib/sqlite3': Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/lib/libfetch': Partial results are valid but processing is incomplete
[bluefish]src% svn status
svn: E070008: Can't read directory '/n/lorax1/usr10/src/release/picobsd/tinyware/msg': Partial results are valid but processing is incomplete
[bluefish]src% w
 9:14AM  up 93 days, 23:55, 1 user, load averages: 0.10, 0.15, 0.15
USER     TTY    FROM                    LOGIN@  IDLE WHAT
vivek    pts/0  lorax-prv.kcilink.com   8:56AM     - w
[bluefish]src% df .
Filesystem          1K-blocks    Used     Avail Capacity  Mounted on
lorax-prv:/u/lorax1 932845181 6090910 926754271     1%    /n/lorax1

What I see is more or less random failures to read the NFS volume. When the
server is not busy running poudriere builds, the client never has any
failures. I also see this kind of failure when doing a buildworld or
installworld on the client while the server is busy: random failures reading
files cause the build or install to fail. My workaround is to avoid builds
and installs on client machines while the NFS server is busy with large jobs
such as building all packages, but there is definitely something wrong here
that I would like to fix. I observe this on all the local NFS clients, and
rebooting the server did not clear it up.

Any help would be appreciated.
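If it would help with diagnosis, I can collect more data during the next
poudriere run. My assumption is that the useful things to sample would be
the NFS and mbuf statistics on both ends, along these lines:

    # On the server (lorax) while poudriere is running:
    nfsstat -e -s      # extended server-side NFS statistics
    netstat -m         # mbuf usage; denied mbuf requests would point
                       # at buffer exhaustion under load
    sysctl vfs.nfsd    # current nfsd thread limits

    # On the client (bluefish), watch for RPC retries and timeouts:
    nfsstat -e -c

These are stock FreeBSD tools; which specific counters actually matter here
is my guess, so pointers to the right ones are welcome.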