From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Date: Sat, 27 Feb 2016 15:21:02 +1100 (EST)
Message-ID: <20160227131353.V1337@besplex.bde.org>
In-Reply-To: <20160226164613.N2180@besplex.bde.org>
References: <20160226164613.N2180@besplex.bde.org>
On Fri, 26 Feb 2016, Bruce Evans wrote:

> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> reasons today.
> ...
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close.  I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later, if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
>
> newnfs has quite different code for nfs_writerpc().  Most of it was
> moved to another function in another file.  I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.

This quick fix seems to give the same behaviour as in oldnfs.  It also
fixes some bugs in comments in nfs_fsync() (where I tried to pass a
non-null cred, but none is available; the ARGSUSED bug is in many
other functions):

X Index: nfs_clvnops.c
X ===================================================================
X --- nfs_clvnops.c	(revision 296089)
X +++ nfs_clvnops.c	(working copy)
X @@ -1425,6 +1425,23 @@
X  	}
X  	if (DOINGASYNC(vp))
X  		*iomode = NFSWRITE_FILESYNC;
X +	if (error == 0 && NFS_ISV3(vp)) {
X +		/*
X +		 * Break seeing concurrent changes by other clients,
X +		 * since without this the next nfs_open() would
X +		 * invalidate our write buffers.  This is worse than
X +		 * useless unless the write is committed on close or
X +		 * fsynced, since otherwise NMODIFIED remains set so
X +		 * the next nfs_open() will still invalidate the write
X +		 * buffers.  Unfortunately, this cannot be placed in
X +		 * ncl_flush() where NMODIFIED is cleared, since
X +		 * credentials are unavailable there for at least
X +		 * calls by nfs_fsync().
X +		 */
X +		mtx_lock(&(VTONFS(vp))->n_mtx);
X +		VTONFS(vp)->n_mtime = nfsva.na_mtime;
X +		mtx_unlock(&(VTONFS(vp))->n_mtx);
X +	}
X  	if (error && NFS_ISV4(vp))
X  		error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
X  	return (error);
X @@ -2613,9 +2630,8 @@
X  }
X  
X  /*
X - * fsync vnode op.  Just call ncl_flush() with commit == 1.
X + * fsync vnode op.
X  */
X -/* ARGSUSED */
X  static int
X  nfs_fsync(struct vop_fsync_args *ap)
X  {
X @@ -2622,8 +2638,12 @@
X  
X  	if (ap->a_vp->v_type != VREG) {
X  		/*
X +		 * XXX: this comment is misformatted (after fixing its
X +		 * internal errors) and misplaced.
X +		 *
X  		 * For NFS, metadata is changed synchronously on the server,
X -		 * so there is nothing to flush.  Also, ncl_flush() clears
X +		 * so the only thing to flush is data for regular files.
X +		 * Also, ncl_flush() clears
X  		 * the NMODIFIED flag and that shouldn't be done here for
X  		 * directories.
X  		 */

> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order.  This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order.  I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear.
> For the iozone benchmark, there is only
> 1 stream, and multiple nfsiod's pessimize it into multiple streams
> that give buffers which arrive out of order on the server if the
> multiple nfsiod's are actually active.  I use the following
> configuration to ameliorate this, but the slowness factor is still
> often about 2 for iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
>   using a smaller block size usually helps by giving some delayed
>   writes which can be clustered better.  (The non-nfs parts of the
>   server could be smarter and do this intentionally.  The
>   out-of-order buffers look like random writes to the server.)  16K
>   i/o sizes otherwise work OK, but 32K i/o sizes are much slower for
>   unknown reasons.

Size 16K seems to work better now.  I also use:
- turn off most interrupt moderation.  This reduces (ping) latency
  from ~125 usec to ~75 usec for em on PCIe (after already turning off
  interrupt moderation on the server to reduce it from 150-200 usec).
  75 usec is still a lot, though it is about 3 times lower than the
  default misconfiguration.  Downgrading to older lem on PCI/33
  reduces it to 52.  Downgrading to DEVICE_POLLING reduces it to about
  40.  The downgrades are upgrades :-(.  Not using a switch reduces it
  by about another 20.

Low latency is important for small i/o's.  I was surprised that it
also helps a lot for large i/o's.  Apparently it changes the timing
enough to reduce the out-of-order buffers significantly.

The default misconfiguration with 20 nfsiod's is worse than I expected
(on an 8 core system).  For (old) "iozone auto", which starts with a
file size of 1MB, the write speed is about 2MB/sec with 20 nfsiod's
and 22MB/sec with 1 nfsiod.  2-4 nfsiod's work best.  They give
30-40MB/sec for most file sizes.  Apparently, with 20 nfsiod's the
write of 1MB is split up into almost twenty pieces of 50K each (6 or 7
8K buffers each), and the final order is perhaps even worse than
random.
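For reference, the configuration above (4 nfsd's, 4 nfsiod's, 8K i/o
sizes) could be applied roughly as below.  This is only a sketch: the
rc.conf flags and mount options are standard, but the nfsiod sysctl
names vary between the old and new clients and FreeBSD versions, so
treat them as assumptions to be checked against `sysctl -a` on the
actual system.

```shell
#!/bin/sh
# Server side: limit nfsd threads to 4 via rc.conf (then restart nfsd):
#   nfs_server_enable="YES"
#   nfs_server_flags="-u -t -n 4"

# Client side: limit nfsiod threads to 4.  Knob names are an
# assumption (old-client style); verify with: sysctl -a | grep iod
sysctl vfs.nfs.iodmin=0
sysctl vfs.nfs.iodmax=4

# Client side: limit NFS i/o sizes to 8K at mount time:
mount -t nfs -o nfsv3,rsize=8192,wsize=8192 server:/export /mnt
```

The i/o size could also be set per-mount in /etc/fstab with the same
rsize/wsize options.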
I think it is basically sequential with some seeks for all file sizes
between 1MB and many MB.

I also use:
- no PREEMPTION and no IPI_PREEMPTION on SMP systems.  This limits
  context switching.
- no SCHED_ULE.  HZ = 100.  This also limits context switching.

With more or fairer context switching, all nfsiods are more likely to
run, causing more damage.

More detailed results for "iozone 1 65536" with nfsiodmax=64 and
oldnfs and mostly best known other tuning:
- first run: write speed 2MB/S (probably still using 20) (all rates
  use disk marketing MB)
- second run: 9MB/S
- after repeated runs: 250MB/S
- the speed kept mostly dropping, and reached 21K/S
- server stats for next run at 29K/S: 139 blocks tested and order of
  24 fixed (the server has an early version of what is in -current,
  with more debugging)

with nfsiodmax=20:
- most runs 2-2.2MB/S; one at 750K/S
- server stats for a run at 2.2MB/S: 135 blocks tested and 86 fixed

with nfsiodmax=4:
- 5.8-6.5MB/S
- server stats for a run at 6.0MB/S: 135 blocks tested and 0 fixed

with nfsiodmax=2:
- 4.8-5.2MB/S
- server stats for a run at 5.1MB/S: 138 blocks tested and 0 fixed

with nfsiodmax=1:
- 3.4MB/S
- server stats: 138 blocks tested and 0 fixed

For "iozone 512 65536":

with nfsiodmax=1:
- 34.7MB/S
- server stats: 65543 blocks tested and 0 fixed

with nfsiodmax=2:
- 45.9MB/S (this is close to the drive's speed and faster than direct
  on the server.  It is faster because the clustering accidentally
  works better)
- server stats: 65550 blocks tested and 578 fixed

with nfsiodmax=4:
- 45.6MB/S
- server stats: 65550 blocks tested and 2067 fixed

with nfsiodmax=20:
- 21.4MB/S
- server stats: 65576 blocks tested and 12057 fixed
  (it is easy to see how 7 nfsiods could give 1/7 = 14% of blocks out
  of order.
The server is fixing up almost 20%, but that is not enough)

with nfsiodmax=64 (caused server to not respond):
- test aborted at 500+MB
- server stats: about 10000 blocks fixed

with nfsiodmax=64 again:
- 9.6MB/S
- server stats: 65598 blocks tested and 14034 fixed

The nfsiod's get scheduled almost equally.

Bruce