From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Date: Sat, 27 Feb 2016 15:21:02 +1100 (EST)
Message-ID: <20160227131353.V1337@besplex.bde.org>
In-Reply-To: <20160226164613.N2180@besplex.bde.org>
References: <20160226164613.N2180@besplex.bde.org>
On Fri, 26 Feb 2016, Bruce Evans wrote:

> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> reasons today.
> ...
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close.  I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later, if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
>
> newnfs has quite different code for nfs_writerpc().  Most of it was
> moved to another function in another file.  I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.

This quick fix seems to give the same behaviour as in oldnfs.  It also
fixes some bugs in comments in nfs_fsync() (where I tried to pass a
non-null cred, but none is available; the ARGSUSED bug is in many
other functions):

X Index: nfs_clvnops.c
X ===================================================================
X --- nfs_clvnops.c	(revision 296089)
X +++ nfs_clvnops.c	(working copy)
X @@ -1425,6 +1425,23 @@
X  	}
X  	if (DOINGASYNC(vp))
X  		*iomode = NFSWRITE_FILESYNC;
X +	if (error == 0 && NFS_ISV3(vp)) {
X +		/*
X +		 * Break seeing concurrent changes by other clients,
X +		 * since without this the next nfs_open() would
X +		 * invalidate our write buffers.  This is worse than
X +		 * useless unless the write is committed on close or
X +		 * fsynced, since otherwise NMODIFIED remains set so
X +		 * the next nfs_open() will still invalidate the write
X +		 * buffers.  Unfortunately, this cannot be placed in
X +		 * ncl_flush() where NMODIFIED is cleared, since
X +		 * credentials are unavailable there for at least
X +		 * calls by nfs_fsync().
X +		 */
X +		mtx_lock(&(VTONFS(vp))->n_mtx);
X +		VTONFS(vp)->n_mtime = nfsva.na_mtime;
X +		mtx_unlock(&(VTONFS(vp))->n_mtx);
X +	}
X  	if (error && NFS_ISV4(vp))
X  		error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
X  	return (error);
X @@ -2613,9 +2630,8 @@
X  }
X  
X  /*
X - * fsync vnode op.  Just call ncl_flush() with commit == 1.
X + * fsync vnode op.
X  */
X -/* ARGSUSED */
X  static int
X  nfs_fsync(struct vop_fsync_args *ap)
X  {
X @@ -2622,8 +2638,12 @@
X  
X  	if (ap->a_vp->v_type != VREG) {
X  		/*
X +		 * XXX: this comment is misformatted (after fixing its
X +		 * internal errors) and misplaced.
X +		 *
X  		 * For NFS, metadata is changed synchronously on the server,
X -		 * so there is nothing to flush.  Also, ncl_flush() clears
X +		 * so the only thing to flush is data for regular files.
X +		 * Also, ncl_flush() clears
X  		 * the NMODIFIED flag and that shouldn't be done here for
X  		 * directories.
X  		 */

> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order.  This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order.  I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear.
> For the iozone benchmark, there is only
> 1 stream, and multiple nfsiod's pessimize it into multiple streams
> that give buffers which arrive out of order on the server if the
> multiple nfsiod's are actually active.  I use the following
> configuration to ameliorate this, but the slowness factor is still
> often about 2 for iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
>   using a smaller block size usually helps by giving some delayed
>   writes which can be clustered better.  (The non-nfs parts of the
>   server could be smarter and do this intentionally.  The
>   out-of-order buffers look like random writes to the server.)  16K
>   i/o sizes otherwise work OK, but 32K i/o sizes are much slower for
>   unknown reasons.

Size 16K seems to work better now.  I also use:
- turn off most interrupt moderation.  This reduces (ping) latency
  from ~125 usec to ~75 usec for em on PCIe (after already turning off
  interrupt moderation on the server to reduce it from 150-200 usec).
  75 usec is still a lot, though it is about 3 times lower than the
  default misconfiguration.  Downgrading to older lem on PCI/33
  reduces it to 52.  Downgrading to DEVICE_POLLING reduces it to about
  40.  The downgrades are upgrades :-(.  Not using a switch reduces it
  by about another 20.

Low latency is important for small i/o's.  I was surprised that it
also helps a lot for large i/o's.  Apparently it changes the timing
enough to reduce the out-of-order buffers significantly.

The default misconfiguration with 20 nfsiod's is worse than I expected
(on an 8 core system).  For (old) "iozone auto", which starts with a
file size of 1MB, the write speed is about 2MB/sec with 20 nfsiod's
and 22MB/sec with 1 nfsiod.  2-4 nfsiod's work best.  They give
30-40MB/sec for most file sizes.  Apparently, with 20 nfsiod's the
write of 1MB is split up into almost twenty pieces of 50K each (6 or 7
8K buffers each), and the final order is perhaps even worse than
random.
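For reference, the configuration above (4 nfsd's, 4 nfsiod's, 8K i/o
sizes) could be applied roughly as below.  This is only a sketch: the
rc.conf flags and mount options are standard, but the nfsiod sysctl
names vary between the old and new clients and FreeBSD versions, so
treat them as assumptions to be checked against `sysctl -a` on the
actual system.

```shell
#!/bin/sh
# Server side: limit nfsd threads to 4 via rc.conf (then restart nfsd):
#   nfs_server_enable="YES"
#   nfs_server_flags="-u -t -n 4"

# Client side: limit nfsiod threads to 4.  Knob names are an
# assumption (old-client style); verify with: sysctl -a | grep iod
sysctl vfs.nfs.iodmin=0
sysctl vfs.nfs.iodmax=4

# Client side: limit NFS i/o sizes to 8K at mount time:
mount -t nfs -o nfsv3,rsize=8192,wsize=8192 server:/export /mnt
```

The i/o size could also be set per-mount in /etc/fstab with the same
rsize/wsize options.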
I think it is basically sequential with some seeks for all file sizes
between 1MB and many MB.

I also use:
- no PREEMPTION and no IPI_PREEMPTION on SMP systems.  This limits
  context switching.
- no SCHED_ULE.  HZ = 100.  This also limits context switching.

With more or fairer context switching, all nfsiods are more likely to
run, causing more damage.

More detailed results for "iozone 1 65536" with nfsiodmax=64 and
oldnfs and mostly best known other tuning:
- first run: write speed 2MB/S (probably still using 20) (all rates
  use disk marketing MB)
- second run: 9MB/S
- after repeated runs: 250MB/S
- the speed kept mostly dropping, and reached 21K/S
- server stats for next run at 29K/S: 139 blocks tested and order of
  24 fixed (the server has an early version of what is in -current,
  with more debugging)

with nfsiodmax=20:
- most runs 2-2.2MB/S; one at 750K/S
- server stats for a run at 2.2MB/S: 135 blocks tested and 86 fixed

with nfsiodmax=4:
- 5.8-6.5MB/S
- server stats for a run at 6.0MB/S: 135 blocks tested and 0 fixed

with nfsiodmax=2:
- 4.8-5.2MB/S
- server stats for a run at 5.1MB/S: 138 blocks tested and 0 fixed

with nfsiodmax=1:
- 3.4MB/S
- server stats: 138 blocks tested and 0 fixed

For "iozone 512 65536":

with nfsiodmax=1:
- 34.7MB/S
- server stats: 65543 blocks tested and 0 fixed

with nfsiodmax=2:
- 45.9MB/S (this is close to the drive's speed and faster than direct
  on the server.  It is faster because the clustering accidentally
  works better)
- server stats: 65550 blocks tested and 578 fixed

with nfsiodmax=4:
- 45.6MB/S
- server stats: 65550 blocks tested and 2067 fixed

with nfsiodmax=20:
- 21.4MB/S
- server stats: 65576 blocks tested and 12057 fixed
  (it is easy to see how 7 nfsiods could give 1/7 = 14% of blocks out
  of order.
The server is fixing up almost 20%, but that is not enough)

with nfsiodmax=64 (caused server to not respond):
- test aborted at 500+MB
- server stats: about 10000 blocks fixed

with nfsiodmax=64 again:
- 9.6MB/S
- server stats: 65598 blocks tested and 14034 fixed

The nfsiod's get scheduled almost equally.

Bruce