Date: Sat, 27 Feb 2016 15:21:02 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Bruce Evans <brde@optusnet.com.au>
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Message-ID: <20160227131353.V1337@besplex.bde.org>
In-Reply-To: <20160226164613.N2180@besplex.bde.org>
References: <20160226164613.N2180@besplex.bde.org>
On Fri, 26 Feb 2016, Bruce Evans wrote:

> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
> reasons today.
> ...
> oldnfs was fixed many years ago to use timestamps with nanosecond
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close.  I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
>
> newnfs has quite different code for nfs_writerpc().  Most of it was
> moved to another function in another file.  I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.

This quick fix seems to give the same behaviour as in oldnfs.  It also
fixes some bugs in comments in nfs_fsync() (where I tried to pass a
non-null cred, but none is available; the ARGSUSED bug is in many other
functions):

X Index: nfs_clvnops.c
X ===================================================================
X --- nfs_clvnops.c	(revision 296089)
X +++ nfs_clvnops.c	(working copy)
X @@ -1425,6 +1425,23 @@
X  	}
X  	if (DOINGASYNC(vp))
X  		*iomode = NFSWRITE_FILESYNC;
X +	if (error == 0 && NFS_ISV3(vp)) {
X +		/*
X +		 * Break seeing concurrent changes by other clients,
X +		 * since without this the next nfs_open() would
X +		 * invalidate our write buffers.  This is worse than
X +		 * useless unless the write is committed on close or
X +		 * fsynced, since otherwise NMODIFIED remains set so
X +		 * the next nfs_open() will still invalidate the write
X +		 * buffers.  Unfortunately, this cannot be placed in
X +		 * ncl_flush() where NMODIFIED is cleared since
X +		 * credentials are unavailable there for at least
X +		 * calls by nfs_fsync().
X +		 */
X +		mtx_lock(&(VTONFS(vp))->n_mtx);
X +		VTONFS(vp)->n_mtime = nfsva.na_mtime;
X +		mtx_unlock(&(VTONFS(vp))->n_mtx);
X +	}
X  	if (error && NFS_ISV4(vp))
X  		error = nfscl_maperr(uiop->uio_td, error, (uid_t)0, (gid_t)0);
X  	return (error);
X @@ -2613,9 +2630,8 @@
X  }
X  
X  /*
X - * fsync vnode op. Just call ncl_flush() with commit == 1.
X + * fsync vnode op.
X  */
X -/* ARGSUSED */
X  static int
X  nfs_fsync(struct vop_fsync_args *ap)
X  {
X @@ -2622,8 +2638,12 @@
X  
X  	if (ap->a_vp->v_type != VREG) {
X  		/*
X +		 * XXX: this comment is misformatted (after fixing its
X +		 * internal errors) and misplaced.
X +		 *
X  		 * For NFS, metadata is changed synchronously on the server,
X -		 * so there is nothing to flush.  Also, ncl_flush() clears
X +		 * so the only thing to flush is data for regular files.
X +		 * Also, ncl_flush() clears
X  		 * the NMODIFIED flag and that shouldn't be done here for
X  		 * directories.
X  		 */
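To see concretely what the n_mtime update protects against, here is a
sketch of the open-time consistency check that the comment in the patch
refers to (simplified and from memory, not copied from the tree; the
NMODIFIED path, locking and most error handling are left out):

	struct nfsnode *np = VTONFS(vp);
	struct vattr vattr;
	int error;

	/*
	 * Sketch only: at open time, fetch the server's current
	 * attributes and compare the cached mtime with them.
	 */
	error = VOP_GETATTR(vp, &vattr, ap->a_cred);
	if (error == 0 &&
	    (np->n_mtime.tv_sec != vattr.va_mtime.tv_sec ||
	     np->n_mtime.tv_nsec != vattr.va_mtime.tv_nsec)) {
		/*
		 * The file appears to have changed on the server, so
		 * throw away everything cached for this vnode and
		 * remember the new mtime.  Without the n_mtime update
		 * in the patch, our own committed write is what makes
		 * the timestamps differ and triggers this.
		 */
		error = ncl_vinvalbuf(vp, V_SAVE, curthread, 1);
		if (error == 0)
			np->n_mtime = vattr.va_mtime;
	}

In other words, without the update our own commit-on-close or fsync
changes the server's mtime, the next open trips this check, and the
buffers we just wrote are discarded for no good reason.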
> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order.  This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order.  I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?).  I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear.  For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active.  I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K.  The server fs block size is 16K, and
>   using a smaller block size usually helps by giving some delayed
>   writes which can be clustered better.  (The non-nfs parts of the
>   server could be smarter and do this intentionally.  The out-of-order
>   buffers look like random writes to the server.)  16K i/o sizes
>   otherwise work OK, but 32K i/o sizes are much slower for unknown
>   reasons.

Size 16K seems to work better now.  I also use:
- turn off most interrupt moderation.  This reduces (ping) latency from
  ~125 usec to ~75 usec for em on PCIe (after already turning off
  interrupt moderation on the server to reduce it from 150-200 usec).
  75 usec is still a lot, though it is about 3 times lower than the
  default misconfiguration.  Downgrading to older lem on PCI/33 reduces
  it to 52.  Downgrading to DEVICE_POLLING reduces it to about 40.  The
  downgrades are upgrades :-(.  Not using a switch reduces it by about
  another 20.  Low latency is important for small i/o's.  I was
  surprised that it also helps a lot for large i/o's.  Apparently it
  changes the timing enough to reduce the out-of-order buffers
  significantly.

The default misconfiguration with 20 nfsiod's is worse than I expected
(on an 8 core system).  For (old) "iozone auto" which starts with a
file size of 1MB, the write speed is about 2MB/sec with 20 nfsiod's and
22 MB/sec with 1 nfsiod.  2-4 nfsiod's work best.  They give 30-40MB/sec
for most file sizes.  Apparently, with 20 nfsiod's the write of 1MB is
split up into almost twenty pieces of 50K each (6 or 7 8K buffers each),
and the final order is perhaps even worse than random.  I think it is
basically sequential with about <number of nfsiods> seeks for all file
sizes between 1MB and many MB.

I also use:
- no PREEMPTION and no IPI_PREEMPTION on SMP systems.  This limits
  context switching.
- no SCHED_ULE.  HZ = 100.  This also limits context switching.

With more or fairer context switching, all nfsiods are more likely to
run, causing more damage.
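For anyone trying to reproduce this, the limits above correspond to
roughly the following (a sketch with the knob names from memory, not my
exact configuration; check them against your version, and server:/export
and /mnt are placeholders):

    # client: cap the number of nfsiod's (the "nfsiodmax" in the results below)
    sysctl vfs.nfs.iodmax=4
    # client: limit nfs i/o sizes to 8K on the test mount
    mount -t nfs -o rsize=8192,wsize=8192 server:/export /mnt
    # server (/etc/rc.conf): run only 4 nfsd threads
    nfs_server_flags="-u -t -n 4"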
More detailed results for iozone 1 65536 with nfsiodmax=64 and oldnfs
and mostly best known other tuning:
- first run write speed 2MB/S (probably still using 20)
  (all rates use disk marketing MB)
- second run 9MB/S
- after repeated runs, 250MB/S
- the speed kept mostly dropping, and reached 21K/S
- server stats for next run at 29K/S: 139 blocks tested and on the
  order of 24 fixed (the server has an early version of what is in
  -current, with more debugging)

with nfsiodmax=20:
- most runs 2-2.2MB/S; one at 750K/S
- server stats for a run at 2.2MB/S: 135 blocks tested and 86 fixed

with nfsiodmax=4:
- 5.8-6.5MB/S
- server stats for a run at 6.0MB/S: 135 blocks tested and 0 fixed

with nfsiodmax=2:
- 4.8-5.2MB/S
- server stats for a run at 5.1MB/S: 138 blocks tested and 0 fixed

with nfsiodmax=1:
- 3.4MB/S
- server stats: 138 blocks tested and 0 fixed

For iozone 512 65536:

with nfsiodmax=1:
- 34.7MB/S
- server stats: 65543 blocks tested and 0 fixed

with nfsiodmax=2:
- 45.9MB/S (this is close to the drive's speed and faster than direct
  on the server.  It is faster because the clustering accidentally
  works better)
- server stats: 65550 blocks tested and 578 fixed

with nfsiodmax=4:
- 45.6MB/S
- server stats: 65550 blocks tested and 2067 fixed

with nfsiodmax=20:
- 21.4MB/S
- server stats: 65576 blocks tested and 12057 fixed
  (it is easy to see how 7 nfsiods could give 1/7 = 14% of blocks out
  of order.  The server is fixing up almost 20%, but that is not enough)

with nfsiodmax=64 (caused server to not respond):
- test aborted at 500+MB
- server stats: about 10000 blocks fixed

with nfsiodmax=64 again:
- 9.6MB/S
- server stats: 65598 blocks tested and 14034 fixed

The nfsiod's get scheduled almost equally.

Bruce