Date: Fri, 26 Feb 2016 23:00:47 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Bruce Evans <brde@optusnet.com.au>
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Message-ID: <1347742231.11086226.1456545647628.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <20160226164613.N2180@besplex.bde.org>
References: <20160226164613.N2180@besplex.bde.org>

Bruce Evans wrote:
> nfs3 is slower than in old versions of FreeBSD. I debugged one of the
> reasons today.
>
> Writes have apparently always done silly caching. Typical behaviour
> is for iozone writing a 512MB file where the file fits in the buffer
> cache/VMIO. The write is cached perfectly. But then when nfs_open()
> reopens the file, it calls vinvalbuf() to discard all of the cached
> data. Thus nfs write caching usually discards useful older data to
> make space for newer data that will never be used (unless the
> file is opened r/w and read using the same fd (and is not accessed
> for a setattr or advlock operation -- these call vinvalbuf() too, if
> NMODIFIED)). The discarding may be delayed for a long time. Then
> keeping the useless data causes even more older data to be discarded.
> Discarding it on close would at least prevent further loss. It would
> have to be committed on close before discarding it of course.
> Committing it on close does some good things even without discarding
> there, and in oldnfs it gives a bug that prevents discarding in open --
> see below.
>
> nfs_open() does the discarding for different reasons in the NMODIFIED
> and !NMODIFIED cases. In the NMODIFIED case, it discards unconditionally.
> This case can be avoided by fsync() before close or setting the sysctl
> to commit in close. iozone does the fsync(). This helps in oldnfs but
> not in newnfs. With it, iozone on newnfs now behaves like it did on oldnfs
> 10-20 years ago. Something (perhaps just the timestamp bugs discussed
> later) "fixed" the discarding on oldnfs 5-10 years ago.
>
> I think not committing in close is supposed to be an optimization, but
> it is actually a pessimization for my kernel build tests (with object
> files on nfs, which I normally avoid). Builds certainly have to reopen
> files after writing them, to link them and perhaps to install them.
> This causes the discarding.
> My kernel build tests also do a lot of
> utimes() calls which cause the discarding before commit-on-close can
> avoid the above cause for it by clearing NMODIFIED. Enabling
> commit-on-close gives a small optimisation with oldnfs by avoiding all
> of the discarding except for utimes(). It reduces read RPCs by about
> 25% without increasing write RPCs or real time. It decreases real time
> by a few percent.
>
> The other reason for discarding is because the timestamps changed -- you
> just wrote them, so the timestamps should have changed. Different bugs
> in comparing the timestamps gave different misbehaviours.
>
You could easily test whether second-resolution timestamps make a
difference by redefining the NFS_TIMESPEC_COMPARE() macro
{ in sys/fs/nfsclient/nfsnode.h } so that it only compares the tv_sec
field and not the tv_nsec field.
--> Then the client would only think the mtime has changed when tv_sec changes.

rick

> In old versions of FreeBSD and/or nfs, the timestamps had seconds
> granularity, so many changes were missed. This explains mysterious
> behaviours by iozone 10-20 years ago: the write caching is seen to
> work perfectly for most small total sizes, since all the writes take
> less than 1 second so the timestamps usually don't change (but sometimes
> the writes lie across a seconds boundary so the timestamps do change).
>
> oldnfs was fixed many years ago to use timestamps with nanoseconds
> resolution, but it doesn't suffer from the discarding in nfs_open()
> in the !NMODIFIED case which is reached by either fsync() before close
> or commit on close. I think this is because it updates n_mtime to
> the server's new timestamp in nfs_writerpc(). This seems to be wrong,
> since the file might have been written to by other clients and then
> the change would not be noticed until much later if ever (setting the
> timestamp prevents seeing it change when it is checked later, but you
> might be able to see another metadata change).
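
The seconds-only comparison suggested above can be illustrated with a small
user-space sketch. It is not the kernel code: the names MTIME_CHANGED_NSEC,
MTIME_CHANGED_SEC and cache_would_be_discarded() are hypothetical, and the
first macro only assumes that the existing check treats a difference in
either tv_sec or tv_nsec as a change. The real definition in
sys/fs/nfsclient/nfsnode.h should be checked before editing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/*
 * ASSUMPTION: the existing comparison reports a change when either
 * field differs.  Hypothetical name, not the one used in nfsnode.h.
 */
#define MTIME_CHANGED_NSEC(T1, T2) \
	(((T1)->tv_sec != (T2)->tv_sec) || ((T1)->tv_nsec != (T2)->tv_nsec))

/*
 * Experimental variant: ignore tv_nsec, so the mtime only appears to
 * change when tv_sec changes.  Hypothetical name.
 */
#define MTIME_CHANGED_SEC(T1, T2) \
	((T1)->tv_sec != (T2)->tv_sec)

/*
 * Hypothetical helper: would a client that compares its cached mtime
 * with the server's mtime decide that its cached data is stale?
 */
static bool
cache_would_be_discarded(const struct timespec *cached,
    const struct timespec *server, bool seconds_only)
{
	assert(cached != NULL && server != NULL);
	if (seconds_only)
		return (MTIME_CHANGED_SEC(cached, server));
	return (MTIME_CHANGED_NSEC(cached, server));
}
```

With the seconds-only variant, a write that finishes within the same second
as the cached mtime is not seen as a change, which matches the old behaviour
described above where short iozone runs kept their cache. A write that
crosses a seconds boundary is still seen as a change by both variants.
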
>
> newnfs has quite different code for nfs_writerpc(). Most of it was
> moved to another function in another file. I understand this even
> less, but it doesn't seem to fetch the server's new timestamp or
> update n_mtime in the v3 case.
>
> There are many other reasons why nfs is slower than in old versions.
> One is that writes are more often done out of order. This tends to
> give a slowness factor of about 2 unless the server can fix up the
> order. I use an old server which can do the fixup for old clients but
> not for newer clients starting in about FreeBSD-9 (or 7?). I suspect
> that this is just because Giant locking in old clients gave accidental
> serialization. Multiple nfsiod's and/or nfsd's are clearly needed
> for performance if you have multiple NICs serving multiple mounts.
> Other cases are less clear. For the iozone benchmark, there is only
> 1 stream and multiple nfsiod's pessimize it into multiple streams that
> give buffers which arrive out of order on the server if the multiple
> nfsiod's are actually active. I use the following configuration to
> ameliorate this, but the slowness factor is still often about 2 for
> iozone:
> - limit nfsd's to 4
> - limit nfsiod's to 4
> - limit nfs i/o sizes to 8K. The server fs block size is 16K, and
>   using a smaller block size usually helps by giving some delayed
>   writes which can be clustered better. (The non-nfs parts of the
>   server could be smarter and do this intentionally. The out-of-order
>   buffers look like random writes to the server.) 16K i/o sizes
>   otherwise work OK, but 32K i/o sizes are much slower for unknown
>   reasons.
>
> Bruce
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"
>