Date: Sat, 27 Feb 2016 17:14:29 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Message-ID: <20160227153409.W1735@besplex.bde.org>
In-Reply-To: <1403082388.11082060.1456545103011.JavaMail.zimbra@uoguelph.ca>
References: <20160226164613.N2180@besplex.bde.org> <1403082388.11082060.1456545103011.JavaMail.zimbra@uoguelph.ca>
On Fri, 26 Feb 2016, Rick Macklem wrote:

> Bruce Evans wrote:
>> nfs3 is slower than in old versions of FreeBSD.  I debugged one of the
>> reasons today.
>>
>> Writes have apparently always done silly caching.  Typical behaviour
>> is for iozone writing a 512MB file where the file fits in the buffer
>> cache/VMIO.  The write is cached perfectly.  But then when nfs_open()
>> reopens the file, it calls vinvalbuf() to discard all of the cached
>> data.  Thus nfs write caching usually discards useful older data to
>> ...
>> I think not committing in close is supposed to be an optimization, but
>> it is actually a pessimization for my kernel build tests (with object
>> files on nfs, which I normally avoid).  Builds certainly have to reopen
>> files after writing them, to link them and perhaps to install them.
>> This causes the discarding.  My kernel build tests also do a lot of
>> utimes() calls which cause the discarding before commit-on-close can
>> avoid the above cause for it by clearing NMODIFIED.  Enabling
>> commit-on-close gives a small optimisation with oldnfs by avoiding all
>> of the discarding except for utimes().  It reduces read RPCs by about
>> 25% without increasing write RPCs or real time.  It decreases real time
>> by a few percent.
>>
> Well, the new NFS client code was cloned from the old one (about FreeBSD7).
> I did this so that the new client wouldn't exhibit different caching
> behaviour than the old one (avoiding any POLA).
> If you look in stable/10/sys/nfsclient/nfs_vnops.c and stable/10/sys/fs/nfsclient/nfs_clvnops.c
> at the nfs_open() and nfs_close() functions, the algorithm appears to be
> identical for NFSv3.  (The new one has a bunch of NFSv4 gunk, but if you
> scratch out that stuff and ignore function name differences (nfs_flush() vs
> ncl_flush()), I think you'll find them the same.  I couldn't spot any
> differences at a glance.)
> --> see r214513 in head/sys/fs/nfsclient/nfs_clvnops.c for example

I blamed newnfs before :-), but when I looked at newnfs more closely I
found that it was almost the same lexically in the most interesting
places (but unfortunately has lexical differences from s/nfs/ncl/, yet
doesn't have enough of these differences for debugging -- debugging is
broken by having 2 static functions named nfs_foo() for many values of
foo).  But newnfs seems to have always been missing this critical code:

X   1541 rgrimes int
X  83651   peter nfs_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
X 158739  mohans     int *iomode, int *must_commit)
X   1541 rgrimes {
X   9336     dfr         if (v3) {
X   9336     dfr                 wccflag = NFSV3_WCCCHK;
X ...
X 158739  mohans         }
X 158739  mohans         if (wccflag) {
X 158739  mohans                 mtx_lock(&(VTONFS(vp))->n_mtx);
X 158739  mohans                 VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
X 158739  mohans                 mtx_unlock(&(VTONFS(vp))->n_mtx);
X 158739  mohans         }

This was in 4.4BSD-Lite1 under a slightly (?) different condition.

BTW, how do you use svn to see the history of removed files?
nfs_vnops.c has been removed in -current.  I can find it in other
branches, but it is hard to find there even if you know where it is.
This is no better than in cvs, where to find its full history I have
to know where it is in 3 different repositories that I have online and
more that I should have.

>> The other reason for discarding is because the timestamps changed -- you
>> just wrote them, so the timestamps should have changed.  Different bugs
>> in comparing the timestamps gave different misbehaviours.
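(Both reasons for the discarding end up next to each other in nfs_open().
From memory, and leaving out the NFSv4 and directory cases and all of the
error handling, the check reduces to roughly the following -- a simplified
sketch, not the exact code:

	if (np->n_flag & NMODIFIED) {
		/* Uncommitted dirty data: flush it and invalidate. */
		error = ncl_vinvalbuf(vp, V_SAVE, td, 1);
		if (error == 0 && VOP_GETATTR(vp, &vattr, cred) == 0)
			np->n_mtime = vattr.va_mtime;
	} else if (VOP_GETATTR(vp, &vattr, cred) == 0 &&
	    NFS_TIMESPEC_COMPARE(&np->n_mtime, &vattr.va_mtime)) {
		/*
		 * The mtime moved (possibly because our own write never
		 * updated n_mtime): discard everything we cached.
		 */
		error = ncl_vinvalbuf(vp, V_SAVE, td, 1);
		np->n_mtime = vattr.va_mtime;
	}

The second branch is the one that throws away perfectly good cached data
after a reopen when n_mtime was never brought up to date by the write.)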
>>
>> In old versions of FreeBSD and/or nfs, the timestamps had seconds
>> granularity, so many changes were missed.  This explains mysterious
>> behaviours by iozone 10-20 years ago: the write caching is seen to
>> work perfectly for most small total sizes, since all the writes take
>> less than 1 second so the timestamps usually don't change (but sometimes
>> the writes lie across a seconds boundary so the timestamps do change).
>>
>> oldnfs was fixed many years ago to use timestamps with nanoseconds
>> resolution, but it doesn't suffer from the discarding in nfs_open()
>> in the !NMODIFIED case which is reached by either fsync() before close
>> or commit on close.  I think this is because it updates n_mtime to
>> the server's new timestamp in nfs_writerpc().  This seems to be wrong,
>> since the file might have been written to by other clients and then
>> the change would not be noticed until much later if ever (setting the
>> timestamp prevents seeing it change when it is checked later, but you
>> might be able to see another metadata change).
>>
>> newnfs has quite different code for nfs_writerpc().  Most of it was
>> moved to another function in another file.  I understand this even
>> less, but it doesn't seem to fetch the server's new timestamp or
>> update n_mtime in the v3 case.
>>
> I'm pretty sure it does capture the new attributes (including mtime) in
> the reply.  The function is called something like nfscl_loadattrcache().

Debugging shows that it loads the new attributes but doesn't clobber
n_mtime with them.  For a write test that takes 20 seconds, n_mtime
sticks at its original value and the server time advances with each
write by 20 seconds total (the server time only advances every second
if the server timestamp precision is only 1 second).
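If newnfs were to mimic the old behaviour, the update would presumably go
wherever the WRITE reply's wcc attributes have just been cached (somewhere
near the end of the write RPC path).  A sketch only -- I am assuming the
new client's usual np->n_mtx, np->n_vattr and na_mtime names, and "wccflag"
stands in for a hypothetical "the post-op attributes were valid and showed
no foreign change" test:

	/*
	 * After a successful v3 WRITE whose wcc data showed no change by
	 * anyone else, copy the server's new mtime into n_mtime so that
	 * the next nfs_open() doesn't see a "changed" file and invalidate.
	 */
	if (wccflag) {
		mtx_lock(&np->n_mtx);
		np->n_mtime = np->n_vattr.na_mtime;
		mtx_unlock(&np->n_mtx);
	}

That is just the quoted oldnfs code transliterated; whether it is correct
when there are multiple writers is the same question raised above for
oldnfs.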
> In general, close-to-open consistency isn't needed for most mounts.
> (The only case where it matters is when multiple clients are concurrently
> updating files.)
> - There are a couple of options that might help performance when doing
>   software builds on an NFS mount:
>   nocto (I remember you don't like the name)

I actually do like it except for its negative logic.  To turn it back on,
you would need to use nonocto, but IIRC the negative logic for that is
still broken (missing), so there is no way to turn it back on.

> - Actually, I can't remember why the code would still do the cache
>   invalidation in nfs_open() when this is set.  I wonder if the code
>   in nfs_open() should maybe avoid invalidating the buffer cache
>   when this is set?  (I need to think about this.)

I think it is technically correct for something to do the invalidation
if NMODIFIED is still set in nfs_open().  nocto shouldn't and doesn't
affect that.  nocto is checked only in nfs_lookup() and only affects
nfs_open() indirectly: its effect is that when nocto is not set,
nfs_lookup() clears n_attrstamp which causes nfs_lookup() to do more,
but hopefully still not cache invalidation.  Cache invalidation is also
done after a timeout and nocto doesn't affect that either.

I still leave nocto off except for testing.  I want to optimise the cto
case, and my reference benchmarks are with cto.

> noncontigwr - This one allows the writes to happen for byte aligned
>   chunks when they are non-contiguous without pushing the individual
>   writes to the server.  (Again, this shouldn't cause problems unless
>   multiple clients are writing to the file concurrently.)
> Both of these are worth trying for mounts where software builds are being
> done.

I tried this to see if it would fix the unordered writes.  I didn't
expect it to do much because I usually only have a single active client
and a single active writer per file.  It didn't make much difference.

With nfsiods misordering writes, this option might give another source
of silly writes.  After it merges writes to give perfect contiguity, you
send them to multiple nfsiods which might give perfect discontiguity
(worse than random) :-).

>> There are many other reasons why nfs is slower than in old versions.
>> One is that writes are more often done out of order.  This tends to
>> give a slowness factor of about 2 unless the server can fix up the
>> order.  I use an old server which can do the fixup for old clients but
>> not for newer clients starting in about FreeBSD-9 (or 7?).
> I actually thought this was mainly caused by the krpc that was introduced
> in FreeBSD7 (for both old and new NFS), separating the RPC from NFS.
> There are 2 layers in the krpc (sys/rpc/clnt_rc.c and sys/rpc/clnt_vc.c)
> that each use acquisition of a mutex to allow an RPC message to be sent.
> (Whichever thread happens to acquire the mutex first, sends first.)

I don't like the new krpc since it is larger and harder to debug
(especially for me since I don't understand the old krpc either :-),
but it is in FreeBSD-7 and in my main reference kernel r181717, and
these don't have so many unordered blocks for at least writing.

> I had a couple of patches that tried to keep the RPC messages more ordered.
> (They would not have guaranteed exact ordering.)  They seemed to help for
> the limited testing I could do, but since I wasn't seeing a lot of
> "out of order" reads/writes on my single core hardware, I couldn't verify

I usually use single core hardware too, but saw some problems with 2
cores, and now with 8 cores the problems seem to be fundamental.

> how well these patches worked.  mav@ was working on this at the time, but
> didn't get these patches tested either, from what I recall.
> --> Unfortunately, I seem to have lost these patches or I would have
>     attached them so you could try them.

Ouch.

> (I've cc'd mav@.  Maybe he'll have them lying about.  I think one was
>  related to the nfsiod and the other for either sys/rpc/clnt_rc.c or
>  sys/rpc/clnt_vc.c.)
>
> The patches were all client side.  Maybe I'll try and recreate them.

It seems to require lots of communication between separate nfsiods to
even preserve an order that has carefully been set up for them.  If you
have this then it is unclear why it can't be done more simply using a
single nfsiod thread (per NIC or ifq).  Only 1 thread should talk to the
NIC/ifq, since you lose control if you put other threads in between.  If
the NIC/ifq uses multiple threads then maintaining the order is its
problem.
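The idea is easier to see in userland terms.  Here is a minimal sketch of
the single-sender scheme using ordinary pthreads and made-up names (it is
not the nfsiod code, just the shape of it): any number of producers append
write requests to one FIFO, and exactly one worker drains it, so whatever
order was set up at queueing time is the order that goes on the wire.

	#include <sys/types.h>
	#include <sys/queue.h>
	#include <pthread.h>
	#include <stdlib.h>

	struct wreq {				/* one queued write RPC (made up) */
		STAILQ_ENTRY(wreq) link;
		off_t	offset;
		size_t	len;
	};

	static STAILQ_HEAD(, wreq) wq = STAILQ_HEAD_INITIALIZER(wq);
	static pthread_mutex_t wq_mtx = PTHREAD_MUTEX_INITIALIZER;
	static pthread_cond_t wq_cv = PTHREAD_COND_INITIALIZER;

	/* Called by any number of producers; the FIFO preserves arrival order. */
	void
	enqueue_write(struct wreq *r)
	{

		pthread_mutex_lock(&wq_mtx);
		STAILQ_INSERT_TAIL(&wq, r, link);
		pthread_cond_signal(&wq_cv);
		pthread_mutex_unlock(&wq_mtx);
	}

	/*
	 * The single sender: only this thread talks to the wire, so requests
	 * are sent in exactly the order they were queued.
	 */
	void *
	sender(void *arg)
	{
		struct wreq *r;

		(void)arg;
		for (;;) {
			pthread_mutex_lock(&wq_mtx);
			while (STAILQ_EMPTY(&wq))
				pthread_cond_wait(&wq_cv, &wq_mtx);
			r = STAILQ_FIRST(&wq);
			STAILQ_REMOVE_HEAD(&wq, link);
			pthread_mutex_unlock(&wq_mtx);
			/* send_write_rpc(r) would go here. */
			free(r);
		}
	}

Multiple senders pulling from the same queue would reintroduce exactly the
scheduling-dependent interleaving described above, while one sender per
NIC/ifq costs nothing more than the one queue lock.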
>> I suspect
>> that this is just because Giant locking in old clients gave accidental
>> serialization.  Multiple nfsiod's and/or nfsd's are clearly needed
>> for performance if you have multiple NICs serving multiple mounts.
> Shared vnode locks are also a factor, at least for reads.
> (Before shared vnode locks, the vnode lock essentially serialized all reads.)
>
> As you note, a single threaded benchmark test is quite different than a lot
> of clients with a lot of threads doing I/O on a lot of files concurrently.

It is also an important case for me.  I mainly want creating of object
files to be fast, and the cache invalidation and unordered blocks seem
to be relatively even larger in this case.

A typical compiler operation (if tmp or obj files are on nfs, which they
shouldn't be but sometimes are) is:

  cc -pipe	avoids creating intermediate file for preprocessor
  cc -c		creates intermediate .S file for some compilers (not clang)
		.S file is written to cache
		reopening .S file to actually use it invalidates cache
		(workaround: (1) enable commit on close to clear NMODIFIED,
		and (2) use 1-second timestamp resolution on server to break
		detection of the change, provided the file can be created
		and reopened without crossing a seconds boundary, or (3) use
		oldnfs)
  cc -c		creates intermediate .o file for all compilers
		similar considerations
  link step	uses .o files and invalidates their cache in most cases
		(workaround: as above, except the whole compile usually takes
		more than 1 second, so the timestamp resolution hack doesn't
		work)
  install step	similar considerations -- the linked file was the
		intermediate file for this step and reopening invalidates
		its cache.

> The bandwidth * delay product of your network interconnect is also a factor.
> The larger this is, the more bits you need to be in transit to "fill the data
> pipe".  You can increase the # of bits in transit by either using larger rsize/wsize
> or more read-ahead/write-behind.

I already have latency tuned to about 5-10 times smaller than on FreeBSD
cluster machines, with the result that most operations are even more than
5-10 times faster, due to smaller operations and most operations not
having special support for keeping pipes full (if that is possible at
all).
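For scale, the arithmetic behind the bandwidth * delay point is simple
enough to sketch.  The numbers below (1 Gbit/s link, 200 us round trip,
64K wsize) are only example figures, not measurements from my setup:

	#include <stdio.h>

	int
	main(void)
	{
		double bytes_per_sec = 1e9 / 8;	/* assumed 1 Gbit/s link */
		double rtt = 200e-6;		/* assumed 200 us round trip */
		double wsize = 64 * 1024;	/* bytes per write RPC */
		double bdp = bytes_per_sec * rtt;

		printf("BDP = %.0f bytes; %.2f outstanding %.0f-byte writes "
		    "fill the pipe\n", bdp, bdp / wsize, wsize);
		return (0);
	}

With numbers like these the BDP is only 25000 bytes, so a single
outstanding 64K write already more than fills the pipe; that is why
latency rather than pipelining dominates on a low-latency LAN.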
> It would be nice to figure out why your case is performing better on the
> old nfs client (and/or server).
>
> If you have a fairly recent FreeBSD10 system, you could try doing mounts
> with new vs old client (and no other changes) and see what differences
> occur.  (that would isolate new vs old from recent "old" and "really old")

Er, that is what I already did to isolate this problem.  I have oldnfs
and newnfs in about 50 test kernels and finally isolated this problem in
an up to date FreeBSD-10.

Bruce