Date: Sun, 28 Feb 2016 17:34:25 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Bruce Evans <brde@optusnet.com.au>
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Message-ID: <631790565.12474657.1456698865377.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <20160228143858.W2389@besplex.bde.org>
References: <20160226164613.N2180@besplex.bde.org> <1403082388.11082060.1456545103011.JavaMail.zimbra@uoguelph.ca> <20160227153409.W1735@besplex.bde.org> <813250886.11776455.1456622740632.JavaMail.zimbra@uoguelph.ca> <20160228143858.W2389@besplex.bde.org>
Bruce Evans wrote:
> On Sat, 27 Feb 2016, Rick Macklem wrote:
>
> > Bruce Evans wrote:
> >> ...
> >> I blamed newnfs before :-), but when I looked at newnfs more closely I
> >> found that it was almost the same lexically in the most interesting
> >> places (but unfortunately has lexical differences from s/nfs/ncl/, and
> >> doesn't have enough of these differences for debugging -- debugging is
> >> broken by having 2 static functions named nfs_foo() for many values of
> >> foo). But newnfs seems to have always been missing this critical code:
> >>
> >> X   1541 rgrimes int
> >> X  83651 peter   nfs_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
> >> X 158739 mohans      int *iomode, int *must_commit)
> >> X   1541 rgrimes {
> >> X   9336 dfr         if (v3) {
> >> X   9336 dfr             wccflag = NFSV3_WCCCHK;
> >> X ...
> >> X 158739 mohans      }
> >> X 158739 mohans      if (wccflag) {
> >> X 158739 mohans          mtx_lock(&(VTONFS(vp))->n_mtx);
> >> X 158739 mohans          VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
> >>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> X 158739 mohans          mtx_unlock(&(VTONFS(vp))->n_mtx);
> >> X 158739 mohans      }
> >>
> > Well, this code does exist in the new client. The function is
> > nfsrpc_writerpc(), found in sys/fs/nfsclient/nfs_clrpcops.c. It calls
> > nfscl_wcc_data(), which does the same test as nfsm_wcc_data_xx() did in
> > the old client. Then NFSWRITERPC_SETTIME() set the time if wccflag was
> > set.
> >
> > However...NFSWRITERPC_SETTIME() is broken and sets n_mtime from the one
> > in the nfs vnode (same as the old code). Unfortunately, the cached value
> > in the nfs vnode hasn't been updated at that point. (The RPC layer
> > functions don't do cache stuff.)
> >
> > The attached patch fixes this. Please test with the attached patch.
> This passes my tests. newnfs now works slightly better than oldnfs in
> FreeBSD-10 (higher throughput in the re-read test and fewer read RPCs in
> compile tests).
Thanks for testing it.
I can't do commits until mid-April, but will commit it then.

> The negative regression in read RPCs might be a bug somewhere. In a larger
> compile test FreeBSD-11 has the opposite problem of many more read RPCs.
Hmm. There is very little difference between the FreeBSD-10 and FreeBSD-11
NFS code. (I have MFC'd almost all of the changes.) Did you happen to look
and see if the buffer cache was about the same size for both -10 and -11?

> I haven't started isolating this. Previous tests isolated the following
> regressions in the number of RPCs:
> - r208602:208603 gives more correct attribute cache clearing but more RPCs
> - r209947:209948 gives the same
> - r247115:247116 gives a 5-10% increase in sys time for makeworld by doing
>   lots of locking in a slow way (thread_lock() in sigdefer/allowstop()).
>   Removing the thread locking and doing racy accesses to td_flags reduces
>   this overhead to below 5%. Removing the calls to sigdefer/allowstop()
>   reduces it to a hard-to-measure amount (this uses even more flags in
>   VFS_PROLOGUE/EPILOGUE()) and doesn't remove the overhead for other file
>   systems, but branch prediction apparently works well unless there are
>   function calls there.
>
> For building a single kernel, in FreeBSD-[8-10] the RPC counts are
> down by about 10% relative to my reference version (which has an old
> implementation of negative name caching, cto improvements, and a
> hack to work around broken dotdot namecaching), but for makeworld they
> are up by about 5%. The increase is probably another namecache or
> dirents bug, or just from different cache timeouts.
>
> > ...
> >> I tried this to see if it would fix the unordered writes. I didn't
> >> expect it to do much because I usually only have a single active
> >> client and a single active writer per file. It didn't make much
> >> difference.
> Today I tested on a 2-core system.
> The loss from extra nfsiods was
> much smaller than on a faster 8-core system with each core 2-3 times as
> fast and a slower (max throughput 28MB/s, down from 70) but slightly
> lower-latency network (ping latency 63 usec, down from 75) (no loss up to
> a file size of about 32MB).
>
> > ...
> >>> The patches were all client side. Maybe I'll try and recreate them.
> >>
> >> It seems to require lots of communication between separate nfsiods to
> >> even preserve an order that has carefully been set up for them. If you
> >> have this then it is unclear why it can't be done more simply using a
> >> single nfsiod thread (per NIC or ifq). Only 1 thread should talk to the
> >> NIC/ifq, since you lose control if you put other threads in between. If
> >> the NIC/ifq uses multiple threads then maintaining the order is its
> >> problem.
> >
> > Yes. The patches I had didn't guarantee complete ordering, in part
> > because one was for the nfsiods and the other for the RPC request in the
> > krpc code. I will try and redo them one of these days.
> >
> > I will mention that the NFS RFCs have no requirement nor recommendation
> > for ordering. The RPCs are assumed to be atomic entities. Having said
> > that, I do believe that FreeBSD servers will perform better when the
> > reads/writes are in sequentially increasing byte order, so this is worth
> > working on.
I will try and come up with another patch to try and improve ordering.

> FreeBSD servers do lots of clustering, but not very well.
>
> Write clustering is easier than read clustering since it is reasonable to
> wait a bit to combine writes.
>
> A single client with a single writer is an easy case. For a single client
> with multiple writers, I think the nfsiods should try to keep the writes
> sequential for each writer but not across writers. This gives an order
> similar to local writers.
> FreeBSD servers (except maybe zfs ones) don't
> handle multiple writers well, but an nfs client can't be expected to
> understand the server's deficiencies better than the server does so that
> they can be avoided.
The krpc just sees RPC requests (it doesn't really know they are even NFS,
although they are for the most part now). As such, the patch I did (and
will probably try and reproduce) just tried to keep the order (i.e. retain
the order that the krpc call got the RPC requests in). When I did a little
testing, the nfsiods didn't appear to be causing much reordering, but I
should take another look at this.

Thanks for all the info, rick

> Bruce
>
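[To make the "sequential per file, not across writers" idea concrete, here is a small userspace sketch. The struct and function names are hypothetical, not the actual krpc or nfsiod code: it just shows one cheap way a dispatch layer could bias a batch of queued write RPCs toward sequentially increasing offsets within each file before a single sender thread issues them.]

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor for a queued write RPC (not the real krpc type). */
struct write_req {
	uintptr_t file_id;	/* identifies the file, e.g. the vnode pointer */
	uint64_t offset;	/* starting byte offset of the write */
};

/*
 * Order by file first, then by increasing offset within each file --
 * roughly the per-writer sequential ordering discussed above, without
 * imposing any particular order across different files.
 */
static int
write_req_cmp(const void *a, const void *b)
{
	const struct write_req *ra = a, *rb = b;

	if (ra->file_id != rb->file_id)
		return ((ra->file_id < rb->file_id) ? -1 : 1);
	if (ra->offset != rb->offset)
		return ((ra->offset < rb->offset) ? -1 : 1);
	return (0);
}

/* Sort a batch of pending writes before handing them to one sender thread. */
static void
order_writes(struct write_req *reqs, size_t n)
{
	qsort(reqs, n, sizeof(reqs[0]), write_req_cmp);
}
```

Whether sorting a batch like this actually helps would depend on how large a batch the dispatch layer can accumulate before latency suffers; as noted above, it is also only useful to the extent that the server clusters sequential writes.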