Date: Sun, 28 Feb 2016 15:38:58 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Message-ID: <20160228143858.W2389@besplex.bde.org>
In-Reply-To: <813250886.11776455.1456622740632.JavaMail.zimbra@uoguelph.ca>
References: <20160226164613.N2180@besplex.bde.org> <1403082388.11082060.1456545103011.JavaMail.zimbra@uoguelph.ca> <20160227153409.W1735@besplex.bde.org> <813250886.11776455.1456622740632.JavaMail.zimbra@uoguelph.ca>
On Sat, 27 Feb 2016, Rick Macklem wrote:

> Bruce Evans wrote:
>> ...
>> I blamed newnfs before :-), but when I looked at newnfs more closely I
>> found that it was almost the same lexically in the most interesting
>> places (but unfortunately has lexical differences from s/nfs/ncl/,
>> and doesn't have enough of these differences for debugging --
>> debugging is broken by having 2 static functions named nfs_foo() for
>> many values of foo).  But newnfs seems to have always been missing this
>> critical code:
>>
>> X   1541 rgrimes int
>> X  83651 peter   nfs_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
>> X 158739 mohans      int *iomode, int *must_commit)
>> X   1541 rgrimes {
>> X   9336 dfr             if (v3) {
>> X   9336 dfr                     wccflag = NFSV3_WCCCHK;
>> X   ...
>> X 158739 mohans         }
>> X 158739 mohans         if (wccflag) {
>> X 158739 mohans                 mtx_lock(&(VTONFS(vp))->n_mtx);
>> X 158739 mohans                 VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
>>                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> X 158739 mohans                 mtx_unlock(&(VTONFS(vp))->n_mtx);
>> X 158739 mohans         }
>>
> Well, this code does exist in the new client.  The function is nfsrpc_writerpc()
> found in sys/fs/nfsclient/nfs_clrpcops.c.  It calls nfscl_wcc_data() which does
> the same test as nfsm_wcc_data_xx() did in the old client.
> Then NFSWRITERPC_SETTIME() sets the time if wccflag was set.
>
> However...NFSWRITERPC_SETTIME() is broken and sets n_mtime from the one in
> the nfs vnode (same as the old code).  Unfortunately the cached value in the
> nfs vnode hasn't been updated at that point.  (The RPC layer functions don't
> do cache stuff.)
>
> The attached patch fixes this.  Please test with the attached patch.

This passes my tests.  newnfs now works slightly better than oldnfs in
FreeBSD-10 (higher throughput in the re-read test and fewer read RPCs in
compile tests).  The negative regression in read RPCs might be a bug
somewhere.  In a larger compile test, FreeBSD-11 has the opposite problem
of many more read RPCs.  I haven't started isolating this.
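To make the WCC handling above easier to follow, here is a minimal userland
sketch of the idea (this is not the real nfsrpc_writerpc() or
NFSWRITERPC_SETTIME() code; the struct and function names below are made-up
stand-ins): if the pre-op mtime returned in the WRITE reply matches the
client's cached n_mtime, only this client changed the file, so n_mtime can
be rolled forward to the post-op value taken from the reply itself (not from
a cached copy that hasn't been updated yet) and the cache need not be flushed.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical, simplified stand-ins for the client's per-file NFS node
 * and the wcc_data returned in an NFSv3 WRITE reply. */
struct nfsnode_sim {
	struct timespec n_mtime;	/* mtime the client believes the server has */
	bool		attrs_valid;	/* cached attributes still trusted? */
};

struct wcc_data_sim {
	bool		have_pre;	/* pre-op attributes present in the reply */
	struct timespec	pre_mtime;	/* server mtime before our write */
	struct timespec	post_mtime;	/* server mtime after our write */
};

static bool
ts_equal(struct timespec a, struct timespec b)
{
	return (a.tv_sec == b.tv_sec && a.tv_nsec == b.tv_nsec);
}

/*
 * If the pre-op mtime matches what we had cached, our write was the only
 * change, so advance n_mtime to the post-op value from the reply.
 * Otherwise another client wrote the file and the cached attributes must
 * be invalidated, which later costs GETATTR and READ RPCs.
 */
static void
write_reply_wcc(struct nfsnode_sim *np, const struct wcc_data_sim *wcc)
{
	if (wcc->have_pre && ts_equal(wcc->pre_mtime, np->n_mtime))
		np->n_mtime = wcc->post_mtime;
	else
		np->attrs_valid = false;
}

int
main(void)
{
	struct nfsnode_sim np = {
		.n_mtime = { .tv_sec = 100, .tv_nsec = 0 },
		.attrs_valid = true,
	};
	struct wcc_data_sim wcc = {
		.have_pre = true,
		.pre_mtime = { .tv_sec = 100, .tv_nsec = 0 },	/* matches cache: only we wrote */
		.post_mtime = { .tv_sec = 101, .tv_nsec = 0 },
	};

	write_reply_wcc(&np, &wcc);
	printf("attrs_valid=%d n_mtime=%lld\n", np.attrs_valid,
	    (long long)np.n_mtime.tv_sec);
	return (0);
}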
Previous tests isolated the following regressions in the number of RPCs:

- r208602:208603 gives more correct attribute cache clearing but more RPCs
- r209947:209948 gives the same
- r247115:247116 gives a 5-10% increase in sys time for makeworld by doing
  lots of locking in a slow way (thread_lock() in sigdefer/allowstop()).
  Removing the thread locking and doing racy accesses to td_flags reduces
  this overhead to below 5%.  Removing the calls to sigdefer/allowstop()
  reduces it to a hard-to-measure amount (this uses even more flags in
  VFS_PROLOGUE/EPILOGUE() and doesn't remove the overhead for other file
  systems, but branch prediction apparently works well unless there are
  function calls there).

For building a single kernel, in FreeBSD-[8-10] the RPC counts are down by
about 10% relative to my reference version (which has an old implementation
of negative name caching, cto improvements, and a hack to work around broken
dotdot name caching), but for makeworld they are up by about 5%.  The
increase is probably another namecache or dirents bug, or just from
different cache timeouts.

> ...
>> I tried this to see if it would fix the unordered writes.  I didn't
>> expect it to do much because I usually only have a single active
>> client and a single active writer per file.  It didn't make much
>> difference.

Today I tested on a 2-core system.  The loss from extra nfsiods was much
smaller than on a faster 8-core system with each core 2-3 times as fast and
a slower (max throughput 28 MB/s, down from 70) but slightly lower latency
network (ping latency 63 usec, down from 75) (no loss up to a file size of
about 32MB).

> ...
>>> The patches were all client side.  Maybe I'll try and recreate them.
>>
>> It seems to require lots of communication between separate nfsiods to
>> even preserve an order that has carefully been set up for them.  If
>> you have this then it is unclear why it can't be done more simply using
>> a single nfsiod thread (per NIC or ifq).  Only 1 thread should talk to
>> the NIC/ifq, since you lose control if you put other threads in between.
>> If the NIC/ifq uses multiple threads then maintaining the order is its
>> problem.
>
> Yes.  The patches I had didn't guarantee complete ordering in part because
> the one was for the nfsiods and the other for the RPC request in the krpc
> code.  I will try and redo them one of these days.
>
> I will mention that the NFS RFCs have no requirement nor recommendation for
> ordering.  The RPCs are assumed to be atomic entities.  Having said that, I
> do believe that FreeBSD servers will perform better when the reads/writes
> are in sequentially increasing byte order, so this is worth working on.

FreeBSD servers do lots of clustering, but not very well.  Write clustering
is easier than read clustering, since it is reasonable to wait a bit to
combine writes.  A single client with a single writer is an easy case.  For
a single client with multiple writers, I think the nfsiods should try to
keep the writes sequential for each writer but not across writers.  This
gives an order similar to local writers.  FreeBSD servers (except maybe zfs
ones) don't handle multiple writers well, but an nfs client can't be
expected to understand the server's deficiencies better than the server does
so that they can be avoided.

Bruce
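As a postscript, a toy userland illustration of the write-coalescing idea
above (this is not the actual nfsiod, krpc, or server clustering code; every
name here is a made-up stand-in): queued asynchronous writes are sorted by
offset and contiguous ones merged, so a writer's stream reaches the server
as a small number of sequential WRITE RPCs.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical pending asynchronous write; offset and length in bytes. */
struct pending_write {
	long long off;
	long long len;
};

static int
cmp_off(const void *a, const void *b)
{
	const struct pending_write *wa = a, *wb = b;

	return ((wa->off > wb->off) - (wa->off < wb->off));
}

/*
 * Sort a batch of queued writes by file offset and merge the contiguous
 * ones, returning the number of WRITE RPCs that remain to be sent.
 */
static size_t
coalesce(struct pending_write *q, size_t n)
{
	size_t out;

	if (n == 0)
		return (0);
	qsort(q, n, sizeof(q[0]), cmp_off);
	out = 0;
	for (size_t i = 1; i < n; i++) {
		if (q[i].off == q[out].off + q[out].len)
			q[out].len += q[i].len;		/* contiguous: extend */
		else
			q[++out] = q[i];		/* gap: new RPC */
	}
	return (out + 1);
}

int
main(void)
{
	/* Writes arriving out of order, e.g. from several nfsiods. */
	struct pending_write q[] = {
		{ 8192, 4096 }, { 0, 4096 }, { 4096, 4096 }, { 65536, 4096 },
	};
	size_t n = coalesce(q, sizeof(q) / sizeof(q[0]));

	for (size_t i = 0; i < n; i++)
		printf("WRITE off=%lld len=%lld\n", q[i].off, q[i].len);
	return (0);
}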