Date: Sun, 28 Feb 2016 17:34:25 -0500 (EST)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Bruce Evans <brde@optusnet.com.au>
Cc: fs@freebsd.org
Subject: Re: silly write caching in nfs3
Message-ID: <631790565.12474657.1456698865377.JavaMail.zimbra@uoguelph.ca>
In-Reply-To: <20160228143858.W2389@besplex.bde.org>
References: <20160226164613.N2180@besplex.bde.org> <1403082388.11082060.1456545103011.JavaMail.zimbra@uoguelph.ca> <20160227153409.W1735@besplex.bde.org> <813250886.11776455.1456622740632.JavaMail.zimbra@uoguelph.ca> <20160228143858.W2389@besplex.bde.org>
Bruce Evans wrote:
> On Sat, 27 Feb 2016, Rick Macklem wrote:
>
> > Bruce Evans wrote:
> >> ...
> >> I blamed newnfs before :-), but when I looked at newnfs more closely I
> >> found that it was almost the same lexically in the most interesting
> >> places (but unfortunately has lexical differences from s/nfs/ncl/, and
> >> doesn't have enough of these differences for debugging -- debugging is
> >> broken by having 2 static functions named nfs_foo() for many values of
> >> foo). But newnfs seems to have always been missing this critical code:
> >>
> >> X   1541 rgrimes int
> >> X  83651 peter   nfs_writerpc(struct vnode *vp, struct uio *uiop, struct ucred *cred,
> >> X 158739 mohans      int *iomode, int *must_commit)
> >> X   1541 rgrimes {
> >> X   9336 dfr         if (v3) {
> >> X   9336 dfr             wccflag = NFSV3_WCCCHK;
> >> X ...
> >> X 158739 mohans      }
> >> X 158739 mohans      if (wccflag) {
> >> X 158739 mohans          mtx_lock(&(VTONFS(vp))->n_mtx);
> >> X 158739 mohans          VTONFS(vp)->n_mtime = VTONFS(vp)->n_vattr.va_mtime;
> >>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >> X 158739 mohans          mtx_unlock(&(VTONFS(vp))->n_mtx);
> >> X 158739 mohans      }
> >>
> > Well, this code does exist in the new client. The function is
> > nfsrpc_writerpc(), found in sys/fs/nfsclient/nfs_clrpcops.c. It calls
> > nfscl_wcc_data(), which does the same test as nfsm_wcc_data_xx() did in
> > the old client. Then NFSWRITERPC_SETTIME() set the time if wccflag was
> > set.
> >
> > However...NFSWRITERPC_SETTIME() is broken and sets n_mtime from the one
> > in the nfs vnode (same as the old code). Unfortunately, the cached value
> > in the nfs vnode hasn't been updated at that point. (The RPC layer
> > functions don't do cache stuff.)
> >
> > The attached patch fixes this. Please test with the attached patch.
> This passes my tests. newnfs now works slightly better than oldnfs in
> FreeBSD-10 (higher throughput in the re-read test and fewer read RPCs in
> compile tests).
Thanks for testing it.
I can't do commits until mid-April, but will commit it then.

> The negative regression in read RPCs might be a bug somewhere. In a larger
> compile test FreeBSD-11 has the opposite problem of many more read RPCs.
Hmm. There is very little difference between the FreeBSD-10 and FreeBSD-11
NFS code. (I have MFC'd almost all of the changes.) Did you happen to look
and see if the buffer cache was about the same size for both -10 and -11?

> I haven't started isolating this. Previous tests isolated the following
> regressions in the number of RPCs:
> - r208602:208603 gives more correct attribute cache clearing but more RPCs
> - r209947:209948 gives the same
> - r247115:247116 gives a 5-10% increase in sys time for makeworld by doing
>   lots of locking in a slow way (thread_lock() in sigdefer/allowstop()).
>   Removing the thread locking and doing racy accesses to td_flags reduces
>   this overhead to below 5%. Removing the calls to sigdefer/allowstop()
>   reduces it to a hard-to-measure amount (this uses even more flags in
>   VFS_PROLOGUE/EPILOGUE()) and doesn't remove the overhead for other file
>   systems, but branch prediction apparently works well unless there are
>   function calls there.
>
> For building a single kernel, in FreeBSD-[8-10] the RPC counts are
> down by about 10% relative to my reference version (which has an old
> implementation of negative name caching, cto improvements, and a
> hack to work around broken dotdot namecaching), but for makeworld they
> are up by about 5%. The increase is probably another namecache or
> dirents bug, or just from different cache timeouts.
>
> > ...
> >> I tried this to see if it would fix the unordered writes. I didn't
> >> expect it to do much because I usually only have a single active
> >> client and a single active writer per file. It didn't make much
> >> difference.
> Today I tested on a 2-core system.
> The loss from extra nfsiods was
> much smaller than on a faster 8-core system with each core 2-3 times as
> fast and a slower (max throughput 28MB/s, down from 70) but slightly
> lower-latency network (ping latency 63 usec, down from 75) (no loss up to
> a file size of about 32MB).
>
> > ...
> >>> The patches were all client side. Maybe I'll try and recreate them.
> >>
> >> It seems to require lots of communication between separate nfsiods to
> >> even preserve an order that has carefully been set up for them. If you
> >> have this then it is unclear why it can't be done more simply using a
> >> single nfsiod thread (per NIC or ifq). Only 1 thread should talk to the
> >> NIC/ifq, since you lose control if you put other threads in between. If
> >> the NIC/ifq uses multiple threads then maintaining the order is its
> >> problem.
> >
> > Yes. The patches I had didn't guarantee complete ordering, in part
> > because one was for the nfsiods and the other for the RPC request in the
> > krpc code. I will try and redo them one of these days.
> >
> > I will mention that the NFS RFCs have no requirement nor recommendation
> > for ordering. The RPCs are assumed to be atomic entities. Having said
> > that, I do believe that FreeBSD servers will perform better when the
> > reads/writes are in sequentially increasing byte order, so this is worth
> > working on.
I will try and come up with another patch to try and improve ordering.

> FreeBSD servers do lots of clustering, but not very well.
>
> Write clustering is easier than read clustering since it is reasonable to
> wait a bit to combine writes.
>
> A single client with a single writer is an easy case. For a single client
> with multiple writers, I think the nfsiods should try to keep the writes
> sequential for each writer but not across writers. This gives an order
> similar to local writers.
> FreeBSD servers (except maybe zfs ones) don't
> handle multiple writers well, but an nfs client can't be expected to
> understand the server's deficiencies better than the server does so that
> they can be avoided.
The krpc just sees RPC requests (it doesn't really know they are even NFS,
although they are for the most part now). As such, the patch I did (and
will probably try and reproduce) just tried to keep the order (i.e. retain
the order that the krpc call got the RPC requests in). When I did a little
testing, the nfsiods didn't appear to be causing much reordering, but I
should take another look at this.

Thanks for all the info, rick

> Bruce
>
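[To make the "sequential per file, not across writers" idea concrete, here is a small userspace sketch. The struct and function names are hypothetical, not the actual krpc or nfsiod code: it just shows one cheap way a dispatch layer could bias a batch of queued write RPCs toward sequentially increasing offsets within each file before a single sender thread issues them.]

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical descriptor for a queued write RPC (not the real krpc type). */
struct write_req {
	uintptr_t file_id;	/* identifies the file, e.g. the vnode pointer */
	uint64_t offset;	/* starting byte offset of the write */
};

/*
 * Order by file first, then by increasing offset within each file --
 * roughly the per-writer sequential ordering discussed above, without
 * imposing any particular order across different files.
 */
static int
write_req_cmp(const void *a, const void *b)
{
	const struct write_req *ra = a, *rb = b;

	if (ra->file_id != rb->file_id)
		return ((ra->file_id < rb->file_id) ? -1 : 1);
	if (ra->offset != rb->offset)
		return ((ra->offset < rb->offset) ? -1 : 1);
	return (0);
}

/* Sort a batch of pending writes before handing them to one sender thread. */
static void
order_writes(struct write_req *reqs, size_t n)
{
	qsort(reqs, n, sizeof(reqs[0]), write_req_cmp);
}
```

Whether sorting a batch like this actually helps would depend on how large a batch the dispatch layer can accumulate before latency suffers; as noted above, it is also only useful to the extent that the server clusters sequential writes.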