From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 21:09:31 2011 Return-Path: Delivered-To: fs@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id BDA941065670; Thu, 25 Aug 2011 21:09:31 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 821DF8FC18; Thu, 25 Aug 2011 21:09:31 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 19EC946B06; Thu, 25 Aug 2011 17:09:31 -0400 (EDT) Received: from jhbbsd.localnet (unknown [209.249.190.124]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 9566C8A02E; Thu, 25 Aug 2011 17:09:30 -0400 (EDT) From: John Baldwin To: Bruce Evans Date: Thu, 25 Aug 2011 17:09:29 -0400 User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110617; KDE/4.5.5; amd64; ; ) References: <201108251347.45460.jhb@freebsd.org> <20110826043611.D2962@besplex.bde.org> In-Reply-To: <20110826043611.D2962@besplex.bde.org> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <201108251709.30072.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Thu, 25 Aug 2011 17:09:30 -0400 (EDT) Cc: Rick Macklem , fs@freebsd.org Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 25 Aug 2011 21:09:31 -0000 On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote: > On Thu, 25 Aug 2011, John Baldwin wrote: > > > I was doing some analysis of compiles over NFS at work recently and noticed > > from 'iostat 1' on the NFS server that all my NFS writes were always 16k > > writes (meaning that 
writes were never being clustered). I added some > > Did you see the old patches for this by Bjorn Gronwall? They went through > many iterations. He was mainly interested in the !async case and I was > mainly interested in the async case... Ah, no I had not seen these, thanks. > > and moved it into a function to compute a sequential I/O heuristic that > > could be shared by both reads and writes. I also updated the sequential > > heuristic code to advance the counter based on the number of 16k blocks > > in each write instead of just doing ++ to match what we do for local > > file writes in sequential_heuristic() in vfs_vnops.c. Using this did > > give me some measure of NFS write clustering (though I can't peg my > > disks at MAXPHYS the way a dd to a file on a local filesystem can). The > > I got close to it. The failure modes were mostly burstiness of i/o, where > the server buffer cache seemed to fill up so the client would stop sending > and stay stopped for too long (several seconds; enough to reduce the > throughput by 40-60%). Hmm, I can get writes up to around 40-50k, but not 128k. My test is to just dd from /dev/zero to a file on the NFS client using a blocksize of 64k or so. > > patch for these changes is at > > http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch > > > > (This also fixes a bug in the new NFS server in that it wasn't actually > > clustering reads since it never updated nh->nh_nextr.) 
> > > > Combining the two changes together gave me about a 1% reduction in wall > > time for my builds: > > > > +------------------------------------------------------------------------------+ > > |+ + ++ + +x++*x xx+x x x| > > | |___________A__|_M_______|_A____________| | > > +------------------------------------------------------------------------------+ > > N Min Max Median Avg Stddev > > x 10 1869.62 1943.11 1881.89 1886.12 21.549724 > > + 10 1809.71 1886.53 1869.26 1860.706 21.530664 > > Difference at 95.0% confidence > > -25.414 +/- 20.2391 > > -1.34742% +/- 1.07305% > > (Student's t, pooled s = 21.5402) > > > > One caveat: I tested both of these patches on the old NFS client and server > > on 8.2-stable. I then ported the changes to the new client and server and > > while I made sure they compiled, I have not tested the new client and server. > > Here is the version of Bjorn's patches that I last used (in 8-current in > 2008): > > % Index: nfs_serv.c > % =================================================================== > % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v > % retrieving revision 1.182 > % diff -u -2 -r1.182 nfs_serv.c > % --- nfs_serv.c 28 May 2008 16:23:17 -0000 1.182 > % +++ nfs_serv.c 1 Jun 2008 05:52:45 -0000 > % @@ -107,14 +107,18 @@ > % #define MAX_COMMIT_COUNT (1024 * 1024) > % > % -#define NUM_HEURISTIC 1017 > % +#define NUM_HEURISTIC 1031 /* Must be prime! */ > % +#define MAX_REORDERED_RPC 4 > % +#define HASH_MAXSTEP 0x3ff > % #define NHUSE_INIT 64 > % #define NHUSE_INC 16 > % #define NHUSE_MAX 2048 > % +#define NH_TAG(vp) ((uint32_t)((uintptr_t)vp / sizeof(struct vnode))) > % +CTASSERT(NUM_HEURISTIC > (HASH_MAXSTEP + 1)); Hmm, aside from 1017 not being prime (3 * 339), I'm not sure what the reasons for the rest of these changes are. 
> % > % static struct nfsheur { > % - struct vnode *nh_vp; /* vp to match (unreferenced pointer) */ > % - off_t nh_nextr; /* next offset for sequential detection */ > % - int nh_use; /* use count for selection */ > % - int nh_seqcount; /* heuristic */ > % + off_t nh_nextoff; /* next offset for sequential detection */ > % + uint32_t nh_tag; /* vp tag to match */ > % + uint16_t nh_use; /* use count for selection */ > % + uint16_t nh_seqcount; /* in units of logical blocks */ > % } nfsheur[NUM_HEURISTIC]; Hmm, not sure why this only stores the tag and uses uint16_t values here. The size difference is a few KB at best, and I'd rather store the full vnode to avoid oddities from hash collisions. > % @@ -131,8 +135,14 @@ > % static int nfs_commit_blks; > % static int nfs_commit_miss; > % +static int nfsrv_cluster_writes = 1; > % +static int nfsrv_cluster_reads = 1; > % +static int nfsrv_reordered_io; > % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, async, CTLFLAG_RW, &nfs_async, 0, ""); > % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_blks, CTLFLAG_RW, &nfs_commit_blks, 0, ""); > % SYSCTL_INT(_vfs_nfsrv, OID_AUTO, commit_miss, CTLFLAG_RW, &nfs_commit_miss, 0, ""); > % > % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_writes, CTLFLAG_RW, &nfsrv_cluster_writes, 0, ""); > % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, cluster_reads, CTLFLAG_RW, &nfsrv_cluster_reads, 0, ""); > % +SYSCTL_INT(_vfs_nfsrv, OID_AUTO, reordered_io, CTLFLAG_RW, &nfsrv_reordered_io, 0, ""); > % struct nfsrvstats nfsrvstats; > % SYSCTL_STRUCT(_vfs_nfsrv, NFS_NFSRVSTATS, nfsrvstats, CTLFLAG_RW, > % @@ -145,4 +155,73 @@ > % > % /* > % + * Detect sequential access so that we are able to hint the underlying > % + * file system to use clustered I/O when appropriate. 
> % + */ > % +static int > % +nfsrv_sequential_access(const struct uio *uio, const struct vnode *vp) > % +{ > % + struct nfsheur *nh; > % + unsigned hi, step; > % + int try = 8; > % + int nblocks, lblocksize; > % + > % + /* > % + * Locate best nfsheur[] candidate using double hashing. > % + */ > % + > % + hi = NH_TAG(vp) % NUM_HEURISTIC; > % + step = NH_TAG(vp) & HASH_MAXSTEP; > % + step++; /* Step must not be zero. */ > % + nh = &nfsheur[hi]; I can't speak to whether using a variable step makes an appreciable difference. I have not examined that in detail in my tests. > % + > % + while (try--) { > % + if (nfsheur[hi].nh_tag == NH_TAG(vp)) { > % + nh = &nfsheur[hi]; > % + break; > % + } > % + if (nfsheur[hi].nh_use > 0) > % + --nfsheur[hi].nh_use; > % + hi = hi + step; > % + if (hi >= NUM_HEURISTIC) > % + hi -= NUM_HEURISTIC; > % + if (nfsheur[hi].nh_use < nh->nh_use) > % + nh = &nfsheur[hi]; > % + } > % + > % + if (nh->nh_tag != NH_TAG(vp)) { /* New entry. */ > % + nh->nh_tag = NH_TAG(vp); > % + nh->nh_nextoff = uio->uio_offset; > % + nh->nh_use = NHUSE_INIT; > % + nh->nh_seqcount = 1; /* Initially assume sequential access. */ > % + } else { > % + nh->nh_use += NHUSE_INC; > % + if (nh->nh_use > NHUSE_MAX) > % + nh->nh_use = NHUSE_MAX; > % + } > % + > % + /* > % + * Calculate heuristic > % + */ > % + > % + lblocksize = vp->v_mount->mnt_stat.f_iosize; > % + nblocks = howmany(uio->uio_resid, lblocksize); This is similar to what I pulled out of sequential_heuristic() except that it doesn't hardcode 16k. There is a big comment above the 16k that says it isn't about the blocksize though, so I'm not sure which is most correct. I imagine we'd want to use the same strategy in both places though. Comment from vfs_vnops.c: /* * f_seqcount is in units of fixed-size blocks so that it * depends mainly on the amount of sequential I/O and not * much on the number of sequential I/O's. 
The fixed size * of 16384 is hard-coded here since it is (not quite) just * a magic size that works well here. This size is more * closely related to the best I/O size for real disks than * to any block size used by software. */ fp->f_seqcount += howmany(uio->uio_resid, 16384); > % + if (uio->uio_offset == nh->nh_nextoff) { > % + nh->nh_seqcount += nblocks; > % + if (nh->nh_seqcount > IO_SEQMAX) > % + nh->nh_seqcount = IO_SEQMAX; > % + } else if (uio->uio_offset == 0) { > % + /* Seek to beginning of file, ignored. */ > % + } else if (qabs(uio->uio_offset - nh->nh_nextoff) <= > % + MAX_REORDERED_RPC*imax(lblocksize, uio->uio_resid)) { > % + nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */ Ah, this is a nice touch! I had noticed reordered I/O's resetting my clustered I/O count. I should try this extra step. > % + } else > % + nh->nh_seqcount /= 2; /* Not sequential access. */ Hmm, this is a bit different as well. sequential_heuristic() just drops all clustering (seqcount = 1) here so I had followed that. I do wonder if this change would be good for "normal" I/O as well? (Again, I think it would do well to have "normal" I/O and NFS generally use the same algorithm, but perhaps with the extra logic to handle reordered writes more gracefully for NFS.) > % + > % + nh->nh_nextoff = uio->uio_offset + uio->uio_resid; Interesting. So this assumes the I/O never fails. > % @@ -1225,4 +1251,5 @@ > % vn_finished_write(mntp); > % VFS_UNLOCK_GIANT(vfslocked); > % + bwillwrite(); /* After VOP_WRITE to avoid reordering. */ > % return(error); > % } Hmm, this seems to be related to avoiding overloading the NFS server's buffer cache? > % @@ -1492,4 +1519,6 @@ > % } > % if (!error) { > % + if (nfsrv_cluster_writes) > % + ioflags |= nfsrv_sequential_access(uiop, vp); > % error = VOP_WRITE(vp, uiop, ioflags, cred); > % /* XXXRW: unlocked write. */ > % @@ -1582,4 +1611,5 @@ > % } > % splx(s); > % + bwillwrite(); /* After VOP_WRITE to avoid reordering. 
*/ > % return (0); > % } Ah, this code is no longer present in 8 (but is in 7). > % @@ -3827,5 +3857,9 @@ > % for_ret = VOP_GETATTR(vp, &bfor, cred, td); > % > % - if (cnt > MAX_COMMIT_COUNT) { > % + /* > % + * If count is 0, a flush from offset to the end of file > % + * should be performed according to RFC 1813. > % + */ > % + if (cnt == 0 || cnt > MAX_COMMIT_COUNT) { > % /* > % * Give up and do the whole thing This appears to be a separate standalone fix. > % @@ -3871,5 +3905,5 @@ > % bo = &vp->v_bufobj; > % BO_LOCK(bo); > % - while (cnt > 0) { > % + while (!error && cnt > 0) { > % struct buf *bp; > % > % @@ -3894,5 +3928,5 @@ > % bremfree(bp); > % bp->b_flags &= ~B_ASYNC; > % - bwrite(bp); > % + error = bwrite(bp); > % ++nfs_commit_miss; I think you can just do a 'break' here without having to modify the while condition. This also seems to be an unrelated bugfix. > % } else > % Index: nfs_syscalls.c > % =================================================================== > % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v > % retrieving revision 1.119 > % diff -u -2 -r1.119 nfs_syscalls.c > % --- nfs_syscalls.c 30 Jun 2008 20:43:06 -0000 1.119 > % +++ nfs_syscalls.c 2 Jul 2008 07:12:57 -0000 > % @@ -86,5 +86,4 @@ > % int nfsd_waiting = 0; > % int nfsrv_numnfsd = 0; > % -static int notstarted = 1; > % > % static int nfs_privport = 0; > % @@ -448,7 +447,6 @@ > % procrastinate = nfsrvw_procrastinate; > % NFSD_UNLOCK(); > % - if (writes_todo || (!(nd->nd_flag & ND_NFSV3) && > % - nd->nd_procnum == NFSPROC_WRITE && > % - procrastinate > 0 && !notstarted)) > % + if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE && > % + procrastinate > 0)) > % error = nfsrv_writegather(&nd, slp, > % nfsd->nfsd_td, &mreq); This no longer seems to be present in 8. > I have forgotten many details but can dig them out from large private > mails about this if you care. It looks like most of the above is about > getting the seqcount right; probably similar to what you did.
-current > is still missing the error check for the bwrite() -- that's the only > bwrite() in the whole server. However, the fix isn't good -- one obvious > bug is that a later successful bwrite() destroys the evidence of a > previous bwrite() error. One thing I had done was to use a separate set of heuristics for reading vs writing. However, that is possibly dubious (and we don't do it for local I/O), so I can easily drop that feature if desired. -- John Baldwin