Date: Sat, 27 Aug 2011 03:45:47 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client
Message-ID: <20110827012609.H859@besplex.bde.org>
In-Reply-To: <201108251709.30072.jhb@freebsd.org>
References: <201108251347.45460.jhb@freebsd.org> <20110826043611.D2962@besplex.bde.org> <201108251709.30072.jhb@freebsd.org>

On Thu, 25 Aug 2011, John Baldwin wrote:

> On Thursday, August 25, 2011 3:24:15 pm Bruce Evans wrote:
>> On Thu, 25 Aug 2011, John Baldwin wrote:
>>
>>> I was doing some analysis of compiles over NFS at work recently and
>>> noticed from 'iostat 1' on the NFS server that all my NFS writes were
>>> always 16k writes (meaning that writes were never being clustered).
>>> I added some
>>
>> Did you see the old patches for this by Bjorn Gronwall?  They went
>> through many iterations.  He was mainly interested in the !async case
>> and I was mainly interested in the async case...
>
> Ah, no I had not seen these, thanks.

I looked at your patches after writing the above.  They look very
similar, but the details are intricate.  Unfortunately, I forget most
of the details.

I reran some simple benchmarks (just iozone on a very old (~5.2) nfs
client with various mount options, with netstat and systat to watch the
resulting i/o on the server) on 3 different servers (~5.2 with Bjorn's
patches, 8-current-2008 with Bjorn's patches, and -current-2011-March).
The old client has many throughput problems, but strangely most of them
are fixed by changing the server.

>>> and moved it into a function to compute a sequential I/O heuristic
>>> that could be shared by both reads and writes.  I also updated the
>>> sequential heuristic code to advance the counter based on the number
>>> of 16k blocks in each write instead of just doing ++ to match what we
>>> do for local file writes in sequential_heuristic() in vfs_vnops.c.
>>> Using this did give me some measure of NFS write clustering (though I
>>> can't peg my disks at MAXPHYS the way a dd to a file on a local
>>> filesystem can).  The
>>
>> I got close to it.  The failure modes were mostly burstiness of i/o,
>> where the server buffer cache seemed to fill up so the client would
>> stop sending and stay stopped for too long (several seconds; enough to
>> reduce the throughput by 40-60%).
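For reference, sequential_heuristic() in vfs_vnops.c is approximately
the following (a reconstruction from memory, not verbatim; see the
source for details).  It counts in fixed 16K blocks and drops almost
all sequentiality on a non-sequential access:

/* Approximate reconstruction of sequential_heuristic() in
 * sys/kern/vfs_vnops.c; not verbatim. */
static int
sequential_heuristic(struct uio *uio, struct file *fp)
{
	/* Sequential if at the next expected offset, or at offset 0
	 * while some sequentiality is still established (open() sets
	 * f_seqcount to 1 so the first i/o counts as sequential). */
	if ((uio->uio_offset == 0 && fp->f_seqcount > 0) ||
	    uio->uio_offset == fp->f_nextoff) {
		/* Credit the amount of i/o in 16K blocks, not the
		 * number of i/o's, and clamp at IO_SEQMAX. */
		fp->f_seqcount += howmany(uio->uio_resid, 16384);
		if (fp->f_seqcount > IO_SEQMAX)
			fp->f_seqcount = IO_SEQMAX;
		return (fp->f_seqcount << IO_SEQSHIFT);
	}

	/* Not sequential.  Quickly draw down sequentiality. */
	if (fp->f_seqcount > 1)
		fp->f_seqcount = 1;
	else
		fp->f_seqcount = 0;
	return (0);
}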
> Hmm, I can get writes up to around 40-50k, but not 128k.  My test is to
> just dd from /dev/zero to a file on the NFS client using a blocksize of
> 64k or so.

I get mostly over 60K with old ata drivers that have a limit of 64K, and
close to 128K with not-so-old ata drivers that have a limit of 128K.
This is almost independent of the nfs client and server versions and
mount options.  I mostly tested async mounts, and mostly with an i/o
size of just 512 for iozone (old-iozone 1024 512).  It actually helps a
little to have a minimal i/o size at the syscall level (to minimize
latency at other levels; depends on the CPU keeping up and on the kernel
reblocking to better sizes).

Throughputs with client defaults (-U, -r8192(?), -w8192(?), async,
noatime), in 1e6 bytes/sec, were approximately:

                 write   read
local disk:      48      53
5.2 server:      46      39      some bug usually makes the read
                                 direction slow
8 server:        46      39
cur server:      32      50+(?)  writes 2/3 as fast due to not having
                                 the patches, but reads fixed (may also
                                 require tcp)

Async on the server makes little difference.  Contrary to what I said
before, async on the client makes a big difference (it controls FILESYNC
in a critical place).  Now with noasync on the client:

                 write   read
5.2 server:      15
8 server:        similar
cur server:      similar (worse, but not nearly 3/2 slower IIRC)

There are just too many sync writes without async.  But this is
apparently mostly due to the default udp r/w sizes being too small,
since tcp does much better, I think only due to its larger r/w sizes (I
mostly don't use it because it has worse latency and more bugs in old
nfs clients).  Now with noasync,-T [-r32768(?), -w32768(?)] on the
client:

                 write   read
5.2 server:      34      37
8 server:        40+ (?)
cur server:      not tested

The improvement is much larger for 8-server than for 5.2-server.  That
might be due to better tcp support, but I fear it is because 8-server is
missing my fixes for ffs_update().  (The server file system was always
ffs mounted async.  Long ago, I got dyson to make fsync sort of work
even when the file system is mounted async.  VOP_FSYNC() writes data but
not directory entries or inodes, except in my version it writes inodes.
But actually writing the inode for every nfs FILESYNC probably doubles
the number of i/o's.  This is ameliorated as usual by a large i/o size
at all levels, and by the disk lying about actually writing the data, so
that doubling the number of writes doesn't give a full 2 times slowdown
(I use old low-end ATA disks with write caching enabled).)

Now with async,-T [-r32768(?), -w32768(?)] on the client:

                 write   read
5.2 server:      37      40      example of tcp not working well
                                 with 5.2
8 server:        not carefully tested (similar to -U)
cur server:      not carefully tested (similar to -U)

In other tests, toggling tcp/udp and changing the block sizes makes
differences that are hard to explain but not very important.  Tcp only
magically fixes the case of a noasync client.  My LAN uses a cheap
switch but works almost perfectly for nfs over udp.

I now remember that Bjorn was most interested in improving clustering
for the noasync case.  Clustering should happen almost automatically for
the async case: lots of async writes should accumulate on the server and
be written out by a large cluster write, and any clustering at the nfs
level would just get in the way.  For the noasync case, FILESYNC will
get in the way whenever it happens, and it happens a lot, so I'm not
sure the server has much opportunity for clustering.
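To recall how the estimate actually reaches the clustering code: the nfs
server encodes it in the high bits of the ioflags passed to VOP_WRITE(),
and ffs recovers it there and hands it to cluster_write().  Roughly (a
condensed sketch from memory, not verbatim; error handling and the
DATASYNC case are elided):

/* nfsserver/nfs_serv.c:nfsrv_write(): the heuristic's estimate is
 * encoded in the high bits of the ioflags given to the file system. */
ioflags = IO_NODELOCKED;
if (stable != NFSV3WRITE_UNSTABLE)
	ioflags |= IO_SYNC;		/* client asked for a sync write */
ioflags |= nh->nh_seqcount << IO_SEQSHIFT;
error = VOP_WRITE(vp, uiop, ioflags, cred);

/* ufs/ffs/ffs_vnops.c:ffs_write(): the estimate is recovered ... */
seqcount = ioflag >> IO_SEQSHIFT;
/* ... and for delayed writes it ends up in cluster_write(), which is
 * what coalesces adjacent dirty buffers into large disk transfers. */
cluster_write(vp, bp, ip->i_size, seqcount);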
>>> patch for these changes is at
>>> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
>>>
>>> (This also fixes a bug in the new NFS server in that it wasn't actually
>>> clustering reads since it never updated nh->nh_nextr.)

I'm still looking for the bug that makes reads slower.  It doesn't seem
to be clustering.

>> Here is the version of Bjorn's patches that I last used (in 8-current
>> in 2008):
>>
>> % Index: nfs_serv.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/sys/nfsserver/nfs_serv.c,v
>> % retrieving revision 1.182
>> % diff -u -2 -r1.182 nfs_serv.c
>> % --- nfs_serv.c	28 May 2008 16:23:17 -0000	1.182
>> % +++ nfs_serv.c	1 Jun 2008 05:52:45 -0000
>> ...
>> % +	/*
>> % +	 * Locate best nfsheur[] candidate using double hashing.
>> % +	 */
>> % +
>> % +	hi = NH_TAG(vp) % NUM_HEURISTIC;
>> % +	step = NH_TAG(vp) & HASH_MAXSTEP;
>> % +	step++;			/* Step must not be zero. */
>> % +	nh = &nfsheur[hi];
>
> I can't speak to whether using a variable step makes an appreciable
> difference.  I have not examined that in detail in my tests.

Generally, only small differences can be made by tuning hash methods.

>> % +	/*
>> % +	 * Calculate heuristic
>> % +	 */
>> % +
>> % +	lblocksize = vp->v_mount->mnt_stat.f_iosize;
>> % +	nblocks = howmany(uio->uio_resid, lblocksize);
>
> This is similar to what I pulled out of sequential_heuristic() except
> that it doesn't hardcode 16k.  There is a big comment above the 16k
> that says it isn't about the blocksize though, so I'm not sure which is
> most correct.  I imagine we'd want to use the same strategy in both
> places though.  Comment from vfs_vnops.c:
>
>	/*
>	 * f_seqcount is in units of fixed-size blocks so that it
>	 * depends mainly on the amount of sequential I/O and not
>	 * much on the number of sequential I/O's.  The fixed size
>	 * of 16384 is hard-coded here since it is (not quite) just
>	 * a magic size that works well here.  This size is more
>	 * closely related to the best I/O size for real disks than
>	 * to any block size used by software.
>	 */
>	fp->f_seqcount += howmany(uio->uio_resid, 16384);

Probably this doesn't matter.  The above code in vfs_vnops.c is mostly
by me.  I think it is newer than the code in nfs_serv.c (strictly,
older, but nfs_serv.c has not caught up with it).  I played a bit more
with this in nfs_serv.c, to see if this should be different in nfs.  In
my local version, lblocksize can be set by a sysctl.  But I only used
this sysctl for testing, and don't remember it making any interesting
differences.

>> % +	if (uio->uio_offset == nh->nh_nextoff) {
>> % +		nh->nh_seqcount += nblocks;
>> % +		if (nh->nh_seqcount > IO_SEQMAX)
>> % +			nh->nh_seqcount = IO_SEQMAX;
>> % +	} else if (uio->uio_offset == 0) {
>> % +		/* Seek to beginning of file, ignored. */
>> % +	} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
>> % +	    MAX_REORDERED_RPC * imax(lblocksize, uio->uio_resid)) {
>> % +		nfsrv_reordered_io++; /* Probably reordered RPC, do nothing. */
>
> Ah, this is a nice touch!  I had noticed reordered I/O's resetting my
> clustered I/O count.  I should try this extra step.

Stats after a few GB of i/o:

% vfs.nfsrv.commit_blks: 138037
% vfs.nfsrv.commit_miss: 2844
% vfs.nfsrv.reordered_io: 5170
% vfs.nfsrv.realign_test: 492003
% vfs.nfsrv.realign_count: 0

There were only a few reorderings.

In old testing, I seemed to get best results by turning the number of
nfsd's down to 1.  I don't use this in production.  I turn the number
of nfsiod's down to 4 in production.
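Putting the quoted pieces together (including the final else branch,
which is quoted just below), the update step of Bjorn's heuristic is
approximately this reconstruction (names as in the diff above, not the
verbatim code):

/* Reconstructed sketch of the heuristic update in Bjorn's patch. */
lblocksize = vp->v_mount->mnt_stat.f_iosize;
nblocks = howmany(uio->uio_resid, lblocksize);
if (uio->uio_offset == nh->nh_nextoff) {
	/* Strictly sequential: credit the amount of i/o, not the
	 * number of i/o's, and clamp. */
	nh->nh_seqcount += nblocks;
	if (nh->nh_seqcount > IO_SEQMAX)
		nh->nh_seqcount = IO_SEQMAX;
} else if (uio->uio_offset == 0) {
	/* Seek to beginning of file: ignore, don't penalize. */
} else if (qabs(uio->uio_offset - nh->nh_nextoff) <=
    MAX_REORDERED_RPC * imax(lblocksize, uio->uio_resid)) {
	/* Near miss: probably an RPC reordered in flight; leave the
	 * sequentiality estimate alone instead of resetting it. */
	nfsrv_reordered_io++;
} else
	nh->nh_seqcount /= 2;	/* Random access: decay gradually
				 * (vfs_vnops.c resets to 1 instead). */
nh->nh_nextoff = uio->uio_offset + uio->uio_resid;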
>> % +	} else
>> % +		nh->nh_seqcount /= 2; /* Not sequential access. */
>
> Hmm, this is a bit different as well.  sequential_heuristic() just
> drops all clustering (seqcount = 1) here so I had followed that.  I do
> wonder if this change would be good for "normal" I/O as well?  (Again,
> I think it would do well to have "normal" I/O and NFS generally use
> the same algorithm, but perhaps with the extra logic to handle reordered
> writes more gracefully for NFS.)

I don't know much about this.

>> % +
>> % +	nh->nh_nextoff = uio->uio_offset + uio->uio_resid;
>
> Interesting.  So this assumes the I/O never fails.

Not too good.  Some places like ffs_write() back out of failing i/o's,
but I think they reduce uio_offset before the corresponding code for the
non-nfs heuristic in vn_read/write() advances f_nextoff.

>> % @@ -1225,4 +1251,5 @@
>> %  	vn_finished_write(mntp);
>> %  	VFS_UNLOCK_GIANT(vfslocked);
>> % +	bwillwrite();	/* After VOP_WRITE to avoid reordering. */
>> %  	return(error);
>> % }
>
> Hmm, this seems to be related to avoiding overloading the NFS server's
> buffer cache?

Just to avoid spurious reordering, I think.  Is this all still Giant
locked?  Giant might either reduce or increase interference between
nfsd's, depending on the timing.

>> ...
>> % Index: nfs_syscalls.c
>> % ===================================================================
>> % RCS file: /home/ncvs/src/sys/nfsserver/Attic/nfs_syscalls.c,v
>> % retrieving revision 1.119
>> % diff -u -2 -r1.119 nfs_syscalls.c
>> % --- nfs_syscalls.c	30 Jun 2008 20:43:06 -0000	1.119
>> % +++ nfs_syscalls.c	2 Jul 2008 07:12:57 -0000
>> % @@ -86,5 +86,4 @@
>> %  int nfsd_waiting = 0;
>> %  int nfsrv_numnfsd = 0;
>> % -static int notstarted = 1;
>> %
>> %  static int nfs_privport = 0;
>> % @@ -448,7 +447,6 @@
>> %  			    procrastinate = nfsrvw_procrastinate;
>> %  			NFSD_UNLOCK();
>> % -			if (writes_todo || (!(nd->nd_flag & ND_NFSV3) &&
>> % -			    nd->nd_procnum == NFSPROC_WRITE &&
>> % -			    procrastinate > 0 && !notstarted))
>> % +			if (writes_todo || (nd->nd_procnum == NFSPROC_WRITE &&
>> % +			    procrastinate > 0))
>> %  				error = nfsrv_writegather(&nd, slp,
>> %  				    nfsd->nfsd_td, &mreq);
>
> This no longer seems to be present in 8.

nfs_syscalls.c seems to have been replaced by nfs_srvkrpc.c.  All
history has been lost (obscured), but the code is quite different, so a
repo-copy wouldn't have worked much better.  This created lots of
garbage, if not bugs:
- the nfsrv.gatherdelay and nfsrv.gatherdelay_v3 sysctls are now in
  nfs_srvkrpc.c.  They were already hard to associate with any effects,
  since their variable names don't match their sysctl names.  The
  variables are named nfsrv_procrastinate and nfsrv_procrastinate_v3.
- the *procrastinate* global variables are still declared in nfs.h and
  initialized to defaults in nfs_serv.c, but are no longer really used.
- the local variable `procrastinate' and the above code to use it no
  longer exist.
- the macro for the default for the non-v3 sysctl, NFS_GATHERDELAY, is
  still defined in nfs.h, but is only used in the dead initialization.
- the new nfs server doesn't have any gatherdelay or procrastinate
  symbols.

Bjorn said that gatherdelay_v3 didn't work, and tried to fix it.  The
above is the final result that I have.  I now remember trying this.
Bjorn hoped that a nonzero gatherdelay would reduce reordering, but in
practice it just reduces performance by waiting too long.  Its default
of 10 msec may have worked with 1 Mbps ethernet, but can't possibly
scale to 1 Gbps.
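Back-of-the-envelope: at 1 Mbps, 10 msec is about 1.25 kB of wire time,
roughly one full frame, so a gather window of that size could plausibly
catch the next write RPC.  At 1 Gbps, the same 10 msec is about 1.25 MB
of wire time, i.e., dozens to hundreds of full-size write RPCs, so the
delay costs far more than any gathering it enables.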
ISTR that the value had to be very small, perhaps 100 usec, for the
delay not to be too large, but when it is that small it has trouble
having any effect, except to waste CPU in a different way than the
delay does.

> One thing I had done was to use a separate set of heuristics for
> reading vs writing.  However, that is possibly dubious (and we don't
> do it for local I/O), so I can easily drop that feature if desired.

I think it is unlikely to make much difference.  The heuristic always
has to cover a very wide range of access patterns.

Bruce