From owner-freebsd-fs@FreeBSD.ORG Thu Aug 25 20:45:50 2011
Date: Thu, 25 Aug 2011 20:45:49 +0000
From: John <jwd@FreeBSD.org>
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client
Message-ID: <20110825204549.GB61776@FreeBSD.org>
In-Reply-To: <201108251347.45460.jhb@freebsd.org>
List-Id: Filesystems

Hi John,

This is an interesting fix. If I can, I'll try patching a few systems
and giving it a try.

I don't know if this would help for timing comparisons, but years ago
we used to run build work directly against our NFS storage. In general,
we moved away from that to a two-stage approach:

    cc foo.c -o /tmp/foo.o    # where /tmp is a memory filesystem
    cp /tmp/foo.o /nfs/mounted/target/area/foo.o

This provided a very large performance boost.
It's worth noting that different compilers require different levels of
arm-wrestling to convince them to use the file specified with -o
correctly (and directly).

With a simple .mk file change you could probably get an up-to-date
comparison of the current system vs. your patch vs. sequential I/O only.
I'll let you know what I find and if we see any regressions.

Thanks,
John

----- John Baldwin's Original Message -----
> I was doing some analysis of compiles over NFS at work recently and
> noticed from 'iostat 1' on the NFS server that all my NFS writes were
> always 16k writes (meaning that writes were never being clustered). I
> added some debugging sysctls to the NFS client and server code as well
> as the FFS write VOP to figure out the various kinds of write requests
> that were being sent. I found that during the NFS compile, the NFS
> client was sending a lot of FILESYNC writes even though nothing in the
> compile process uses fsync(). Based on the debugging I added, I found
> that all of the FILESYNC writes were marked as such because the buffer
> in question did not have B_ASYNC set:
>
> 	if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE |
> 	    B_CLUSTER)) == B_ASYNC)
> 		iomode = NFSV3WRITE_UNSTABLE;
> 	else
> 		iomode = NFSV3WRITE_FILESYNC;
>
> I eventually tracked this down to the code in the NFS client that
> pushes out a previous dirty region via 'bwrite()' when a write would
> dirty a non-contiguous region in the buffer:
>
> 	if (bp->b_dirtyend > 0 &&
> 	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 		if (bwrite(bp) == EINTR) {
> 			error = EINTR;
> 			break;
> 		}
> 		goto again;
> 	}
>
> (These writes are triggered during the compile of a file by the
> assembler seeking back into the file it has already written out to
> apply various fixups.)
>
> From this I concluded that the test above is flawed. We should be
> using UNSTABLE writes for the writes above, as the user has not
> requested them to be synchronous.
> The issue (I believe) is that the NFS client is overloading the
> B_ASYNC flag. The B_ASYNC flag means that the caller of bwrite()
> (or rather bawrite()) is not synchronously blocking to see if the
> request has completed. Instead, it is a "fire and forget". This is
> not the same thing as the IO_SYNC flag passed in ioflags during a
> write request, which requests fsync()-like behavior. To disambiguate
> the two I added a new B_SYNC flag and changed the NFS clients to set
> it for write requests with IO_SYNC set. I then updated the condition
> above to check for B_SYNC being set rather than for B_ASYNC being
> clear.
>
> That converted all the FILESYNC write RPCs from my builds into
> UNSTABLE write RPCs. The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch
>
> However, even with this change I was still not getting clustered
> writes on the NFS server (all writes were still 16k). After digging
> around in the code for a bit I found that ffs will only cluster
> writes if the 'ioflags' passed to ffs_write() specify a sequential
> hint. I then noticed that the NFS server has code to keep track of a
> sequential I/O heuristic for reads, but not for writes. I took the
> code from the NFS server's read op and moved it into a function to
> compute a sequential I/O heuristic that could be shared by both reads
> and writes. I also updated the heuristic to advance the counter by
> the number of 16k blocks in each write instead of just incrementing
> it, to match what we do for local file writes in
> sequential_heuristic() in vfs_vnops.c. Using this did give me some
> measure of NFS write clustering (though I can't peg my disks at
> MAXPHYS the way a dd to a file on a local filesystem can).
> The patch for these changes is at
> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
>
> (This also fixes a bug in the new NFS server in that it wasn't
> actually clustering reads, since it never updated nh->nh_nextr.)
>
> Combining the two changes together gave me about a 1% reduction in
> wall time for my builds:
>
> +------------------------------------------------------------------------------+
> |+ +  ++  +                +x++*x  xx+x                                  x    x|
> |  |___________A__|_M_______|_A____________|                                   |
> +------------------------------------------------------------------------------+
>     N        Min        Max     Median        Avg     Stddev
> x  10    1869.62    1943.11    1881.89    1886.12  21.549724
> +  10    1809.71    1886.53    1869.26   1860.706  21.530664
> Difference at 95.0% confidence
>         -25.414 +/- 20.2391
>         -1.34742% +/- 1.07305%
>         (Student's t, pooled s = 21.5402)
>
> One caveat: I tested both of these patches on the old NFS client and
> server on 8.2-stable. I then ported the changes to the new client and
> server, and while I made sure they compiled, I have not tested the
> new client and server.
>
> --
> John Baldwin