From: John Baldwin
To: Rick Macklem
Cc: fs@freebsd.org
Date: Thu, 25 Aug 2011 13:47:45 -0400
Message-Id: <201108251347.45460.jhb@freebsd.org>
Subject: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

I was doing some analysis of compiles over NFS at work recently and noticed
from 'iostat 1' on the NFS server that all my NFS writes were always 16k
writes (meaning that writes were never being clustered).  I added some
debugging sysctls to the NFS client and server code as well as the FFS write
VOP to figure out the various kinds of write requests that were being sent.

I found that during the NFS compile, the NFS client was sending a lot of
FILESYNC writes even though nothing in the compile process uses fsync().
Based on the debugging I added, I found that all of the FILESYNC writes were
marked as such because the buffer in question did not have B_ASYNC set:

	if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
		iomode = NFSV3WRITE_UNSTABLE;
	else
		iomode = NFSV3WRITE_FILESYNC;

I eventually tracked this down to the code in the NFS client that pushes out
a previous dirty region via 'bwrite()' when a write would dirty a
non-contiguous region in the buffer:

	if (bp->b_dirtyend > 0 &&
	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
		if (bwrite(bp) == EINTR) {
			error = EINTR;
			break;
		}
		goto again;
	}

(These writes are triggered during the compile of a file by the assembler
seeking back into the file it has already written out to apply various
fixups.)

From this I concluded that the test above is flawed.  We should be using
UNSTABLE writes for the writes above as the user has not requested them to
be synchronous.  The issue (I believe) is that the NFS client is overloading
the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite() (or
rather bawrite()) is not synchronously blocking to see if the request has
completed.  Instead, it is a "fire and forget".  This is not the same thing
as the IO_SYNC flag passed in ioflags during a write request which requests
fsync()-like behavior.  To disambiguate the two I added a new B_SYNC flag
and changed the NFS clients to set this for write requests with IO_SYNC set.
I then updated the condition above to instead check for B_SYNC being set
rather than checking for B_ASYNC being clear.

That converted all the FILESYNC write RPCs from my builds into UNSTABLE
write RPCs.  The patch for that is at

http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.

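To make the difference between the two tests concrete, here is a rough
userland sketch (not the patch itself).  The flag values are made up for
illustration and do not match <sys/buf.h>, the enum and function names are
hypothetical, and iomode_new() is only my assumption of how a "B_SYNC is
set" test could be written while keeping the other three flags in the mask:

```c
#include <assert.h>

/*
 * Hypothetical flag values, for illustration only.  The real B_*
 * constants live in <sys/buf.h>; B_SYNC is the new flag proposed above.
 */
enum {
	B_ASYNC      = 0x01,
	B_NEEDCOMMIT = 0x02,
	B_NOCACHE    = 0x04,
	B_CLUSTER    = 0x08,
	B_SYNC       = 0x10
};

enum iomode { UNSTABLE, FILESYNC };

/* Existing test: UNSTABLE only when B_ASYNC is the sole masked flag set. */
enum iomode
iomode_old(int flags)
{
	if ((flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC)
		return (UNSTABLE);
	return (FILESYNC);
}

/*
 * Assumed form of the revised test: FILESYNC when B_SYNC (or any of the
 * other three flags) is set, regardless of B_ASYNC.
 */
enum iomode
iomode_new(int flags)
{
	if ((flags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == 0)
		return (UNSTABLE);
	return (FILESYNC);
}

/* A plain bwrite() of a dirty region has neither B_ASYNC nor B_SYNC set. */
void
iomode_selfcheck(void)
{
	assert(iomode_old(0) == FILESYNC);	/* old test forces FILESYNC */
	assert(iomode_new(0) == UNSTABLE);	/* new test allows UNSTABLE */
	assert(iomode_new(B_SYNC) == FILESYNC);	/* IO_SYNC writes stay sync */
}
```

The only case where the two functions disagree for a buffer without B_SYNC
is the one from the bwrite() path above: no B_ASYNC, no IO_SYNC request.
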
However, even with this change I was still not getting clustered writes on
the NFS server (all writes were still 16k).  After digging around in the
code for a bit I found that ffs will only cluster writes if the passed in
'ioflags' to ffs_write() specify a sequential hint.  I then noticed that
the NFS server has code to keep track of sequential I/O heuristics for
reads, but not writes.  I took the code from the NFS server's read op and
moved it into a function to compute a sequential I/O heuristic that could be
shared by both reads and writes.  I also updated the sequential heuristic
code to advance the counter based on the number of 16k blocks in each write
instead of just doing ++ to match what we do for local file writes in
sequential_heuristic() in vfs_vnops.c.  Using this did give me some measure
of NFS write clustering (though I can't peg my disks at MAXPHYS the way a dd
to a file on a local filesystem can).  The patch for these changes is at

http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch

(This also fixes a bug in the new NFS server in that it wasn't actually
clustering reads since it never updated nh->nh_nextr.)

Combining the two changes together gave me about a 1% reduction in wall
time for my builds:

[ministat ASCII distribution plot: two data sets, marked 'x' and '+']

    N           Min           Max        Median           Avg        Stddev
x  10       1869.62       1943.11       1881.89       1886.12     21.549724
+  10       1809.71       1886.53       1869.26      1860.706     21.530664
Difference at 95.0% confidence
	-25.414 +/- 20.2391
	-1.34742% +/- 1.07305%
	(Student's t, pooled s = 21.5402)

One caveat: I tested both of these patches on the old NFS client and server
on 8.2-stable.  I then ported the changes to the new client and server and
while I made sure they compiled, I have not tested the new client and server.

-- 
John Baldwin