From owner-freebsd-fs@FreeBSD.ORG  Thu Aug 25 17:47:47 2011
From: John Baldwin <jhb@freebsd.org>
To: Rick Macklem <rmacklem@freebsd.org>
Date: Thu, 25 Aug 2011 13:47:45 -0400
Message-Id: <201108251347.45460.jhb@freebsd.org>
Cc: fs@freebsd.org
Subject: Fixes to allow write clustering of NFS writes from a FreeBSD NFS
	client

I was doing some analysis of compiles over NFS at work recently and noticed
from 'iostat 1' on the NFS server that all my NFS writes were always 16k
writes (meaning that writes were never being clustered).  I added some
debugging sysctls to the NFS client and server code as well as the FFS write
VOP to figure out the various kinds of write requests that were being sent.  I
found that during the NFS compile, the NFS client was sending a lot of
FILESYNC writes even though nothing in the compile process uses fsync().
Based on the debugging I added, I found that all of the FILESYNC writes were
marked as such because the buffer in question did not have B_ASYNC set:


		if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE |
		    B_CLUSTER)) == B_ASYNC)
		    iomode = NFSV3WRITE_UNSTABLE;
		else
		    iomode = NFSV3WRITE_FILESYNC;

I eventually tracked this down to the code in the NFS client that pushes out a
previous dirty region via 'bwrite()' when a write would dirty a non-contiguous
region in the buffer:

		if (bp->b_dirtyend > 0 &&
		    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
			if (bwrite(bp) == EINTR) {
				error = EINTR;
				break;
			}
			goto again;
		}

(These writes are triggered during the compile of a file by the assembler
seeking back into the file it has already written out to apply various
fixups.)
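
To make the condition above easier to reason about, here is a minimal
standalone sketch of just the range test.  The struct dirty_range type and
the write_forces_flush() name are made up for illustration; only the
b_dirtyoff/b_dirtyend fields and the comparison itself come from the client
code quoted above.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the two struct buf fields the check uses;
 * the real struct buf has many more members.
 */
struct dirty_range {
	int b_dirtyoff;		/* start of the dirty region in the buffer */
	int b_dirtyend;		/* end of the dirty region; 0 means none */
};

/*
 * Return true if writing n bytes at offset 'on' within the buffer would
 * leave a gap between the new data and the existing dirty region.  That
 * is the case where the client pushes the old region out via bwrite().
 */
static bool
write_forces_flush(const struct dirty_range *bp, int on, int n)
{
	return (bp->b_dirtyend > 0 &&
	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff));
}
```

With a dirty region of [4096, 8192), a write starting at 8192 extends the
region and does not force a flush, while a 100-byte write at offset 0 (the
assembler seeking back to apply a fixup) does.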

From this I concluded that the test above is flawed.  We should be using
UNSTABLE writes for the writes above as the user has not requested them to
be synchronous.  The issue (I believe) is that the NFS client is overloading
the B_ASYNC flag.  The B_ASYNC flag means that the caller of bwrite()
(or rather bawrite()) is not synchronously blocking to see if the request
has completed.  Instead, it is a "fire and forget".  This is not the same
thing as the IO_SYNC flag passed in ioflags during a write request which
requests fsync()-like behavior.  To disambiguate the two I added a new
B_SYNC flag and changed the NFS clients to set this for write requests
with IO_SYNC set.  I then updated the condition above to instead check for
B_SYNC being set rather than checking for B_ASYNC being clear.
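
As a rough illustration of the difference (this is not the patch itself),
the old and new tests can be compared side by side in userland.  The flag
values, the enum, and the iomode_old()/iomode_new() names below are all
invented for the sketch, and the exact form of the new condition is my
assumption based on the description above; see the patch for the real code.

```c
/* Hypothetical flag values for illustration only; not the kernel's. */
#define	B_ASYNC		0x0001
#define	B_NEEDCOMMIT	0x0002
#define	B_NOCACHE	0x0004
#define	B_CLUSTER	0x0008
#define	B_SYNC		0x0010	/* the new flag: write came in with IO_SYNC */

/* Stand-ins for NFSV3WRITE_UNSTABLE and NFSV3WRITE_FILESYNC. */
enum iomode { IOMODE_UNSTABLE, IOMODE_FILESYNC };

/* Old test: only a plain B_ASYNC buffer is sent as UNSTABLE. */
static enum iomode
iomode_old(int b_flags)
{
	if ((b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) ==
	    B_ASYNC)
		return (IOMODE_UNSTABLE);
	return (IOMODE_FILESYNC);
}

/*
 * Assumed new test: B_ASYNC no longer matters; the write is UNSTABLE
 * unless B_SYNC (or one of the other special flags) is set.
 */
static enum iomode
iomode_new(int b_flags)
{
	if ((b_flags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == 0)
		return (IOMODE_UNSTABLE);
	return (IOMODE_FILESYNC);
}
```

A buffer pushed out with a plain bwrite() has neither B_ASYNC nor B_SYNC
set, so iomode_old(0) yields FILESYNC while iomode_new(0) yields UNSTABLE.
A write that really was issued with IO_SYNC carries B_SYNC and stays
FILESYNC under the new test.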

That converted all the FILESYNC write RPCs from my builds into UNSTABLE
write RPCs.  The patch for that is at
http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.

However, even with this change I was still not getting clustered writes on
the NFS server (all writes were still 16k).  After digging around in the
code for a bit I found that ffs will only cluster writes if the passed in
'ioflags' to ffs_write() specify a sequential hint.  I then noticed that
the NFS server has code to keep track of sequential I/O heuristics for
reads, but not writes.  I took the code from the NFS server's read op
and moved it into a function to compute a sequential I/O heuristic that
could be shared by both reads and writes.  I also updated the sequential
heuristic code to advance the counter based on the number of 16k blocks
in each write instead of just doing ++ to match what we do for local
file writes in sequential_heuristic() in vfs_vnops.c.  Using this did
give me some measure of NFS write clustering (though I can't peg my
disks at MAXPHYS the way a dd to a file on a local filesystem can).  The
patch for these changes is at
http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch

(This also fixes a bug in the new NFS server in that it wasn't actually
clustering reads since it never updated nh->nh_nextr.)
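
For anyone who wants the gist without reading the patch, the shared
heuristic is roughly of the following shape.  This is a simplified sketch
rather than the code in the patch: struct seq_state, seq_heuristic(), the
SEQ_MAX cap, and the reset-to-zero on a non-sequential request are all
assumptions made for illustration.

```c
#include <stdint.h>

#define	SEQ_BLOCK	16384	/* the 16k block size mentioned above */
#define	SEQ_MAX		127	/* hypothetical cap on the counter */

/* Hypothetical per-file state kept between requests. */
struct seq_state {
	uint64_t next_off;	/* where a sequential request would start */
	int	 seqcount;	/* current sequential score */
};

/*
 * Update the state for a request of 'len' bytes at 'off' and return the
 * new sequential score.  A sequential request advances the score by the
 * number of 16k blocks it covers (rounded up) instead of by one.
 */
static int
seq_heuristic(struct seq_state *st, uint64_t off, uint64_t len)
{
	uint64_t blocks;

	if (off == st->next_off) {
		blocks = (len + SEQ_BLOCK - 1) / SEQ_BLOCK;
		if (blocks > SEQ_MAX)
			blocks = SEQ_MAX;
		st->seqcount += (int)blocks;
		if (st->seqcount > SEQ_MAX)
			st->seqcount = SEQ_MAX;
	} else {
		/* Non-sequential request: start over. */
		st->seqcount = 0;
	}
	/* Always record the next expected offset. */
	st->next_off = off + len;
	return (st->seqcount);
}
```

Starting from a zeroed state, a 32k request at offset 0 followed by a 16k
request at offset 32768 yields scores of 2 and then 3, and a later request
back at offset 0 resets the score to 0.  The unconditional update of
next_off plays the role that updating nh->nh_nextr does in the server.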

Combining the two changes together gave me about a 1.3% reduction in wall
time for my builds:

+------------------------------------------------------------------------------+
|+                   +     ++    + +x++*x  xx+x    x                          x|
|                 |___________A__|_M_______|_A____________|                    |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       1869.62       1943.11       1881.89       1886.12     21.549724
+  10       1809.71       1886.53       1869.26      1860.706     21.530664
Difference at 95.0% confidence
        -25.414 +/- 20.2391
        -1.34742% +/- 1.07305%
        (Student's t, pooled s = 21.5402)

One caveat: I tested both of these patches on the old NFS client and server
on 8.2-stable.  I then ported the changes to the new client and server and
while I made sure they compiled, I have not tested the new client and server.

-- 
John Baldwin