Date:      Tue, 26 Nov 2013 18:41:13 -0500 (EST)
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        FreeBSD FS <freebsd-fs@freebsd.org>
Cc:        Kostik Belousov <kib@freebsd.org>
Subject:   RFC: NFS client patch to reduce synchronous writes
Message-ID:  <731168702.21452440.1385509273449.JavaMail.root@uoguelph.ca>
In-Reply-To: <1139579526.21452374.1385509250511.JavaMail.root@uoguelph.ca>

[-- Attachment #1 --]
Hi,

The current NFS client does a synchronous write
to the server when a non-contiguous write to the
same buffer cache block occurs. This is done because
there is a single dirty byte range recorded in the
buf structure. This results in a lot of synchronous
writes for software builds (I believe it is the loader
that loves to write small non-contiguous chunks to
its output file). Some users disable synchronous
writing on the server to improve performance, but
this puts them at risk of data loss when the server
crashes.
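
For reference, the check in the client's write path that forces
the synchronous write today looks roughly like this (simplified
from the context visible in the attached patch; "on" is the
offset of the new write within the buffer and "n" its length):

  /*
   * The new write does not touch the buffer's single dirty byte
   * range (b_dirtyoff..b_dirtyend), so push the old dirty data
   * to the server synchronously before recording the new range.
   */
  if (bp->b_dirtyend > 0 &&
      (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
          if (bwrite(bp) == EINTR) {
                  error = EINTR;
                  break;
          }
          goto again;
  }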

Long ago jhb@ emailed me a small patch that avoided
the synchronous writes by simply making the dirty byte
range a superset of the bytes written. The risk with
doing this is for a rare (possibly non-existent)
application that writes non-overlapping byte ranges
to the same file from multiple clients concurrently:
the superset written back to the server may include
stale cached data between the ranges, so some of the
other clients' writes could be lost.
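
To make the risk concrete: the merge simply grows the single
recorded range to a superset covering both writes, roughly:

  /*
   * Expand the buffer's single dirty range to cover both the
   * previously dirty bytes and the new write.
   */
  bp->b_dirtyoff = min(on, bp->b_dirtyoff);
  bp->b_dirtyend = max(on + n, bp->b_dirtyend);
  /*
   * The gap between the two writes is now inside the dirty
   * range even though this client never wrote it, so the cached
   * (possibly stale) copy of that gap is written back and can
   * clobber bytes another client wrote there in the meantime.
   */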

I created a patch that maintained a list of dirty byte
ranges. It was complicated and I found that the list
often had to be > 100 entries to avoid the synchronous
writes.

So, I think jhb@'s approach is preferable, although I've
added a couple of tweaks:
- The synchronous writes (the old/current algorithm) are
  still used if file locking has been done on the file.
  (I think any app. that writes a file from multiple clients
   will/should use file locking.)
- The synchronous writes (the old/current algorithm) are
  also used if a sysctl (vfs.nfs.old_noncontig_writing) is
  set. This avoids breakage for any app. (if there is one)
  that writes a file from multiple clients without doing
  file locking. (The check the patch uses is condensed
  just below.)
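
Condensed from the attached patch, the per-write decision is:

  /*
   * Merge non-contiguous writes only if no file lock has ever
   * been applied to this file and vfs.nfs.old_noncontig_writing
   * is not set; otherwise keep the old synchronous behaviour.
   */
  if ((np->n_flag & NHASBEENLOCKED) == 0 &&
      ncl_oldnoncontigwriting == 0)
          noncontig_write = 1;
  else
          noncontig_write = 0;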

In testing on my very slow single core hardware, I see about
a 10% improvement in kernel build times, along with far fewer
I/O RPCs:
             Read RPCs  Write RPCs
old/current  50K        122K
patched      39K         40K
--> it reduced the Read RPC count by about 20% and cut the
    Write RPC count to 1/3rd.
I think jhb@ saw pretty good performance results with his patch.

Anyhow, the patch is attached and can also be found here:
  http://people.freebsd.org/~rmacklem/noncontig-write.patch

I'd like to invite folks to comment/review/test this patch,
since I think it is ready for head/current.

Thanks, rick
ps: Kostik, maybe you could look at it. In particular, I am
    wondering if I zeroed out the buffer the correct way, via
    vfs_bio_bzero_buf()?

[-- Attachment #2 --]
--- fs/nfsclient/nfs_clbio.c.orig	2013-08-28 18:45:41.000000000 -0400
+++ fs/nfsclient/nfs_clbio.c	2013-11-25 21:42:16.000000000 -0500
@@ -72,6 +72,12 @@ extern int nfs_keep_dirty_on_error;
 
 int ncl_pbuf_freecnt = -1;	/* start out unlimited */
 
+SYSCTL_DECL(_vfs_nfs);
+
+static int	ncl_oldnoncontigwriting = 0;
+SYSCTL_INT(_vfs_nfs, OID_AUTO, old_noncontig_writing, CTLFLAG_RW,
+	   &ncl_oldnoncontigwriting, 0, "NFS use old noncontig writing alg");
+
 static struct buf *nfs_getcacheblk(struct vnode *vp, daddr_t bn, int size,
     struct thread *td);
 static int nfs_directio_write(struct vnode *vp, struct uio *uiop,
@@ -874,7 +880,7 @@ ncl_write(struct vop_write_args *ap)
 	struct vattr vattr;
 	struct nfsmount *nmp = VFSTONFS(vp->v_mount);
 	daddr_t lbn;
-	int bcount;
+	int bcount, noncontig_write, obcount;
 	int bp_cached, n, on, error = 0, error1;
 	size_t orig_resid, local_resid;
 	off_t orig_size, tmp_off;
@@ -1037,7 +1043,15 @@ again:
 		 * unaligned buffer size.
 		 */
 		mtx_lock(&np->n_mtx);
-		if (uio->uio_offset == np->n_size && n) {
+		if ((np->n_flag & NHASBEENLOCKED) == 0 &&
+		    ncl_oldnoncontigwriting == 0)
+			noncontig_write = 1;
+		else
+			noncontig_write = 0;
+		if ((uio->uio_offset == np->n_size ||
+		    (noncontig_write != 0 &&
+		    lbn == (np->n_size / biosize) &&
+		    uio->uio_offset + n > np->n_size)) && n) {
 			mtx_unlock(&np->n_mtx);
 			/*
 			 * Get the buffer (in its pre-append state to maintain
@@ -1045,8 +1059,8 @@ again:
 			 * nfsnode after we have locked the buffer to prevent
 			 * readers from reading garbage.
 			 */
-			bcount = on;
-			bp = nfs_getcacheblk(vp, lbn, bcount, td);
+			obcount = np->n_size - (lbn * biosize);
+			bp = nfs_getcacheblk(vp, lbn, obcount, td);
 
 			if (bp != NULL) {
 				long save;
@@ -1058,9 +1072,12 @@ again:
 				mtx_unlock(&np->n_mtx);
 
 				save = bp->b_flags & B_CACHE;
-				bcount += n;
+				bcount = on + n;
 				allocbuf(bp, bcount);
 				bp->b_flags |= save;
+				if (noncontig_write != 0 && bcount > obcount)
+					vfs_bio_bzero_buf(bp, obcount, bcount -
+					    obcount);
 			}
 		} else {
 			/*
@@ -1159,19 +1176,23 @@ again:
 		 * area, just update the b_dirtyoff and b_dirtyend,
 		 * otherwise force a write rpc of the old dirty area.
 		 *
+		 * If there has been a file lock applied to this file
+		 * or vfs.nfs.old_noncontig_writing is set, do the following:
 		 * While it is possible to merge discontiguous writes due to
 		 * our having a B_CACHE buffer ( and thus valid read data
 		 * for the hole), we don't because it could lead to
 		 * significant cache coherency problems with multiple clients,
 		 * especially if locking is implemented later on.
 		 *
-		 * As an optimization we could theoretically maintain
-		 * a linked list of discontinuous areas, but we would still
-		 * have to commit them separately so there isn't much
-		 * advantage to it except perhaps a bit of asynchronization.
+		 * If vfs.nfs.old_noncontig_writing is not set and there has
+		 * not been file locking done on this file:
+		 * Relax coherency a bit for the sake of performance and
+		 * expand the current dirty region to contain the new
+		 * write even if it means we mark some non-dirty data as
+		 * dirty.
 		 */
 
-		if (bp->b_dirtyend > 0 &&
+		if (noncontig_write == 0 && bp->b_dirtyend > 0 &&
 		    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
 			if (bwrite(bp) == EINTR) {
 				error = EINTR;
--- fs/nfsclient/nfsnode.h.orig	2013-11-19 18:17:37.000000000 -0500
+++ fs/nfsclient/nfsnode.h	2013-11-25 21:29:58.000000000 -0500
@@ -157,6 +157,7 @@ struct nfsnode {
 #define	NLOCKWANT	0x00010000  /* Want the sleep lock */
 #define	NNOLAYOUT	0x00020000  /* Can't get a layout for this file */
 #define	NWRITEOPENED	0x00040000  /* Has been opened for writing */
+#define	NHASBEENLOCKED	0x00080000  /* Has been file locked. */
 
 /*
  * Convert between nfsnode pointers and vnode pointers
--- fs/nfsclient/nfs_clvnops.c.orig	2013-11-19 18:19:42.000000000 -0500
+++ fs/nfsclient/nfs_clvnops.c	2013-11-25 21:32:47.000000000 -0500
@@ -3079,6 +3079,10 @@ nfs_advlock(struct vop_advlock_args *ap)
 					np->n_change = va.va_filerev;
 				}
 			}
+			/* Mark that a file lock has been acquired. */
+			mtx_lock(&np->n_mtx);
+			np->n_flag |= NHASBEENLOCKED;
+			mtx_unlock(&np->n_mtx);
 		}
 		NFSVOPUNLOCK(vp, 0);
 		return (0);
@@ -3098,6 +3102,12 @@ nfs_advlock(struct vop_advlock_args *ap)
 				error = ENOLCK;
 			}
 		}
+		if (error == 0 && ap->a_op == F_SETLK) {
+			/* Mark that a file lock has been acquired. */
+			mtx_lock(&np->n_mtx);
+			np->n_flag |= NHASBEENLOCKED;
+			mtx_unlock(&np->n_mtx);
+		}
 	}
 	return (error);
 }