From owner-freebsd-fs@FreeBSD.ORG  Fri Aug 26 17:43:40 2011
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: fs@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id AE55B1065674;
	Fri, 26 Aug 2011 17:43:40 +0000 (UTC)
	(envelope-from rmacklem@uoguelph.ca)
Received: from esa-annu.mail.uoguelph.ca (esa-annu.mail.uoguelph.ca
	[131.104.91.36])
	by mx1.freebsd.org (Postfix) with ESMTP id 3C5538FC13;
	Fri, 26 Aug 2011 17:43:39 +0000 (UTC)
X-IronPort-Anti-Spam-Filtered: true
X-IronPort-Anti-Spam-Result: Ap4EABjaV06DaFvO/2dsb2JhbABDhEykUIFAAQYjBFIbDgwCDRkCWQaxApFrgSyED4ERBJMakRw
X-IronPort-AV: E=Sophos;i="4.68,286,1312171200"; d="scan'208";a="132366829"
Received: from erie.cs.uoguelph.ca (HELO zcs3.mail.uoguelph.ca)
	([131.104.91.206])
	by esa-annu-pri.mail.uoguelph.ca with ESMTP; 26 Aug 2011 13:43:39 -0400
Received: from zcs3.mail.uoguelph.ca (localhost.localdomain [127.0.0.1])
	by zcs3.mail.uoguelph.ca (Postfix) with ESMTP id 5B581B3F0F;
	Fri, 26 Aug 2011 13:43:39 -0400 (EDT)
Date: Fri, 26 Aug 2011 13:43:39 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin <jhb@freebsd.org>
Message-ID: <1882200362.409964.1314380619360.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-Originating-IP: [172.17.91.201]
X-Mailer: Zimbra 6.0.10_GA_2692 (ZimbraWebClient - FF3.0 (Win)/6.0.10_GA_2692)
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD
 NFS client
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 26 Aug 2011 17:43:40 -0000

Correcting myself yet again:
> I eventually tracked this down to the code in the NFS client that
> pushes out a previous dirty region via 'bwrite()' when a write would
> dirty a non-contiguous region in the buffer:
> 
> 	if (bp->b_dirtyend > 0 &&
> 	    (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
> 		if (bwrite(bp) == EINTR) {
> 			error = EINTR;
> 			break;
> 		}
> 		goto again;
> 	}
> 
  Btw, the code was correct to use FILESYNC for this case.
  Why? Well, if the b_dirtyoff, b_dirtyend are used by the "bottom half"
  for the write/commit RPCs, the client won't know to re-write/commit
  the range specified by b_dirtyoff/b_dirtyend after the range changes.
  (ie. If the server crashes/reboots between the UNSTABLE write and the
   commit, the change will get lost.)

  However, if you calculate the off, len arguments for the Commit RPC
  to cover the entire block and not just b_dirtyoff->b_dirtyend, then
  doing the write UNSTABLE should be fine. (Having the range larger than
  what was written should be ok. In fact the FreeBSD server ignores
  the arguments and commits the entire file via VOP_FSYNC().)

I realize I was wrong w.r.t. this.
If the server crashes and reboots between the write RPCs and the Commit RPC,
the client will only know the last byte range to re-write.
For this to work correctly for UNSTABLE writes, a list of dirty byte ranges
must be maintained and the client must do write RPCs for all of them (and do
them again, if the server crashes before the commit).
Btw, there is code in the NFSv4 stuff that handles a list of byte ranges.
It does so for the byte range locking, but you could just rename
struct nfscllock to something without 'lock' in it and then reuse
nfscl_updatelock() to handle the list. (It might need a few tweaks for
the non-lock case, but shouldn't need much.)
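
To make the "list of dirty byte ranges" idea concrete, here is a rough
userland sketch. It is not the nfscl_updatelock() code and not anything
from the tree: struct dirty_range, dirty_add(), dirty_count() and
dirty_clear() are all made-up names for illustration only. It keeps a
sorted list of half-open ranges [off, end) and merges ranges that overlap
or touch, so each remaining entry is one range that would have to be
written (and re-written, if the server reboots before the commit).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical type: one dirty byte range [off, end) that is not yet committed. */
struct dirty_range {
	long long off;
	long long end;
	struct dirty_range *next;
};

/*
 * Record that [off, end) was written. The list stays sorted by offset and
 * ranges that overlap or touch are merged into one entry.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
int
dirty_add(struct dirty_range **head, long long off, long long end)
{
	struct dirty_range **pp, *r, *nr;

	assert(off < end);
	/* Allocate first, so a failure leaves the existing list untouched. */
	nr = malloc(sizeof(*nr));
	if (nr == NULL)
		return (-1);
	/* Skip ranges that end before the new one starts (with a gap). */
	pp = head;
	while (*pp != NULL && (*pp)->end < off)
		pp = &(*pp)->next;
	/* Absorb every range that overlaps or touches [off, end). */
	while ((r = *pp) != NULL && r->off <= end) {
		if (r->off < off)
			off = r->off;
		if (r->end > end)
			end = r->end;
		*pp = r->next;
		free(r);
	}
	nr->off = off;
	nr->end = end;
	nr->next = *pp;
	*pp = nr;
	return (0);
}

/* Number of separate ranges, i.e. how many would each need to be written. */
int
dirty_count(const struct dirty_range *head)
{
	int n = 0;

	for (; head != NULL; head = head->next)
		n++;
	return (n);
}

/* Forget all ranges, e.g. once a Commit has covered them successfully. */
void
dirty_clear(struct dirty_range **head)
{
	struct dirty_range *r;

	while ((r = *head) != NULL) {
		*head = r->next;
		free(r);
	}
}
```

With something like this, the non-contiguous write case would add a second
entry instead of forcing a bwrite(), and the list would only be cleared
after a successful commit. If the server reboots before that, every entry
still on the list gets written again.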

Hopefully I have finally got this correct and have not totally confused
everyone, rick