From: Rick Macklem
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
Date: Fri, 26 Aug 2011 13:43:39 -0400 (EDT)
Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client
Message-ID: <1882200362.409964.1314380619360.JavaMail.root@erie.cs.uoguelph.ca>
In-Reply-To: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca>

Correcting myself yet again:

> I eventually tracked this down to the code in the NFS client that
> pushes out a previous dirty region via 'bwrite()' when a write would
> dirty a non-contiguous region in the buffer:
>
>     if (bp->b_dirtyend > 0 &&
>         (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
>             if (bwrite(bp) == EINTR) {
>                     error = EINTR;
>                     break;
>             }
>             goto again;
>     }
>

Btw, the code was correct to use FILESYNC for this case. Why? Well, if
b_dirtyoff and b_dirtyend are used by the "bottom half" for the
write/commit RPCs, the client won't know to re-write/commit the range
specified by b_dirtyoff/b_dirtyend after the range changes. (i.e. If the
server crashes/reboots between the UNSTABLE write and the commit, the
change will get lost.)

However, if you calculate the off, len arguments for the Commit RPC to
cover the entire block and not just b_dirtyoff->b_dirtyend, then doing
the write UNSTABLE should be fine. (Having the range larger than what
was written should be ok. In fact, the FreeBSD server ignores the
arguments and commits the entire file via VOP_FSYNC().)

I realize I was wrong w.r.t. this. If the server crashes and reboots
between the write RPCs and the Commit RPC, the client will only know the
last byte range to re-write. For this to work correctly for UNSTABLE
writes, a list of dirty byte ranges must be maintained and the client
must do write RPCs for all of them (and do them again, if the server
crashes before the commit).

Btw, there is code in the NFSv4 stuff that handles a list of byte
ranges. It does so for the byte range locking, but you could just rename
struct nfscllock to something without 'lock' in it and then reuse
nfscl_updatelock() to handle the list. (It might need a few tweaks for
the non-lock case, but shouldn't need much.)

Hopefully I have finally got this correct and have not totally confused
everyone, rick
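
As a rough illustration of the "list of dirty byte ranges" idea above,
here is a minimal userland sketch in C. It is not the existing
struct nfscllock / nfscl_updatelock() code and does not reuse it. All
names in it (struct dirty_range, dirty_range_add(), dirty_range_count(),
dirty_range_clear()) are hypothetical. It assumes half-open ranges
[off, end), kept sorted and merged whenever they overlap or touch, and it
leaves out locking and the RPC code entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type: one dirty byte range [dr_off, dr_end) of a buffer. */
struct dirty_range {
    uint64_t dr_off;              /* first dirty byte */
    uint64_t dr_end;              /* one past the last dirty byte */
    struct dirty_range *dr_next;  /* next range, sorted by dr_off */
};

/*
 * Record that [off, end) is dirty, merging with any ranges it overlaps
 * or touches. Returns 0 on success, -1 if memory cannot be allocated
 * (in which case the list is left unchanged).
 */
static int
dirty_range_add(struct dirty_range **headp, uint64_t off, uint64_t end)
{
    struct dirty_range **pp = headp, *dr, *nr;

    assert(off < end);
    nr = malloc(sizeof(*nr));
    if (nr == NULL)
        return (-1);

    /* Skip ranges that end strictly before the new range starts. */
    while ((dr = *pp) != NULL && dr->dr_end < off)
        pp = &dr->dr_next;

    /* Absorb every range that overlaps or abuts the new range. */
    while ((dr = *pp) != NULL && dr->dr_off <= end) {
        if (dr->dr_off < off)
            off = dr->dr_off;
        if (dr->dr_end > end)
            end = dr->dr_end;
        *pp = dr->dr_next;
        free(dr);
    }

    /* Link the (possibly widened) range in at its sorted position. */
    nr->dr_off = off;
    nr->dr_end = end;
    nr->dr_next = *pp;
    *pp = nr;
    return (0);
}

/* Number of separate ranges that would each need a write RPC. */
static size_t
dirty_range_count(const struct dirty_range *head)
{
    size_t n = 0;

    for (; head != NULL; head = head->dr_next)
        n++;
    return (n);
}

/* Forget all ranges, e.g. once a commit is known to have succeeded. */
static void
dirty_range_clear(struct dirty_range **headp)
{
    struct dirty_range *dr;

    while ((dr = *headp) != NULL) {
        *headp = dr->dr_next;
        free(dr);
    }
}
```

In this sketch the list would be kept after the UNSTABLE write RPCs are
done and only cleared once the commit succeeds. If the server reboots
before the commit, the client would walk the list and redo the write for
every range, not just the last one.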