Date: Thu, 25 Aug 2011 20:57:38 -0400 (EDT)
From: Rick Macklem <rmacklem@uoguelph.ca>
To: John Baldwin
Cc: Rick Macklem, fs@freebsd.org
In-Reply-To: <201108251347.45460.jhb@freebsd.org>
Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client

John Baldwin wrote:
> I was doing some analysis of compiles over NFS at work recently and
> noticed from 'iostat 1' on the NFS server that all my NFS writes were
> always 16k writes (meaning that writes were never being clustered). I
> added some debugging sysctls to the NFS client and server code, as
> well as the FFS write VOP, to figure out the various kinds of write
> requests that were being sent. I found that during the compile, the
> NFS client was sending a lot of FILESYNC writes even though nothing in
> the compile process uses fsync(). Based on the debugging I added, I
> found that all of the FILESYNC writes were marked as such because the
> buffer in question did not have B_ASYNC set:
>
>     if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE |
>         B_CLUSTER)) == B_ASYNC)
>             iomode = NFSV3WRITE_UNSTABLE;
>     else
>             iomode = NFSV3WRITE_FILESYNC;
>
> I eventually tracked this down to the code in the NFS client that
> pushes out a previous dirty region via 'bwrite()' when a write would
> dirty a non-contiguous region in the buffer:
>
>     if (bp->b_dirtyend > 0 &&
>         (on > bp->b_dirtyend || (on + n) < bp->b_dirtyoff)) {
>             if (bwrite(bp) == EINTR) {
>                     error = EINTR;
>                     break;
>             }
>             goto again;
>     }

Well, I'm not sure this bwrite() is required in the current buffer
cache implementation. I'm far from understanding the FreeBSD NFS buffer
cache code (I just cloned it for the new NFS client), but it seems to
pre-read the block in before writing it out, unless the write is of a
full block. If that is the case, doing multiple non-contiguous writes
to the buffer should be fine, I think?
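To illustrate the pattern I mean, here is a rough sketch (the helper
names are invented for illustration; this is not the actual nfs_bio.c
code):

    /*
     * Sketch only: if a partial-block write pre-reads the block
     * first, every byte of b_data is valid afterwards, so any number
     * of non-contiguous modifications can be made safely before the
     * block is pushed back.
     */
    if (n < biosize && !buffer_is_valid(bp))    /* invented helper */
            read_block_from_server(vp, bp);     /* pre-read */
    bcopy(user_data, bp->b_data + on, n);       /* dirty a region */
    bdwrite(bp);                                /* delayed write-back */

Once the pre-read has happened, the b_dirtyoff/b_dirtyend bookkeeping
isn't needed for correctness; the whole block can simply be written.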
Way back in the late 1980s, I implemented NFS buffer cache code that
worked this way for writing:
- dirty a region of the buffer
- if another write contiguous with this region occurs, grow the dirty
  region
- else if a non-contiguous write occurs, write the dirty region out
  synchronously, then record the new dirty region
- else if a read within the dirty region occurs, copy the data out of
  the buffer
- else if a read outside the dirty region occurs:
  - write the dirty region out
  - read in the entire block
  - copy the data from the block

Somewhere along the way, I believe this was converted to the Sun style
(which is more like a local file system), where:
- if a write is of a partial block:
  - read the block in
  - modify the region of the block
  - mark the block dirty and start an asynchronous write of the block

With this model, all the data in the block is up to date, so it doesn't
really matter how many non-contiguous regions you modify, as long as
you write the entire block after the modifications. (For simplicity, I
haven't worried about EOF. Either model has to keep track of the file's
size, "np->n_size", and deal with where EOF is.)

I have a hunch that the above code snippet is left over from my old
model. (Which worked quite well for the old Portable C Compiler and
linker of the 4BSD VAX era.) I don't think any write is necessary at
that place any more. (I also don't think it should need to keep track
of b_dirtyoff and b_dirtyend.)

Another thing that would be nice is to allow reads from the block while
it is being written back. (Writes either need to be blocked or handled
so that the block is still known to be dirty and needs to be written
again, but reading the data during the write-back should be fine.) I
don't know for sure, but I don't think that can happen now, given the
way the buffers are locked during writing.

> (These writes are triggered during the compile of a file by the
> assembler seeking back into the file it has already written out to
> apply various fixups.)
>
> From this I concluded that the test above is flawed. We should be
> using UNSTABLE writes for the writes above, as the user has not
> requested them to be synchronous.

It was synchronous (in the sense that it would wait until the write RPC
completed) in the old model, since the new dirty region replaced the
old one. (As noted above, I don't think a bwrite() of any kind is
needed there now.)

> The issue (I believe) is that the NFS client is overloading the
> B_ASYNC flag. The B_ASYNC flag means that the caller of bwrite() (or
> rather bawrite()) is not synchronously blocking to see if the request
> has completed. Instead, it is "fire and forget". This is not the same
> thing as the IO_SYNC flag passed in ioflags during a write request,
> which requests fsync()-like behavior. To disambiguate the two, I
> added a new B_SYNC flag and changed the NFS clients to set this for
> write requests with IO_SYNC set. I then updated the condition above
> to instead check for B_SYNC being set rather than checking for
> B_ASYNC being clear.

Yes, in the bad old days B_ASYNC meant "start the write RPC but don't
wait for it to complete". I think what you have said above is correct,
in that setting FILESYNC should be based on the IO_SYNC flag only.

> That converted all the FILESYNC write RPCs from my builds into
> UNSTABLE write RPCs. The patch for that is at
> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch
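Reading that description, the repaired test would presumably end up
looking something like the following (this is a guess at the shape of
the change, not a quote from John's patch; B_SYNC is the new flag
described above):

    /*
     * FILESYNC only when the application asked for synchronous
     * semantics (IO_SYNC mapped to B_SYNC), not merely because
     * B_ASYNC happened to be clear.
     */
    if ((bp->b_flags & (B_SYNC | B_NEEDCOMMIT | B_NOCACHE |
        B_CLUSTER)) == 0)
            iomode = NFSV3WRITE_UNSTABLE;
    else
            iomode = NFSV3WRITE_FILESYNC;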
> However, even with this change I was still not getting clustered
> writes on the NFS server (all writes were still 16k). After digging
> around in the code for a bit, I found that ffs will only cluster
> writes if the passed-in 'ioflags' to ffs_write() specify a sequential
> hint. I then noticed that the NFS server has code to keep track of
> sequential I/O heuristics for reads, but not writes. I took the code
> from the NFS server's read op and moved it into a function to compute
> a sequential I/O heuristic that could be shared by both reads and
> writes. I also updated the sequential heuristic code to advance the
> counter based on the number of 16k blocks in each write instead of
> just doing ++, to match what we do for local file writes in
> sequential_heuristic() in vfs_vnops.c. Using this did give me some
> measure of NFS write clustering (though I can't peg my disks at
> MAXPHYS the way a dd to a file on a local filesystem can). The patch
> for these changes is at
> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch

The above says you understand this stuff and I don't. However, I will
note that the asynchronous case, which starts the write RPC now, makes
clustering difficult and limits what you can do. (I think it was done
in the bad old days to avoid flooding the buffer cache and then having
things push writes back to get buffers. These days the buffer cache
can be much bigger, and it's easy to create kernel threads to do
write-backs at appropriate times.) As such, I'd lean away from
asynchronous (as in "start the write now") and towards delayed writes.

If the writes are delayed, "bdwrite()", then I think it is much easier
to find contiguous dirty buffers to do as one write RPC. However, if
you just do bdwrite()s, there tend to be big bursts of write RPCs when
the syncer does its thing, unless kernel threads are working through
the cache doing write-backs. Since there are nfsiod threads, maybe
these could scan for contiguous dirty buffers and start big write RPCs
for them? (A rough sketch of what I'm picturing is at the end of this
message.) If there were some time limit on how long a buffer sits
dirty before a write gets started for it, that would avoid a burst
caused by the syncer. Also, if you are lucky w.r.t. doing delayed
writes for temporary files, the file gets deleted before the
write-back.

> (This also fixes a bug in the new NFS server in that it wasn't
> actually clustering reads, since it never updated nh->nh_nextr.)

Thanks, I have no idea what this is...

> Combining the two changes together gave me about a 1% reduction in
> wall time for my builds:
>
> +------------------------------------------------------------------------------+
> |+  +  ++    +            +x++*x  xx+x                 x                      x|
> |      |___________A__|_M_______|_A____________|                               |
> +------------------------------------------------------------------------------+
>     N        Min        Max     Median        Avg      Stddev
> x  10    1869.62    1943.11    1881.89    1886.12   21.549724
> +  10    1809.71    1886.53    1869.26   1860.706   21.530664
> Difference at 95.0% confidence
>         -25.414 +/- 20.2391
>         -1.34742% +/- 1.07305%
>         (Student's t, pooled s = 21.5402)
>
> One caveat: I tested both of these patches on the old NFS client and
> server on 8.2-stable. I then ported the changes to the new client and
> server, and while I made sure they compiled, I have not tested the
> new client and server.
>
> --
> John Baldwin

Good stuff, John. Sounds like you're having fun with it. I would like
to see it clustering using delayed writes and then doing the biggest
write RPCs the server will allow. (Solaris 10 allows 1 Mbyte write
RPCs. At this point, FreeBSD is limited to MAXBSIZE, which is only
64K, but hopefully that can be increased?)
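Here is the sort of scan I'm picturing for the nfsiod threads (purely
a sketch; every function name below is invented and none of this is
existing code):

    /*
     * Walk a vnode's dirty buffers, coalescing each run of
     * contiguous logical blocks into one large write RPC instead of
     * one RPC per 16k buffer.
     */
    static void
    nfsiod_flush_contig(struct vnode *vp, int maxrpcblocks)
    {
            daddr_t lbn;
            int nblks;

            lbn = first_dirty_lbn(vp);          /* invented helper */
            while (lbn != -1) {
                    /* length of the contiguous dirty run at lbn */
                    nblks = count_contig_dirty(vp, lbn, maxrpcblocks);
                    /* one write RPC covering nblks buffers */
                    start_big_write_rpc(vp, lbn, nblks);
                    lbn = first_dirty_lbn_after(vp, lbn + nblks);
            }
    }

The same scan could enforce the time limit mentioned above, so that a
buffer never sits dirty for too long before a write is started for it.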
rick