Date: Sat, 27 Aug 2011 21:21:24 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: Rick Macklem <rmacklem@freebsd.org>, fs@freebsd.org
Subject: Re: Fixes to allow write clustering of NFS writes from a FreeBSD NFS client
Message-ID: <20110827194709.E1286@besplex.bde.org>
In-Reply-To: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca>
References: <354219004.370807.1314320258153.JavaMail.root@erie.cs.uoguelph.ca>
On Thu, 25 Aug 2011, Rick Macklem wrote:

> John Baldwin wrote:
>> ...
>> That converted all the FILESYNC write RPCs from my builds into UNSTABLE
>> write RPCs.  The patch for that is at
>> http://www.FreeBSD.org/~jhb/patches/nfsclient_sync_writes.patch.
>>
>> However, even with this change I was still not getting clustered writes
>> on the NFS server (all writes were still 16k).  After digging around in
>> the code for a bit I found that ffs will only cluster writes if the
>> passed-in 'ioflags' to ffs_write() specify a sequential hint.  I then
>> noticed that the NFS server has code to keep track of sequential I/O
>> heuristics for reads, but not writes.  I took the code from the NFS
>> server's read op and moved it into a function to compute a sequential
>> I/O heuristic that could be shared by both reads and writes.  I also
>> updated the sequential heuristic code to advance the counter based on
>> the number of 16k blocks in each write instead of just doing ++, to
>> match what we do for local file writes in sequential_heuristic() in
>> vfs_vnops.c.  Using this did give me some measure of NFS write
>> clustering (though I can't peg my disks at MAXPHYS the way a dd to a
>> file on a local filesystem can).  The patch for these changes is at
>> http://www.FreeBSD.org/~jhb/patches/nfsserv_cluster_writes.patch
>>
> The above says you understand this stuff and I don't. However, I will note

I only know much about this part (I once actually understood it).

> that the asynchronous case, which starts the write RPC now, makes clustering
> difficult and limits what you can do. (I think it was done in the bad old

Async as opposed to delayed is bad, but it is mostly avoided anyway, at
least at the ffs and vfs levels on the server.  This was a major
optimization by dyson about 15 years ago.  I don't understand the
sync/async/delayed writes on the client at the nfs level.  At least the
old nfs client doesn't even call bawrite(), but it might do the
equivalent using a flag.

On the server, nfs doesn't use any of bwrite()/bawrite()/bdwrite().  It
just uses VOP_WRITE(), which does whatever the server file system does.
Most file systems in FreeBSD use cluster_write() in most cases.  This is
from 4.4BSD-Lite.  It replaces an unconditional bawrite() in Net/2 in
the most usual case, where the write is of exactly 1 fs-block (usually
starting with a larger write that is split up into fs-blocks and a
possible sub-block at the beginning and end only).  cluster_write() also
has major optimizations by dyson.  In the usual case it turns into
bdwrite(), to give a chance for a full cluster to accumulate, and in
most cases there would be little difference in the effects if the
callers were simplified to call bdwrite() directly.  (The difference is
just that with cluster_write(), a write will occur as soon as a cluster
forms, while with bdwrite() a write will not occur until the next sync
unless the buffer cache is very dirty; a toy model of this difference is
sketched below.)  bawrite() used to be used instead of bdwrite() mainly
to reduce pressure on the buffer cache.  It was thought that the end of
a block was a good time to start writing.  That was when 16 buffers
containing 4K each was a lot of data :-).

The next and last major optimization in this area was to improve
VOP_FSYNC() to handle a large number of delayed writes better.  It was
changed to use vfs_bio_awrite() where in 4.4BSD it used bawrite().
vfs_bio_awrite() is closer to the implementation and has a better
understanding of clustering than bawrite().
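Here is a minimal user-space toy model of that cluster_write() vs.
bdwrite() difference.  It is not the kernel code; toy_bdwrite(),
toy_cluster_write() and the 8-block cluster size are assumptions made
purely for the illustration.

/*
 * Toy model: bdwrite() only marks the block dirty and leaves the I/O
 * to the syncer, while cluster_write() additionally starts a write as
 * soon as a full cluster of contiguous dirty blocks has formed.
 */
#include <stdbool.h>
#include <stdio.h>

#define FS_BLOCK        16384   /* one fs-block, as in the 16k writes above */
#define CLUSTER_BLOCKS  8       /* pretend a cluster is 8 fs-blocks */

static bool dirty[1024];        /* dirty map, indexed by block number */

static void
toy_bdwrite(int bn)
{
        dirty[bn] = true;       /* delayed write: no I/O yet */
}

static void
toy_cluster_write(int bn)
{
        toy_bdwrite(bn);

        /* If the cluster containing bn is now fully dirty, write it out now. */
        int start = bn - bn % CLUSTER_BLOCKS;
        for (int i = start; i < start + CLUSTER_BLOCKS; i++)
                if (!dirty[i])
                        return;
        printf("issue one %d-byte write for blocks %d-%d\n",
            CLUSTER_BLOCKS * FS_BLOCK, start, start + CLUSTER_BLOCKS - 1);
        for (int i = start; i < start + CLUSTER_BLOCKS; i++)
                dirty[i] = false;
}

int
main(void)
{
        /* Sequential 1-fs-block writes, as the server sees from nfsd. */
        for (int bn = 0; bn < 16; bn++)
                toy_cluster_write(bn);
        return (0);
}

With sequential 16k writes this issues one large write per 8 blocks;
replacing toy_cluster_write() with plain toy_bdwrite() in main() would
leave everything for the syncer, which is the difference described
above.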
I forget why bawrite() wasn't just replaced by the internals of
vfs_bio_awrite().

Sync writes from nfs and O_SYNC from userland tend to defeat all of the
bawrite()/bdwrite() optimizations, by forcing a bwrite().  nfs defaults
to sync writes, so all it can do to use the optimizations is to do very
large sync writes which are split up into smaller delayed ones in a way
that doesn't interfere with clustering.  I don't understand the details
of what it does.

> days to avoid flooding the buffer cache and then having things pushing
> writes back to get buffers. These days the buffer cache can be much
> bigger and it's easy to create kernel threads to do write backs at
> appropriate times. As such, I'd lean away from asynchronous (as in start
> the write now) and towards delayed writes.

On FreeBSD servers, this is mostly handled already by mostly using
cluster_write().  Buffer cache pressure is still difficult to handle,
though.  I saw it having bad effects mainly in my silly benchmark for
this nfs server clustering optimization, of writing 1GB.  The buffer
cache would fill up with dirty buffers which take too long to write
(1000-2000 dirty ones out of 8000; 2000 of size 16K each is 32MB, and
these take 0.5-1 seconds to write).  While they were being written, the
nfs client has to stop sending (it shouldn't stop until the buffer cache
is completely full, but it does).  Any stoppage gives under-utilization
of the network, and my network has just enough bandwidth to keep up with
the disk.  Stopping for a short time wouldn't be bad, but for some
reason it didn't restart soon enough to keep the writes streaming.

I didn't see this when I repeated the benchmark yesterday.  I must have
done some tuning to reduce the problem, but forget what it was.  I would
start looking for it near the buf_dirty_count_severe() test in
ffs_write() (a rough sketch of that fallback follows below).  This
defeats clustering and may be too aggressive or mistuned.  What I don't
like about this is that when severe buffer cache pressure develops,
using bawrite() instead of cluster_write() tends to increase the
pressure, by writing new dirty buffers at half the speed.

I never saw any problems from the buffer cache pressure with local disks
(except for writing to DVDs, where writes often stall near getblk() for
several seconds).

> If the writes are delayed "bdwrite()" then I think it is much easier
> to find contiguous dirty buffers to do as one write RPC. However, if
> you just do bdwrite()s, there tends to be big bursts of write RPCs when
> the syncer does its thing, unless kernel threads are working through the
> cache doing write backs.

It might not matter a lot (except on large-latency links) what the
client does.  MTUs of only 1500 are still too common, so there is a lot
of reassembly of blocks at the network level.  A bit more at the RPC and
(both client and server) block level won't matter provided you don't
synchronize after every piece.

Hmm, those bursts on the client aren't so good, and may explain why the
client stalled in my tests.

At least the old nfs client never uses either cluster_write() or
vfs_bio_awrite() (or bawrite()).  I don't understand why, but if it uses
bdwrite() when it should use cluster_write() then it won't have the
advantage of cluster_write() over bdwrite() -- of writing as soon as a
cluster forms.  It does use B_CLUSTEROK.  I think this mainly causes
clustering to work when all the delayed-write buffers are written
eventually.  Now I don't see much point in using either delayed writes
or clustering on the client.
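A rough sketch of that pressure fallback, under the assumption that it
behaves roughly like "if too many buffers are dirty, write immediately
instead of delaying/clustering".  This is not the real ffs_write() or
buf_dirty_count_severe(); the names and the threshold below only echo
the real counters and are illustrative.

/*
 * Toy model of the fallback: under severe dirty-buffer pressure a
 * write path starts the write now (bawrite()-style), otherwise it
 * delays and lets a cluster form (cluster_write()-style).
 */
#include <stdbool.h>
#include <stdio.h>

static int toy_numdirtybuffers;         /* dirty delayed-write buffers in the cache */
static int toy_hidirtybuffers = 2000;   /* pretend "severe" starts here */

static bool
toy_dirty_count_severe(void)
{
        return (toy_numdirtybuffers >= toy_hidirtybuffers);
}

/* Decide what to do with one freshly dirtied block of a sequential write. */
static const char *
toy_write_strategy(void)
{
        if (toy_dirty_count_severe())
                return ("bawrite()-style: start the write now, no clustering");
        return ("cluster_write()-style: delay, write when a cluster forms");
}

int
main(void)
{
        toy_numdirtybuffers = 100;
        printf("light pressure : %s\n", toy_write_strategy());
        toy_numdirtybuffers = 2500;
        printf("severe pressure: %s\n", toy_write_strategy());
        return (0);
}

The point of the sketch is only to show why the fallback defeats
clustering exactly when the cache most needs writes to go out quickly.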
Clustering is needed for non-solid-state disks mainly because their seek
time is so large.  Larger blocks are only good for their secondary
effects of reducing overheads and latency.

> Since there are nfsiod threads, maybe these could scan for contiguous
> dirty buffers and start big write RPCs for them? If there was some time
> limit set for how long the buffer sits dirty before it gets a write
> started for it, that would avoid a burst caused by the syncer.

One of my tunings was to reduce the number of nfsiod's.

> Also, if you are lucky w.r.t. doing delayed writes for temporary files,
> the file gets deleted before the write-back.

In ffs, this is another optimization by dyson.  Isn't it defeated by
sync writes from ffs?  Is it possible for a file written on the client
to never reach the server?  Even if the data doesn't, I think the
directory and inode creation should.

Even for ffs mounted async, I think there are writes of some metadata
for deleted files, because although the data blocks are dead, some
metadata blocks like the ones for inodes are shared with other files and
must have been dirtied by create followed by delete, so they remain
undead but are considered dirty although their accumulated changes
should be null.  The writes are just often coalesced by the delay, so
instead of 1000 writes to the same place for an inode that is created
and deleted 500 times, you get just 1 write for null changes at the end.
My version of ffs_update() has some optimizations to avoid writing null
changes, but I think this doesn't help here since it still sees the
changes in-core as they occur.
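A toy model of that coalescing, assuming only that repeated delayed
writes to the same buffer cost one physical write at sync time.  All
names here are invented; this is not ffs code.

/*
 * 500 create/delete pairs dirty the same inode block 1000 times, but
 * the delayed write is only issued once when the syncer runs.
 */
#include <stdbool.h>
#include <stdio.h>

static bool inode_block_dirty;
static int  physical_writes;

static void
toy_dirty_inode_block(void)
{
        inode_block_dirty = true;       /* bdwrite()-style: no I/O yet */
}

static void
toy_sync(void)
{
        if (inode_block_dirty) {
                physical_writes++;      /* one write covers all accumulated changes */
                inode_block_dirty = false;
        }
}

int
main(void)
{
        for (int i = 0; i < 500; i++) {
                toy_dirty_inode_block();        /* create dirties the inode block */
                toy_dirty_inode_block();        /* delete dirties it again */
        }
        toy_sync();
        printf("1000 inode updates -> %d physical write(s)\n", physical_writes);
        return (0);
}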
Bruce
