Date: Wed, 14 Apr 2010 14:40:45 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Rick Macklem <rmacklem@uoguelph.ca>
Cc: arch@FreeBSD.org, Andriy Gapon <avg@FreeBSD.org>
Subject: Re: (in)appropriate uses for MAXBSIZE
Message-ID: <20100414135230.U12587@delplex.bde.org>
In-Reply-To: <Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
References: <4BBEE2DD.3090409@freebsd.org> <Pine.GSO.4.63.1004090941200.14439@muncher.cs.uoguelph.ca> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org> <Pine.GSO.4.63.1004110946400.27203@muncher.cs.uoguelph.ca>
On Sun, 11 Apr 2010, Rick Macklem wrote:

> On Sun, 11 Apr 2010, Bruce Evans wrote:
>
>> Er, the maximum size of buffers in the buffer cache is especially
>> irrelevant for nfs.  It is almost irrelevant for physical disks because
>> clustering normally increases the bulk transfer size to MAXPHYS.
>> Clustering takes a lot of CPU but doesn't affect the transfer rate much
>> unless there is not enough CPU.  It is even less relevant for network
>> i/o since there is a sort of reverse-clustering -- the buffers get split
>> up into tiny packets (normally 1500 bytes less some header bytes) at
>> the hardware level. ...
>
> I've done a simple experiment on Mac OS X 10, where I tried different
> sizes for the read and write RPCs plus different amounts of
> read-ahead/write-behind and found the I/O rate increased linearly,
> up to the max allowed by Mac OS X (MAXBSIZE == 128K) without
> read-ahead/write-behind.  Using read-ahead/write-behind the performance
> didn't increase at all, until the RPC read/write size was reduced.
> (Solaris 10 is using 256K by default and allowing up to 1MB for
> read/write RPC size now, so they seem to think that large values work
> well?)
>
> When you start using a WAN environment, large read/write RPCs really
> help, from what I've seen, since that helps fill the TCP pipe
> (bits * latency between client<->server).
>
> I care much more about WAN performance than LAN performance w.r.t. this.

Indeed, I was only considering a LAN environment.  Especially on LANs
optimized for latency (50-100 us), nfs performance is poor for small
files, at least for the old nfs client, mainly because close-to-open
consistency defeats caching; it is not a problem for bulk transfers.

> I am not sure what you were referring to w.r.t. clustering, but if you
> meant that the NFS client can easily do an RPC with a larger I/O size
> than the size of the buffer handed it by the buffer cache, I'd like to
> hear how that's done?  (If not, then a bigger buffer from the buffer
> cache is what I need to do a larger I/O size in the RPC.)

Clustering is currently only done by the local file system, at least for
the old nfs server.  nfs just does a VOP_READ() into its own buffer, with
ioflag set to indicate nfs's idea of sequentialness.  (User reads are
similar, except that their uio destination is UIO_USERSPACE instead of
UIO_SYSSPACE and their sequentialness is set generically and thus not so
well (but the nfs setting isn't very good either).)  The local file
system then normally does a clustered read into a larger buffer, with the
sequentialness affecting mainly startup (per-file), and virtually copies
the results to the local file system's smaller buffers.  VOP_READ()
completes by physically copying the results to nfs's buffer (using
bcopy() for UIO_SYSSPACE and copyout() for UIO_USERSPACE).

nfs can't easily get at the larger clustering buffers or even the local
file system's buffers.  It can more easily benefit from a larger
MAXBSIZE.  There is still the bcopy() to take a lot of CPU and memory bus
resources, but that is insignificant compared with WAN latency.

But as I said in a related thread, even the current MAXBSIZE is too large
to use routinely, due to buffer cache fragmentation causing significant
latency problems, so any increase in MAXBSIZE and/or routine use of
buffers of that size needs to be accompanied by avoiding the
fragmentation.  Note that the fragmentation is avoided for the larger
clustering buffers by allocating them from a different pool.

Bruce
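
(To put a rough number on the WAN point above: "bits * latency" is just
the bandwidth-delay product.  The following userland sketch uses made-up
link speed and RTT figures, purely for illustration, to show how many
read/write RPCs have to be kept in flight -- i.e. how much
read-ahead/write-behind is needed -- to fill the pipe for a few
rsize/wsize values.)

#include <stdio.h>

/*
 * Back-of-the-envelope bandwidth-delay product: how many read/write
 * RPCs must be outstanding to keep a WAN pipe full.  The link speed
 * and RTT below are example numbers, not measurements.
 */
int
main(void)
{
	double bits_per_sec = 100e6;	/* hypothetical 100 Mbit/s link */
	double rtt_sec = 0.050;		/* hypothetical 50 ms round trip */
	double pipe_bytes = bits_per_sec / 8 * rtt_sec;	/* ~625 kB */
	int rpc_sizes[] = { 8192, 32768, 131072, 1048576 };

	for (int i = 0; i < 4; i++)
		printf("rsize %7d -> about %.1f RPCs in flight to fill the pipe\n",
		    rpc_sizes[i], pipe_bytes / rpc_sizes[i]);
	return (0);
}

With those example numbers, a 32K rsize needs roughly 20 outstanding
RPCs, while a 1M rsize needs less than one, which is why either larger
RPCs or more read-ahead/write-behind fills a long pipe.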
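
(And to make the VOP_READ() path above concrete, here is a hedged sketch,
not the actual nfs server code: a kernel consumer like nfs builds a uio
over its own buffer and packs its sequentialness hint into ioflag.  The
function name and its parameters are placeholders; locking and error
handling are omitted.)

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>
#include <sys/uio.h>
#include <sys/vnode.h>

static int
read_into_kernel_buf(struct vnode *vp, void *buf, size_t len, off_t off,
    int seqcount, struct ucred *cred, struct thread *td)
{
	struct uio auio;
	struct iovec aiov;
	int ioflag;

	aiov.iov_base = buf;
	aiov.iov_len = len;
	auio.uio_iov = &aiov;
	auio.uio_iovcnt = 1;
	auio.uio_offset = off;
	auio.uio_resid = len;
	/*
	 * UIO_SYSSPACE: the destination is a kernel buffer, so the
	 * file system's final uiomove() reduces to bcopy(); a user
	 * read would use UIO_USERSPACE and end up in copyout().
	 */
	auio.uio_segflg = UIO_SYSSPACE;
	auio.uio_rw = UIO_READ;
	auio.uio_td = td;

	/*
	 * The caller's sequentialness hint, used by the local file
	 * system mainly to drive clustered read-ahead.
	 */
	if (seqcount > IO_SEQMAX)
		seqcount = IO_SEQMAX;
	ioflag = IO_NODELOCKED | (seqcount << IO_SEQSHIFT);

	/* The caller is assumed to hold the vnode lock (IO_NODELOCKED). */
	return (VOP_READ(vp, &auio, ioflag, cred));
}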