Date: Tue, 30 Jul 2013 00:59:01 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Ali Niknam <ali@transip.nl>
Cc: freebsd-fs@FreeBSD.org, rmacklem@FreeBSD.org
Subject: Re: nfsclient: incorrect st_blksize (bug?)
Message-ID: <20130729235447.S1849@besplex.bde.org>
In-Reply-To: <51F66217.5080505@transip.nl>
References: <51F66217.5080505@transip.nl>

On Mon, 29 Jul 2013, Ali Niknam wrote:

> I've come across a problem that has proven to be unsolvable for me so far. It
> might be a bug in the NFS Client code, it could also be my general lack of
> knowledge :). Can someone please give me a hint in the right direction?
>
> This is the case:
>
> mount_nfs -o rsize=32768 -o wsize=32768 -o nfsv4 -o tcp host:/path /mnt/nfs
>
> stat /mnt/nfs gives st_blksize of 4096 bytes.
> statfs /mnt/nfs gives an iosize of 4096 bytes.
>
> Mounting with nfsv3 gives the same results, regardless of udp or tcp
> protocol. NFSv2 however seems to give a st_blksize of 128k, with an iosize of
> 8192 bytes.
>
> In short: it seems that with BSD 9.1 the rsize/wsize's aren't passed along
> correctly. I tried to debug it by looking in the kernel code but I got lost
> unfortunately in the abstraction layers (everything seems to set
> NFS_FABLKSIZE).
>
> Mounting the same host on a linux machine gives the correct st_blksize (32k).
>
> The disadvantage is of course that apache/etc adhere to the 4k st_blksize by
> only reading 4k chunks so that nfs io slows down substantially.

nfs still seems to ask for a block size of NFS_FABLKSIZE = 512. Old
versions of FreeBSD honored the leaf file system's idea of the best block
size and gave this 512. After many intermediate broken versions, vn_stat()
now has a hack that makes it use PAGE_SIZE iff the leaf file system
prefers a smaller size, so 512 becomes 4096 on x86.

4096 is not as bad as 512, but still too small for most purposes. OTOH,
512 works quite well for nfs over local networks with low latency. 512
fits in a 1500-byte packet but 4096 doesn't, so latency can be better with
small block sizes, and lower latency also gives higher throughput provided
everything can keep up with the small blocks.

A workaround might be to use statfs() instead of stat(). st_blksize can
vary within a file system in theory, but usually doesn't, and can't be
trusted anyway.
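
A minimal sketch of that workaround, under some assumptions: the helper
name preferred_io_size() is made up for illustration, the code has only
FreeBSD and Linux in mind, and on non-FreeBSD systems it falls back to
f_bsize because struct statfs there has no f_iosize field. It prefers
f_iosize (the optimal transfer size field in FreeBSD's struct statfs) and
only falls back to st_blksize if statfs() fails. Whether the value it
returns is actually a good buffer size still depends on what the file
system reports.

```c
#include <sys/types.h>
#include <sys/stat.h>
#if defined(__linux__)
#include <sys/vfs.h>            /* statfs() on Linux */
#else
#include <sys/param.h>
#include <sys/mount.h>          /* statfs() on FreeBSD */
#endif

/*
 * Hypothetical helper: suggest an I/O buffer size for the file system
 * holding `path`.  Returns the size in bytes, or -1 if neither statfs()
 * nor stat() gives a usable answer.
 */
long
preferred_io_size(const char *path)
{
	struct statfs sfs;
	struct stat sb;

	if (statfs(path, &sfs) == 0) {
#if defined(__FreeBSD__)
		/* f_iosize: optimal transfer size as set by the leaf fs. */
		if (sfs.f_iosize > 0)
			return ((long)sfs.f_iosize);
#else
		/* Assumption: no f_iosize here, so use f_bsize instead. */
		if (sfs.f_bsize > 0)
			return ((long)sfs.f_bsize);
#endif
	}

	/* Fall back to st_blksize, which upper layers may have adjusted. */
	if (stat(path, &sb) == 0 && sb.st_blksize > 0)
		return ((long)sb.st_blksize);

	return (-1);
}
```

A caller would do something like preferred_io_size("/mnt/nfs") once per
mount point and size its read buffer from the result, keeping its own
default when -1 comes back.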

struct statfs has fields f_bsize ("fragment" size) and f_iosize (optimal
transfer size). These seem to be set better by leaf file systems, and are
certainly never frobbed by upper layers (except to translate to old
statfs()). nfs still seems to set f_bsize to NFS_FABLKSIZE, but it sets
f_iosize to its i/o size. ffs sets f_bsize to its fragment size (not so
good: statfs() can't even represent ffs's 2 types of block size. Neither
can stat(), but st_blksize is initialized with the other one, so
unportable code can determine both). ffs sets f_iosize to a disk-specific
size. There are many bugs in the setting of the latter too, and it now
almost always reduces to a hard-coded setting of MAXPHYS that has nothing
to do with disks' preferred sizes.

Hard-coding of MAXPHYS everywhere would be OK for throughput but not so
good for latency. To optimize for latency, there seems to be nothing
better than using statfs()'s f_bsize, but we know that that reduces to a
hard-coded 512 for nfs and to the not-necessarily best fragment size for
ffs.

Bruce