From: Thierry Herbelot
To: Bruce Evans
Cc: freebsd-fs@freebsd.org, hackers@freebsd.org
Date: Sat, 25 Oct 2008 16:39:17 +0200
Subject: Re: question about sb->st_blksize in src/sys/kern/vfs_vnops.c

On Saturday 25 October 2008, Bruce Evans wrote:
> On Fri, 24 Oct 2008, Thierry Herbelot wrote:
> > the [SUBJ] file contains the following extract (around line 705) :
> >
> >          * Default to PAGE_SIZE after much discussion.
> >          * XXX: min(PAGE_SIZE, vp->v_bufobj.bo_bsize) may be more correct.
> >          */
> >
> >         sb->st_blksize = PAGE_SIZE;
> >
> > which arrived around four years ago, with revision 1.211 (see
> > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/vfs_vnops.c.diff?r1=1.210;r2=1.211;f=h)
>
> Indeed, this was completely broken long ago (in 1.211).  Before then, and
> after 1.128, some cases worked as intended if not perfectly:
> - regular files: file systems still set va_blksize to their idea of the
>   best i/o size (normally to the file system block size, which is
>   normally larger than PAGE_SIZE and probably better in all cases) and
>   this was used here.  However, for regular files, the fs block size
>   and the application's i/o size are almost irrelevant in most cases
>   due to vfs clustering.  Most large i/o's are done physically with
>   the cluster size (which due to a related bug suite ends up being
>   hard-coded to MAXPHYS (128K) at a minor cost when this is different
>   from the best size).
> - disk files: non-broken device drivers set si_iosize_best to their idea
>   of the best i/o size (normally to the max i/o size, which is normally
>   better than PAGE_SIZE) and this was used here.  The bogus default
>   of BLKDEV_IOSIZE was used for broken drivers (this is bogus because it
>   was for the buffer cache implementation for block devices which no
>   longer exist and was too small for them anyway).
> - non-disk character-special files: the default of PAGE_SIZE was used.
>   The comment about defaulting to PAGE_SIZE was added in 1.128 and is
>   mainly for this case.
>   Now the comment is nonsense since the value is fixed, not a default.
> - other file types (fifos, pipes, sockets, ...): these got the default of
>   PAGE_SIZE too.
>
> In rev.1.1, st_blksize was set to va_blksize in all cases.  So file systems
> were supposed to set va_blksize reasonably in all cases, but this is not
> easy and they did nothing good except for regular files.

Agreed. In any case, the comment by phk about using ioctl(DIOCGSECTORSIZE)
applies.

>
> Versions between 1.2 and 1.127 did weird things like defaulting to DFLTPHYS
> (64K) for most cdevs but using a small size like BLKDEV_IOSIZE (2K) for
> disks.  This gave nonsense like 64K buffers for slow tty devices (keyboards)
> and 2K buffers for fast disks.  At least for programs that trust st_blksize
> to be reasonable.  Fortunately, st_blksize is rarely used...
>
> > the net effect of this change is to decrease the block buffer size used
> > in libc/stdio from 16 kbytes (derived from the underlying ufs partition)
> > to PAGE_SIZE == 4 kbytes (fixed value), and consequently the I/O bandwidth
> > is lowered (this is on a slow Flash).
>
> ... except it is used by stdio.  (Another mess here is that stdio mostly
> doesn't use its own BUFSIZ.  It trusts st_blksize if fstat() to determine

This is indeed what I saw while meandering between libc and the vfs part of
the kernel. In fact, I was mainly wondering whether st_blksize is used
*elsewhere*, and whether bumping the value could break some memory
allocation ...

> st_blksize works.  Of course, the existence of BUFSIZ is a related
> historical mistake -- no fixed size can work best for all cases.  But
> when BUFSIZ is used, it is an even worse default than PAGE_SIZE.)

(as it is even smaller ?)

>
> It's interesting that you can see the difference.  Clustering is especially
> good for hiding slowness on slow devices.  Maybe you are using a
> configuration that makes clustering ineffective.
> Mounting the file system
> with -o sync or equivalently, doing a sync after every (too-small) write
> would do it.  Otherwise, writes are normally delayed until the next cluster
> boundary.

My use case is small (buffered) writes to a file of between 4 kbytes and
16 kbytes. For example, writing a 16-kbyte file with an st_blksize of 4k is
twice as slow as with 16k (220 ms compared to 110 ms). The penalty is smaller
for an 8-kbyte file (105 ms vs 66 ms).

>
> > I have patched the kernel with a larger, fixed value (simply 4*PAGE_SIZE,
> > to revert to the block size previously used), and the kernel and world
> > seem to be running fine.
> >
> > Seeing the XXX comment above, I'm a bit worried about keeping this new
> > st_blksize value.
> >
> > are there any drawbacks with running with this bigger buffer size value ?
>
> Mostly it doesn't matter, since buffering (clustering) hides the
> differences.

(as seen before, mostly)

> Without clustering, 16K is a much better default for disks
> than 4K, though not as good as the non-default va_blksize for regular
> files.  Newer disks might prefer 32K or 64k, but then the fs block size
> should also be increased from 16K.  Otherwise, increasing the block size
> usually reduces performance, by thrashing caches or increasing latencies.
> With modern cache sizes and disk speeds, you won't see these effects for a
> block size of 64K, so defaulting to 64K would be reasonable for disks.  It
> would be silly for keyboards, but with modern memory sizes you would notice
> this even less than when it was that in old versions.

OK, thanks for the answer : I will put the change through more stress tests
and hope to shake out any problems before putting it into production.

	TfH

>
> Bruce
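
Since the thread turns on stdio sizing its buffer from st_blksize, here is a minimal userland sketch of how an application might inspect that value and override the buffer size itself, without patching the kernel. It is only an illustration under stated assumptions: the helper names reported_blksize() and force_stdio_bufsize() are hypothetical and appear nowhere in the thread, and the 16K figure in the usage note is simply the old UFS-derived size mentioned above. Only standard calls are used (fstat(), fileno(), setvbuf()).

```c
/* Request POSIX declarations (fileno, fstat) under strict C modes. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: return the st_blksize the kernel reports for the
 * file behind fp, or `fallback` if fstat() fails or the reported value
 * is not positive.
 */
size_t
reported_blksize(FILE *fp, size_t fallback)
{
	struct stat sb;

	assert(fp != NULL);
	if (fstat(fileno(fp), &sb) != 0 || sb.st_blksize <= 0)
		return (fallback);
	return ((size_t)sb.st_blksize);
}

/*
 * Hypothetical helper: ask stdio for a fully buffered stream of `size`
 * bytes instead of whatever it would pick by default.  Call it right
 * after opening the stream, before any read or write on it.  Passing
 * NULL as the buffer lets stdio allocate the storage itself.
 * Returns 0 on success, non-zero on failure (as setvbuf() does).
 */
int
force_stdio_bufsize(FILE *fp, size_t size)
{
	assert(fp != NULL);
	return (setvbuf(fp, NULL, _IOFBF, size));
}
```

A caller would typically fopen() the file, call force_stdio_bufsize(fp, 16 * 1024), and then write as usual. Whether this recovers the timings quoted above on a given flash device is untested here; it would have to be measured.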