From owner-freebsd-amd64@FreeBSD.ORG Thu May 26 00:38:58 2005
Date: Thu, 26 May 2005 10:38:44 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
To: Sven Willenberger
Cc: freebsd-amd64@FreeBSD.org
Subject: Re: BKVASIZE for large block-size filesystems
In-Reply-To: <1117055183.13183.57.camel@lanshark.dmv.com>
Message-ID: <20050526090743.S75084@delplex.bde.org>
References: <1117055183.13183.57.camel@lanshark.dmv.com>

On Wed, 25 May 2005, Sven Willenberger wrote:

> [originally posted to freebsd-stable, realized that some amd64-specific
> info may be needed here too]

It's not very amd64-specific, due to bugs.  BKVASIZE and the algorithms
that use it are tuned for i386's, which gives mistuning on arches that
have more kernel virtual address space.
> FreeBSD 5.4-Stable amd64 on a dual-Opteron system with an LSI MegaRAID
> 400G+ partition.  The filesystem was created with:
> newfs -b 65536 -f 8192 -e 15835 /dev/amrd2s1d
>
> This is the data filesystem for a PostgreSQL database; as the default
> page size (files) is 8k, the above newfs scheme has 8k fragments which
> should fit nicely with the PostgreSQL page size.  Now by default param.h

Fragments don't work very well.  It might be better to fit the files to
the block size.  If all files had size 8K, then -b 8192 -f 8192 would
work best (slightly better than -b 8192 -f 1024, which is slightly
better than the current defaults, and all much better than
-b 65536 -f 8192).

> defines BKVASIZE as 16384 (which has been pointed out in other posts as
> being *not* twice the default blocksize of 16k).  I have modified it to
> be set at 32768 but still see a high and increasing value of
> vfs.bufdefragcnt, which makes sense given the blocksize of the major
> filesystem in use.

Yes, a block size larger than BKVASIZE will cause lots of fragmentation.
I'm not sure if this is still a large pessimization.

> My question is: are there any caveats about increasing BKVASIZE to
> 65536?  The system has 8G of RAM and I understand that nbufs decreases
> with increasing BKVASIZE;

The decrease in nbuf is a bug.  It defeats half of the point of
increasing BKVASIZE: if most buffers have size 64K, then increasing
BKVASIZE from 16K to 64K gives approximately nbuf/4 buffers, all of size
64K, instead of nbuf buffers with nbuf/4 of them of size 64K and
3*nbuf/4 of them unusable.  Thus it avoids some resource wastage, at a
cost of possibly not using enough resources for effective caching.
However, little is lost if most buffers have size 64K: then the reduced
nbuf consumes all of the kva resources that we are willing to allocate.
The problem arises when file systems are mixed and the ones with a block
size of 64K are used little or not at all.  The worst case is when all
blocks have size 512, which can happen for msdosfs.
Then up to (BKVASIZE - 512) / BKVASIZE of the kva resource is wasted
(> 99% for BKVASIZE = 65536, but only 97% for BKVASIZE = 16384).  To fix
the bug, change BKVASIZE in kern_vfs_bio_buffer_alloc() to 16384 and
consider adjusting the maxbcache tunable (see below).

> how can I either determine if the resulting nbufs
> will be sufficient or calculate what is needed based on RAM and system
> usage?

nbuf is not directly visible except using a debugger, but
vfs.maxbufspace gives it indirectly -- divide the latter by BKVASIZE to
get nbuf.  A few thousand for it is plenty.

I used to use BKVASIZE = 65536, fixed the bug as above, doubled nbuf in
kern_vfs_bio_buffer_alloc(), and configured VM_BCACHE_SIZE_MAX to 512M
so that the elevated nbuf was actually used.  But the need for
significantly increasing the default nbuf (at least with BKVASIZE =
16384) went away many years ago when memory sizes started exceeding 256M
or so, and my doubling of nbuf broke a few years later when memory sizes
started exceeding 1GB.  i386's just don't have enough virtual address
space to use a really large nbuf, so when there is enough physical
memory the default nbuf is as large as possible.  I was only tuning
BKVASIZE and VM_BCACHE_SIZE_MAX to benchmark file systems with large
block sizes, and since the performance with large block sizes was poor
even with this tuning, I lost interest.  Now I just use the defaults,
and the bug fix reduces to a spelling change.

nbuf defaults to about 7000 on my machines with 1GB of memory.  This is
plenty.  With BKVASIZE = 64K and without the fix, it would be 1/4 as
much, which seems a little low.  nbuf is also limited by kernel virtual
memory.  amd64's have more (I'm not sure how much), and they should have
so much more that the bcache part is effectively infinite, but it is (or
was) actually only twice as much as on i386's (default
VM_BCACHE_SIZE_MAX = 200MB on i386's and 400MB on amd64's).
Even i386's can spare more, provided the memory is not needed for other
things, e.g., networking.  The default of 400MB on amd64's combined with
BKVASIZE = 64K gives a limit on nbuf of 400MB/64K = 6400, which is
plenty, so you shouldn't need to change the maxbcache tunable.

> Also, will increasing BKVASIZE require a complete make buildworld or,
> if not, how can I remake the portions of system affected by BKVASIZE?

It's not a properly supported option, so the way to change it is to edit
it in the sys/param.h source file.  After changing it there, everything
will be rebuilt as necessary by makeworld and/or rebuilding kernels.
Unfortunately, almost everything will be rebuilt, because too many
things depend on sys/param.h.  When testing changes to BKVASIZE, I used
to cheat by preserving the timestamp of sys/param.h and manually
recompiling only the necessary things.  Very little depends on BKVASIZE:
IIRC, there used to be 2 object files per kernel that did, but now there
is only 1 (vfs_bio.o).

Bruce