Date: Thu, 26 May 2005 09:02:59 -0400
From: Sven Willenberger <sven@dmv.com>
To: Bruce Evans <bde@zeta.org.au>
Cc: freebsd-amd64@FreeBSD.org
Subject: Re: BKVASIZE for large block-size filesystems
Message-ID: <1117112579.15065.30.camel@lanshark.dmv.com>
In-Reply-To: <20050526090743.S75084@delplex.bde.org>
References: <1117055183.13183.57.camel@lanshark.dmv.com> <20050526090743.S75084@delplex.bde.org>
On Thu, 2005-05-26 at 10:38 +1000, Bruce Evans wrote:
> On Wed, 25 May 2005, Sven Willenberger wrote:
>
> > [originally posted to freebsd-stable, realized that some amd64-specific
> > info may be needed here too]
>
> It's not very amd64-specific due to bugs. BKVASIZE and the algorithms that
> use it are tuned for i386's. This gives mistuning for arches that have
> more kernel virtual address space.
>
> > FreeBSD 5.4-Stable amd64 on a dual-Opteron system with an LSI MegaRAID,
> > 400G+ partition. The filesystem was created with:
> > newfs -b 65536 -f 8192 -e 15835 /dev/amrd2s1d
> >
> > This is the data filesystem for a PostgreSQL database; as the default
> > page size (files) is 8k, the above newfs scheme has 8k fragments which
> > should fit nicely with the PostgreSQL page size. Now by default param.h
>
> Fragments don't work very well. It might be better to fit files to the
> block size. If all files had size 8K, then -b 8192 -f 8192 would work
> best (slightly better than -b 8192 -f 1024, which is slightly better than
> the current defaults, and all much better than -b 65536 -f 8192).

Oh, how I wish I had known that prior to creating the filesystem. I
wanted to avoid -b 8192 -f 1024 because of the small fragment size; I had
assumed that a fragment size matching the page size used by the database
would be ideal. Since the manpages seem to imply that anything other than
an 8:1 ratio of block size to fragment size would be detrimental, I stayed
away from -b 8192 -f 8192. I am curious what the concept behind fragments
is then (versus my picture) and why they "don't work very well" ...

> > defines BKVASIZE as 16384 (which has been pointed out in other posts as
> > being *not* twice the default blocksize of 16k). I have modified it to
> > be set at 32768 but still see a high and increasing value of
> > vfs.bufdefragcnt, which makes sense given the blocksize of the major
> > filesystem in use.
>
> Yes, a block size larger than BKVASIZE will cause lots of fragmentation.
> I'm not sure if this is still a large pessimization.
>
> > My question is: are there any caveats about increasing BKVASIZE to 65536?
> > The system has 8G of RAM and I understand that nbuf decreases with
> > increasing BKVASIZE;
>
> The decrease in nbuf is a bug. It defeats half of the point of increasing
> BKVASIZE: if most buffers have size 64K, then increasing BKVASIZE from 16K
> to 64K gives approximately nbuf/4 buffers all of size 64K instead of nbuf
> buffers, with nbuf/4 of them of size 64K and 3*nbuf/4 of them unusable.
> Thus it avoids some resource wastage at a cost of possibly not using enough
> resources for effective caching. However, little is lost if most buffers
> have size 64K. Then the reduced nbuf consumes all of the kva resources that
> we are willing to allocate. The problem is when file systems are mixed and
> ones with a block size of 64K are not used much or at all. The worst case
> is when all blocks have size 512, which can happen for msdosfs. Then up
> to (BKVASIZE - 512) / BKVASIZE of the kva resource is wasted (> 99% for
> BKVASIZE = 65536 but only 97% for BKVASIZE = 16384).
>
> To fix the bug, change BKVASIZE in kern_vfs_bio_buffer_alloc() to 16384
> and consider adjusting the maxbcache tunable (see below).

Ahh, so this is literally "replace the word BKVASIZE in that function with
the word 16384". I am assuming that I can leave the other instances of
BKVASIZE and BKVAMASK in that file (vfs_bio.c) alone then?

> > how can I either determine if the resulting nbuf
> > will be sufficient or calculate what is needed based on RAM and system
> > usage?
>
> nbuf is not directly visible except using a debugger, but vfs.maxbufspace
> gives it indirectly -- divide the latter by BKVASIZE to get nbuf. A few
> thousand for it is plenty.
>
> I used to use BKVASIZE = 65536, and fixed the bug as above, and also doubled
> nbuf in kern_vfs_bio_buffer_alloc(), and also configured VM_BCACHE_SIZE_MAX
> to 512M so that the elevated nbuf was actually used, but the need for
> significantly increasing the default nbuf (at least with BKVASIZE = 16384)
> went away many years ago when memory sizes started exceeding 256M or so.
> My doubling of nbuf broke a few years later when memory sizes started
> exceeding 1GB. i386's just don't have enough virtual address space to use
> a really large nbuf, so when there is enough physical memory the default
> nbuf is as large as possible. I was only tuning BKVASIZE and
> VM_BCACHE_SIZE_MAX to benchmark file systems with large block sizes, but
> the performance with large block sizes was poor even with this tuning, so
> I lost interest in it. Now I just use the defaults and the bug fix
> reduces to a spelling change. nbuf defaults to about 7000 on my machines
> with 1GB of memory. This is plenty. With BKVASIZE = 64K and without the
> fix, it would be 1/4 as much, which seems a little low.
>
> nbuf is also limited by kernel virtual memory. amd64's have more (I'm not
> sure how much), and they should have so much more that the bcache part
> is effectively infinity, but it is or was actually only twice as much
> as on i386's (default VM_BCACHE_SIZE_MAX = 200MB on i386's and 400MB
> on amd64's). Even i386's can spare more provided the memory is not
> needed for other things, e.g., networking. The default of 400MB on
> amd64's combined with BKVASIZE gives a limit on nbuf of 400MB/64K = 6400,
> which is plenty, so you shouldn't need to change the bcache tunable.

I shall leave that tunable alone then.

> > Also, will increasing BKVASIZE require a complete make buildworld or, if
> > not, how can I remake the portions of the system affected by BKVASIZE?
>
> It's not a properly supported option, so the way to change it is to
> edit it in the sys/param.h source file.
> After changing it there,
> everything will be rebuilt as necessary by makeworld and/or
> rebuilding kernels. Unfortunately, almost everything will be rebuilt
> because too many things depend on sys/param.h. When testing
> changes to BKVASIZE, I used to cheat by preserving the timestamp of
> sys/param.h and manually recompiling only the necessary things. Very
> little depends on BKVASIZE. IIRC, there used to be 2 object files
> per kernel, but now there is only 1 (vfs_bio.o).
>
> Bruce

Sounds good; I appreciate the input and the explanations -- they really
cleared up a good bit of stuff for me.

Thanks,

Sven