From: Bruce Evans <brde@optusnet.com.au>
To: Konstantin Belousov
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Bruce Evans
Date: Sun, 16 Jun 2013 05:55:56 +1000 (EST)
Subject: Re: svn commit: r251282 - head/sys/kern
Message-ID: <20130616034707.A899@besplex.bde.org>
In-Reply-To: <20130615104301.GL91021@kib.kiev.ua>
References: <201306030416.r534GmCA001872@svn.freebsd.org> <51AC1B49.9090001@mu.org> <20130603075539.GK3047@kib.kiev.ua> <51AC60CA.6050105@mu.org> <20130604052219.GP3047@kib.kiev.ua> <20130604170410.M1018@besplex.bde.org> <20130615104301.GL91021@kib.kiev.ua>
On Sat, 15 Jun 2013, Konstantin Belousov wrote:

> On Tue, Jun 04, 2013 at 06:14:49PM +1000, Bruce Evans wrote:
>> On Tue, 4 Jun 2013, Konstantin Belousov wrote:
>>> On Mon, Jun 03, 2013 at 02:24:26AM -0700, Alfred Perlstein wrote:
>>>> On 6/3/13 12:55 AM, Konstantin Belousov wrote:
>>>>> On Sun, Jun 02, 2013 at 09:27:53PM -0700, Alfred Perlstein wrote:
>>>>>> Hey Konstantin, shouldn't this be scaled against the actual amount of
>>>>>> KVA we have instead of an arbitrary limit?
>>>>> The commit changes the buffer cache to scale according to the available
>>>>> KVA, making the scaling less dumb.
>>>>>
>>>>> I do not understand what exactly you want to do; please describe the
>>>>> algorithm you propose to implement instead of my change.
>>>>
>>>> Sure, how about deriving the hardcoded "32" from the maxkva a machine
>>>> can have?
>>>>
>>>> Is that possible?
>>> I do not see why this would be useful. Initially I thought about simply
>>> capping nbuf at 100000 without referencing any "memory". Then I realized
>>> that this would somewhat conflict with (unlikely) changes to the value
>>> of BKVASIZE due to "factor".
>>
>> The presence of BKVASIZE in 'factor' is a bug. My version never had this
>> bug (see below for a patch). The scaling should be to maximize nbuf,
>> subject to non-arbitrary limits on physical memory and kva, and now an
>> arbitrary limit of about 100000 / (BKVASIZE / 16384) on nbuf. Your new
>> limit is arbitrary, so it shouldn't affect nbuf depending on BKVASIZE.
>
> I disagree with the statement that the goal is to maximize nbuf. The
> buffer cache currently is nothing more than a header and i/o record for
> the set of wired pages. For non-metadata on UFS, buffers do not map
> the pages into KVA, so a buffer becomes purely an array of pointers to
> pages plus some additional bookkeeping.

Er, since dyson and I designed BKVASIZE with that goal, I know what its
goal is.
> I want to eventually break the coupling between the size of the buffer
> map and nbuf. Right now, typical population of the buffer map is around
> 20%, which means that we waste >= 100MB of KVA on 32bit machines, where
> the KVA is precious. I would also consider shrinking nbuf much lower,
> but the cost of wiring and unwiring the pages for buffer creation and
> reuse is the blocking point.

Yes, "some additional bookkeeping" is "a lot of additional bookkeeping"
when nbuf is low relative to the number of active disk blocks. Small
block sizes expand the number of active disk blocks by a large factor,
e.g., 64 for ffs's default block size of 32K relative to msdosfs's
smallest block size of 512.

This reminds me that I tried to get dyson to implement a better kva
allocation scheme. At a cost of dividing the nominal number of buffers
by a factor of about 5, but with a gain of avoiding all fragmentation
and all kva allocation overheads, small block sizes down to PAGE_SIZE
can have as much space allocated for them (space = number of buffers of
this size times block size) as large blocks. Use a power of 2 method.
Start with a desired value of nbuf and sacrifice a large fraction of it.
Numbers with NOMBSIZE = 16K and PAGE_SIZE = 4K:

    statically allocate kva for nbuf/4 buffers of kvasize 64K each
    statically allocate kva for nbuf/2 buffers of kvasize 32K
    statically allocate kva for nbuf/1 buffers of kvasize 16K
    statically allocate kva for 2*nbuf buffers of kvasize 8K
    statically allocate kva for 4*nbuf buffers of kvasize 4K

Total allocations: 7.75*nbuf buffers, with total kvasize 5*nbuf*16K. To
avoid expanding the total kvasize, reduce nbuf by a factor of 5.

This doesn't work so well for fs block sizes of < 4K. Allocate many
more than 4*nbuf buffers of size 4K to support them. Expanding nbuf
would waste kva, but currently, expanding nbuf wastes 4 times as much
kva and also messes up secondary variables like the dirty buffer
watermarks.
There is still the cost of mapping buffers into the allocated kva, but
with more buffers of smaller sizes there is less thrashing of the
buffers, so fewer remappings. When dyson implemented BKVASIZE in 1996,
the whole i386 kernel only had 256MB, so fitting enough buffers into it
was even harder than now. The i386 kernel kva size wasn't increased to
its current 1GB until surprisingly recently (1999).

> ...
>> BKVASIZE was originally 8KB. I forget if nbuf was halved by not
>> modifying the scale factor when it was expanded to 16KB. Probably not.
>> I used to modify the scale factor to get twice as many as the default
>> nbuf, but once the default nbuf expanded to a few thousand it became
>> large enough for most purposes, so I no longer do this.
>
> Now, with the default UFS block size being 32KB, it is effectively
> halved once more.

Yes, in a bad way for ffs. When most block sizes are 32K, it is only
possible to use half of nbuf. Fragmentation occurs if there are mixtures
of 32K-blocks and other block sizes. Fragmentation wastes time (also
space, but no more than is already wasted statically). BKVASIZE should
have been doubled to match the doubling of the default block size (its
comment still hasn't caught up with the previous doubling of the default
ffs block size, and still says that BKVASIZE is "2x the block size used
by a normal UFS [sic] file system", and warns about the danger of making
it too small), but then file systems with smaller block sizes would be
penalized. The result is similar to that given by my power of 2 method
with 2 buffer sizes:

    statically allocate kva for nbuf/2 buffers of kvasize 32K
    statically allocate kva for nbuf/1 buffers of kvasize 16K

except it uses 2/3 as many buffers and 1/2 as much kva as my method, at
a cost of complexity and fragmentation.
Also note that with BKVASIZE = 32K, it is only a factor of 2 away from
MAXBSIZE = 64K (until that is increased), so you could increase BKVASIZE
by another factor of 2 and only halve nbuf by another factor of 2. The
complexity and fragmentation go away.

Increasing MAXBSIZE would cause interesting problems. Fragmentation
would be severe if some block sizes are many more factors of 2 larger
than BKVASIZE. If MAXBSIZE is really large (say 1MB), then you can't
increase BKVASIZE to it without wasting a really large amount of kva or
reducing nbuf really significantly, so dynamic sizing becomes necessary
again, perhaps even on 64-bit arches.

Neither MAXBSIZE nor BKVASIZE is a kernel option. BKVASIZE should have
been one from the beginning. An optional MAXBSIZE has much wider scope.
For example, systems with a larger MAXBSIZE can create ffs file systems
that cannot be mounted on systems with the historical MAXBSIZE.

Bruce