From: Andriy Gapon <avg@freebsd.org>
Date: Mon, 12 Apr 2010 19:02:10 +0300
To: Bruce Evans
Cc: arch@freebsd.org, Rick Macklem
Subject: Re: (in)appropriate uses for MAXBSIZE
Message-ID: <4BC34402.1050509@freebsd.org>
In-Reply-To: <20100411114405.L10562@delplex.bde.org>
References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org> <20100411114405.L10562@delplex.bde.org>

on 11/04/2010 05:56 Bruce Evans said the following:
> On Fri, 9 Apr 2010, Andriy Gapon wrote:
[snip]
>> I have lightly tested this under qemu.
>> I used my avgfs:) modified to issue 4*MAXBSIZE bread()s.
>> I removed the size > MAXBSIZE check in getblk() (see the parallel thread
>> "panic: getblk: size(%d) > MAXBSIZE(%d)").
>
> Did you change the other known things that depend on this?  There is the
> b_pages limit of MAXPHYS bytes which should be checked for in another
> way

I changed the check in the way I described in the parallel thread.

> and the soft limits for hibufspace and lobufspace which only matter
> under load conditions.

And what should these be?
hibufspace and lobufspace seem to be auto-calculated.
One thing that I noticed, and that was a direct cause of the problem described
below, is that the difference between hibufspace and lobufspace should be at
least the maximum block size allowed in getblk() (perhaps it should be strictly
equal to that value?).
So in my case I had to make that difference MAXPHYS.
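To illustrate, here is a rough userland sketch of the tuning I mean.  The
maxbufspace value is made up and the formulas only approximate what bufinit()
in kern/vfs_bio.c computes; the point is just the invariant that the
hibufspace - lobufspace gap must cover the largest block size that getblk()
will accept:

/*
 * Sketch of the hibufspace/lobufspace relation; the constants and the
 * pretend maxbufspace are illustrative, not the real auto-tuned values.
 */
#include <stdio.h>

#define	MAXPHYS	(1024 * 1024)	/* bumped to 1MB in my test */

int
main(void)
{
	long maxbufspace, hibufspace, lobufspace;
	long maxbsize = MAXPHYS;	/* new getblk() block size limit */

	maxbufspace = 200L * 1024 * 1024;	/* pretend auto-tuned value */
	hibufspace = maxbufspace - 10 * maxbsize;
	lobufspace = hibufspace - maxbsize;	/* was hibufspace - MAXBSIZE */

	printf("gap %ld, max block %ld: %s\n",
	    hibufspace - lobufspace, maxbsize,
	    hibufspace - lobufspace >= maxbsize ? "ok" : "too small");
	return (0);
}

With the stock lobufspace = hibufspace - MAXBSIZE but a getblk() limit of
MAXPHYS the gap comes out too small, which I believe is what left getnewbuf()
sleeping on needsbuffer in my test.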
>> And I bumped MAXPHYS to 1MB.
>>
>> Some results.
>> I got no panics, data was read correctly and the system remained stable,
>> which is good.
>> But I observed the reading process (dd bs=1m on avgfs) spending a lot of
>> time sleeping on needsbuffer in getnewbuf().  The needsbuffer value was
>> VFS_BIO_NEED_ANY.
>> Apparently there was some shortage of free buffers.
>> Perhaps some limits/counts were incorrectly auto-tuned.
>
> This is not surprising, since even 64K is 4 times too large to work
> well.  Buffer sizes of larger than BKVASIZE (16K) always cause
> fragmentation of buffer kva.  Recovering from fragmentation always
> takes a lot of CPU, and if you are unlucky it will also take a lot of
> real time (stalling waiting for free buffer kva).  Buffer sizes larger
> than BKVASIZE also reduce the number of available buffers significantly
> below the number of buffers configured.  This mainly takes a lot of
> CPU to reconstitute buffers.  BKVASIZE being less than MAXBSIZE is a
> hack to reduce the amount of kva statically allocated for buffers for
> systems that cannot support enough kva to work right (mainly i386's).
> It only works well when it is not actually used (when all buffers have
> size <= BKVASIZE = 16K, as would be enforced by reducing MAXBSIZE to
> BKVASIZE).  This hack and the complications to support it are bogus on
> systems that support enough kva to work right.

So BKVASIZE is the best read size from the point of view of buffer space usage?
E.g. a single MAXBSIZE=64K read results in a single 64K GEOM read request, but
leads to buffer space map fragmentation, because size > BKVASIZE.
On the other hand, four sequential reads of BKVASIZE=16K bytes each are perfect
from the buffer space point of view (no fragmentation potential), but they
result in four GEOM I/O requests.

The thing is that a single read requires a single contiguous chunk of virtual
address space.
Would it be possible to take the best of both worlds by somehow allowing a
single large I/O request to work with several buffers (with b_kvasize ==
BKVASIZE) in an iovec-like style?  (There is a userland sketch of what I mean
at the end of this message.)
Have I just reinvented the wheel? :)
Probably not, because the answer to my question is probably 'no (not without
lots of work in lots of places)' as well.

I see that breadn() certainly doesn't work that way.
As I understand it, it works like bread() for one block, plus starts something
like 'asynchronous bread()s' for a given count of other blocks.
I am not sure about the details of how cluster_read() works, though.
Could you please explain the essence of it?
Thank you!

Perhaps there are other approaches to the fragmentation issue.
Like, for example, using a sort of zones for different block sizes.
But that all adds complications and takes away performance in the easy cases.
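On the iovec-like idea above, the userland shape of what I imagine is
readv(2): one request that scatters into several separately allocated
BKVASIZE-sized buffers, so no contiguous 64K mapping is needed.  This is
only an analogy, not a proposal for the kernel interface; the buf and GEOM
layers would have to cooperate, which is exactly the 'lots of work in lots
of places' part:

/*
 * One read request, four discontiguous 16K buffers.  A kernel
 * equivalent would be one GEOM request completing into several
 * b_kvasize == BKVASIZE buffers.
 */
#include <sys/types.h>
#include <sys/uio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define	BKVASIZE	16384
#define	NCHUNKS		4	/* 4 * 16K == MAXBSIZE == 64K */

int
main(int argc, char **argv)
{
	struct iovec iov[NCHUNKS];
	ssize_t n;
	int fd, i;

	if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0) {
		fprintf(stderr, "usage: %s file\n", argv[0]);
		return (1);
	}
	for (i = 0; i < NCHUNKS; i++) {
		if ((iov[i].iov_base = malloc(BKVASIZE)) == NULL)
			return (1);
		iov[i].iov_len = BKVASIZE;
	}
	n = readv(fd, iov, NCHUNKS);	/* one request, four buffers */
	printf("read %zd bytes in one request\n", n);
	close(fd);
	return (0);
}

-- 
Andriy Gapon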