From: Alfred Perlstein <bright@mu.org>
Date: Fri, 07 Jun 2013 13:01:49 -0700
To: Adrian Chadd
Cc: Konstantin Belousov, svn-src-head@freebsd.org, svn-src-all@freebsd.org,
 src-committers@freebsd.org, Bruce Evans
Subject: Re: svn commit: r251282 - head/sys/kern
Message-ID: <51B23C2D.7000203@mu.org>
References: <201306030416.r534GmCA001872@svn.freebsd.org> <51AC1B49.9090001@mu.org>
 <20130603075539.GK3047@kib.kiev.ua> <51AC60CA.6050105@mu.org>
 <20130604052219.GP3047@kib.kiev.ua> <20130604170410.M1018@besplex.bde.org>
 <51AF2AE8.4080308@mu.org>

On 6/5/13 10:55 AM, Adrian Chadd wrote:
> ... can people please boot an i386 kernel w/ this stuff in it, with
> the RAM constrained to something small (say, 128MB), and verify that
> you can at least start a buildworld (minus clang, of course) without
> panicking?
> There's a recent PR
> (http://www.freebsd.org/cgi/query-pr.cgi?pr=179112) which shows
> regressions on i386 + small-RAM platforms.
>
> I know it's a different area, but I'd really appreciate it if people
> tested this stuff out better before they make i386 (and by extension,
> any platform with limited KVA and small amounts of RAM) unusable
> through carelessness.

Adrian, I'm pretty sure that kib's patch actually limits memory. The
concern I have is that it is a hard limit that appears to be based not
on KVA, but rather on 32GB. I could be wrong, but Bruce's feedback
looks like it backs up my concern.

-Alfred

> Thanks,
>
>
> Adrian
>
> On 5 June 2013 05:11, Alfred Perlstein wrote:
>> Konstantin, is there any way to take some of Bruce's feedback into
>> account to get rid of the hard-coded max?
>>
>> -Alfred
>>
>>
>> On 6/4/13 1:14 AM, Bruce Evans wrote:
>>> On Tue, 4 Jun 2013, Konstantin Belousov wrote:
>>>
>>>> On Mon, Jun 03, 2013 at 02:24:26AM -0700, Alfred Perlstein wrote:
>>>>> On 6/3/13 12:55 AM, Konstantin Belousov wrote:
>>>>>> On Sun, Jun 02, 2013 at 09:27:53PM -0700, Alfred Perlstein wrote:
>>>>>>> Hey Konstantin, shouldn't this be scaled against the actual
>>>>>>> amount of KVA we have instead of an arbitrary limit?
>>>>>> The commit changes the buffer cache to scale according to the
>>>>>> available KVA, making the scaling less dumb.
>>>>>>
>>>>>> I do not understand what exactly you want to do; please describe
>>>>>> the algorithm you propose to implement instead of my change.
>>>>>
>>>>> Sure, how about deriving the hardcoded "32" from the maximum KVA a
>>>>> machine can have?
>>>>>
>>>>> Is that possible?
>>>> I do not see why this would be useful. Initially I thought about
>>>> simply capping nbuf at 100000 without referencing any "memory".
>>>> Then I realized that this would somewhat conflict with (unlikely)
>>>> changes to the value of BKVASIZE due to "factor".
>>>
>>> The presence of BKVASIZE in 'factor' is a bug.
>>> My version never had this bug (see below for a patch). The scaling
>>> should be to maximize nbuf, subject to non-arbitrary limits on
>>> physical memory and kva, and now an arbitrary limit of about
>>> 100000 / (BKVASIZE / 16384) on nbuf. Your new limit is arbitrary,
>>> so it shouldn't affect nbuf depending on BKVASIZE.
>>>
>>> Expanding BKVASIZE should expand kva use, but on i386 this will soon
>>> hit a non-arbitrary kva limit, so nbuf will not be as high as
>>> preferred. nbuf needs to be very large mainly to support file
>>> systems with small buffers. Even 100000 buffers give only 50MB of
>>> buffering if the fs block size is 512. This would shrink to only
>>> 12.5MB if BKVASIZE were expanded by a factor of 4 and the bug not
>>> fixed. If 25000 buffers after expanding BKVASIZE are enough, then
>>> that should be the arbitrary limit (independent of BKVASIZE) so as
>>> to save physical memory.
>>>
>>> On i386 systems with 1GB RAM, nbuf defaults to about 7000. With an
>>> fs block size of 512, that can buffer 3.5MB. Expanding BKVASIZE by a
>>> factor of 4 shrinks this to 0.875MB in -current. That is
>>> ridiculously small. VMIO limits the lossage from this.
>>>
>>> BKVASIZE was originally 8KB. I forget whether nbuf was halved by not
>>> modifying the scale factor when it was expanded to 16KB. Probably
>>> not. I used to modify the scale factor to get twice the default
>>> nbuf, but once the default nbuf expanded to a few thousand it became
>>> large enough for most purposes, so I no longer do this.
>>>
>>> Except on arches with extremely limited kva like i386, BKVASIZE
>>> should be MAXBSIZE, and all of the complications for expanding a
>>> buffer's kva beyond BKVASIZE and handling the fragmentation problems
>>> caused by this shouldn't exist. Limits like vfs.maxbufspace should
>>> be scaled by NOMBSIZE = current BKVASIZE instead of BKVASIZE. Oops,
>>> my NOMBSIZE has similar logical problems to BKVASIZE.
>>> It needs to be precisely 16KB to preserve the defaults for nbuf and
>>> maxbufspace, but the nominal (ffs) buffer size is now 32KB. So the
>>> heuristic scale factors should use the historical magic number 16K.
>>> I changed sequential_heuristic() to use a hard-coded 16K instead of
>>> BKVASIZE. The correct number here depends on disk technology.
>>>
>>> The patch has many style fixes:
>>>
>>> @ Index: vfs_bio.c
>>> @ ===================================================================
>>> @ RCS file: /home/ncvs/src/sys/kern/vfs_bio.c,v
>>> @ retrieving revision 1.436
>>> @ diff -u -2 -r1.436 vfs_bio.c
>>> @ --- vfs_bio.c	17 Jun 2004 17:16:49 -0000	1.436
>>> @ +++ vfs_bio.c	3 Jun 2013 16:04:34 -0000
>>> @ @@ -419,64 +493,54 @@
>>> @
>>> @  /*
>>> @ - * Calculating buffer cache scaling values and reserve space for buffer
>>> @ + * Calculate buffer cache scaling values and reserve space for buffer
>>> @   * headers.  This is called during low level kernel initialization and
>>> @   * may be called more then once.  We CANNOT write to the memory area
>>> @   * being reserved at this time.
>>> @   */
>>> @ -caddr_t
>>> @ -kern_vfs_bio_buffer_alloc(caddr_t v, long physmem_est)
>>> @ +void *
>>> @ +vfs_bio_alloc(void *va)
>>>
>>> The API name was verbose and bogus. The prefix for this subsystem is
>>> vfs_bio_, not kern_vfs_bio_. This is a generic allocation routine.
>>> It always allocated swbufs and has expanded to do more allocations.
>>>
>>> @  {
>>> @ -	/*
>>> @ -	 * physmem_est is in pages.  Convert it to kilobytes (assumes
>>> @ -	 * PAGE_SIZE is >= 1K)
>>> @ -	 */
>>> @ -	physmem_est = physmem_est * (PAGE_SIZE / 1024);
>>>
>>> I use the global physmem. This change may be too i386-specific. On
>>> old systems with 8-16MB of RAM, a significant amount of RAM may be
>>> eaten by the kernel image or perhaps just unavailable to the kernel.
>>> Now memories are
Now memories are >>> larger and the amount eaten is relatively insignificant (since we are >>> only using a small fraction of physmem, only the relative error matters). >>> >>> @ + int factor, memsize; >>> @ @ /* >>> @ - * The nominal buffer size (and minimum KVA allocation) is >>> BKVASIZE. >>> @ - * For the first 64MB of ram nominally allocate sufficient buffers >>> to >>> @ - * cover 1/4 of our ram. Beyond the first 64MB allocate additional >>> @ - * buffers to cover 1/20 of our ram over 64MB. When auto-sizing >>> @ - * the buffer cache we limit the eventual kva reservation to >>> @ + * The nominal buffer size is NOMBSIZE. >>> @ + * For the first 4MB of RAM, allocate 50 buffers. >>> @ + * For the next 60MB of RAM, nominally allocate sufficient buffers >>> to >>> @ + * cover 1/4 of the RAM. Beyond the first 64MB allocate additional >>> @ + * buffers to cover 1/20 of the RAM. When auto-sizing >>> @ + * the buffer cache, limit the eventual kva reservation to >>> @ * maxbcache bytes. >>> >>> Fix old bitrot in this comment, and update for my changes. >>> - the first 4MB wasn't mentioned >>> - the minimum clamp wasn't mentioned >>> - replace BKVASIZE by NOMBSIZE so that I can change BKVASIZE without >>> changing the default nbuf (subject to maxbcache). >>> >>> The bitrot of the magic number 1/20 is not fixed. The code uses 1/10. >>> This is fixed in -current. I should have noticed this before, since >>> I used to have to change the code to get 1/10 instead of 1/20. >>> >>> I hope you updated the comment for recent changes, but I don't like >>> having magic numbers in comments. >>> >>> @ * >>> @ - * factor represents the 1/4 x ram conversion. >>> @ + * `factor' represents the 1/4 of RAM conversion. >>> @ */ >>> @ +#define NOMBSIZE imax(PAGE_SIZE, 16384) /* XXX */ >>> @ if (nbuf == 0) { >>> @ - int factor = 4 * BKVASIZE / 1024; >>> @ + /* >>> @ + * physmem is in pages. Convert it to a size in kilobytes. >>> @ + * XXX no ptob(). 
>>> @ +		 */
>>> @ +		memsize = ctob(physmem) / 1024;
>>> @
>>> @ +		factor = 4 * NOMBSIZE / 1024;
>>> @  		nbuf = 50;
>>> @ -		if (physmem_est > 4096)
>>> @ -			nbuf += min((physmem_est - 4096) / factor,
>>> @ -			    65536 / factor);
>>> @ -		if (physmem_est > 65536)
>>> @ -			nbuf += (physmem_est - 65536) * 2 / (factor * 5);
>>> @ -
>>> @ -	if (maxbcache && nbuf > maxbcache / BKVASIZE)
>>> @ -		nbuf = maxbcache / BKVASIZE;
>>>
>>> The maxbcache limit must be applied to the non-default nbuf too.
>>>
>>> @ +		if (memsize > 4096)
>>> @ +			nbuf += imin((memsize - 4096) / factor,
>>> @ +			    (65536 - 4096) / factor);
>>> @ +		if (memsize > 65536)
>>> @ +			nbuf += (memsize - 65536) * 2 / (factor * 5);
>>> @  	}
>>> @ -
>>> @ -#if 0
>>> @ -	/*
>>> @ -	 * Do not allow the buffer_map to be more then 1/2 the size of the
>>> @ -	 * kernel_map.
>>> @ -	 */
>>> @ -	if (nbuf > (kernel_map->max_offset - kernel_map->min_offset) /
>>> @ -	    (BKVASIZE * 2)) {
>>> @ -		nbuf = (kernel_map->max_offset - kernel_map->min_offset) /
>>> @ -		    (BKVASIZE * 2);
>>> @ -		printf("Warning: nbufs capped at %d\n", nbuf);
>>> @ -	}
>>> @ -#endif
>>>
>>> Already removed in -current.
>>>
>>> @ +	if (nbuf > maxbcache / BKVASIZE)
>>> @ +		nbuf = maxbcache / BKVASIZE;
>>> @
>>> @  	/*
>>> @  	 * swbufs are used as temporary holders for I/O, such as paging I/O.
>>> @ -	 * We have no less then 16 and no more then 256.
>>> @ +	 * There must be between 16 and 256 of them.
>>> @  	 */
>>> @ -	nswbuf = max(min(nbuf/4, 256), 16);
>>> @ +	nswbuf = imax(imin(nbuf / 4, 256), 16);
>>> @  #ifdef NSWBUF_MIN
>>> @  	if (nswbuf < NSWBUF_MIN)
>>> @  		nswbuf = NSWBUF_MIN;
>>> @  #endif
>>> @ +
>>> @  #ifdef DIRECTIO
>>> @  	ffs_rawread_setup();
>>> @ @@ -484,12 +548,12 @@
>>> @
>>> @  	/*
>>> @ -	 * Reserve space for the buffer cache buffers
>>> @ +	 * Reserve space for swbuf headers and buffer cache buffers.
>>> @  	 */
>>> @ -	swbuf = (void *)v;
>>> @ -	v = (caddr_t)(swbuf + nswbuf);
>>> @ -	buf = (void *)v;
>>> @ -	v = (caddr_t)(buf + nbuf);
>>> @ +	swbuf = va;
>>> @ +	va = swbuf + nswbuf;
>>> @ +	buf = va;
>>> @ +	va = buf + nbuf;
>>> @
>>> @ -	return(v);
>>> @ +	return (va);
>>> @  }
>>> @
>>>
>>> Don't use the caddr_t mistake. The API was also changed to not use
>>> it.
>>>
>>> Bruce