Date: Sun, 11 Apr 2010 12:56:02 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Andriy Gapon
Cc: arch@freebsd.org, Rick Macklem
Subject: Re: (in)appropriate uses for MAXBSIZE
Message-ID: <20100411114405.L10562@delplex.bde.org>
In-Reply-To: <4BBF3C5A.7040009@freebsd.org>
References: <4BBEE2DD.3090409@freebsd.org> <4BBF3C5A.7040009@freebsd.org>

On Fri, 9 Apr 2010, Andriy Gapon wrote:

> on 09/04/2010 16:53 Rick Macklem said the following:
>> On Fri, 9 Apr 2010, Andriy Gapon wrote:
>>
>>> Nowadays several questions could be asked about MAXBSIZE.
>>> - Will we have to consider increasing MAXBSIZE?  Provided ever
>>> increasing media sizes, typical filesystem sizes, typical file sizes
>>> (all that multimedia) and even media sector sizes.
>>
>> I would certainly like to see a larger MAXBSIZE for NFS.  Solaris10
>> currently uses 128K as a default I/O size and allows up to 1Mb.

Er, the maximum size of buffers in the buffer cache is especially
irrelevant for nfs.  It is almost irrelevant for physical disks, because
clustering normally increases the bulk transfer size to MAXPHYS.
Clustering takes a lot of CPU but doesn't affect the transfer rate much
unless there is not enough CPU.  It is even less relevant for network
i/o, since there is a sort of reverse-clustering -- the buffers get
split up into tiny packets (normally 1500 bytes less some header bytes)
at the hardware level.  Again a lot of CPU is involved in doing the
(reverse) clustering, and again this doesn't affect the transfer rate
much.  However, 1500 is so tiny that the reverse-clustering ratio of
MAXBSIZE to the wire i/o size (65536/1500) is much larger than the
normal clustering ratio of MAXPHYS to MAXBSIZE (131072/65536), and the
extra CPU is more significant for network i/o.  (These aren't the
actual normal ratios, but the limits attainable by varying only the
block sizes under the file system's control.)
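As a rough illustration only (a standalone sketch using the nominal
sizes mentioned above, not code from the tree; the real values are
configuration-dependent):

    /*
     * Back-of-the-envelope comparison of disk-side clustering vs.
     * network-side reverse-clustering, using the nominal sizes above.
     */
    #include <stdio.h>

    #define MAXBSIZE_NOMINAL  65536     /* max buffer cache buffer size */
    #define MAXPHYS_NOMINAL   131072    /* max physical i/o size (128K) */
    #define MTU_PAYLOAD       1500      /* ethernet payload, less headers */

    int
    main(void)
    {
            /* disk side: buffers are clustered up to MAXPHYS */
            double cluster_ratio = (double)MAXPHYS_NOMINAL / MAXBSIZE_NOMINAL;

            /* network side: buffers are split down to packet-sized pieces */
            double split_ratio = (double)MAXBSIZE_NOMINAL / MTU_PAYLOAD;

            printf("clustering ratio (disk):      %.1f\n", cluster_ratio);
            printf("reverse-clustering (network): %.1f\n", split_ratio);
            return (0);
    }

This prints about 2.0 for the disk side and about 43.7 for the network
side, which is why the splitting work dominates.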
However, increasing the network i/o size can make little difference to
this problem -- it can only increase the already-too-large
reverse-clustering ratio, while possibly reducing other
reverse-clustering ratios (the others are for assembling the nfs
buffers from local file system buffers; the local file system buffers
are normally disassembled from pbuf size (MAXPHYS) to file system size
(normally 16K); then conversion to nfs buffers involves either a sort
of clustering or reverse clustering, depending on the relative sizes of
the buffers).  Any further gains from increasing the network i/o size
are limited: tcp allows larger buffers at intermediate levels, but they
still get split up at the hardware level, and only some networks allow
jumbo frames.

>> Using larger I/O sizes for NFS is a simpler way to increase bulk data
>> transfer rate than more buffers and more aggressive
>> read-ahead/write-behind.

I'm not sure about that.  Read-ahead and write-behind is already very
aggressive but seems to be not working right.  I use some patches by
Bjorn Groenwald (?) which make it work better for the old nfs
implementation (I haven't tried the experimental one).  The problems
seem to be mainly timing ones.  vfs clustering makes the buffer sizes
almost irrelevant for physical disks, but there are latency problems
for the network i/o.  The latency problems seem to be larger for reads
than for writes.  I get best results by using the same size for network
buffers as for local buffers (16K).  This avoids 1 layer of buffer size
changing (see above), and using 16K buffers avoids buffer kva
fragmentation (see below).  I saw little difference from changing the
user buffer size, except that small buffers tend to work better and the
smallest (512-byte) buffers may have actually worked best, I think by
reducing latencies.

> I have lightly tested this under qemu.
> I used my avgfs:) modified to issue 4*MAXBSIZE bread-s.
> I removed size > MAXBSIZE check in getblk (see a parallel thread
> "panic: getblk: size(%d) > MAXBSIZE(%d)").

Did you change the other known things that depend on this?  There is
the b_pages limit of MAXPHYS bytes, which should be checked for in
another way, and the soft limits for hibufspace and lobufspace, which
only matter under load conditions.

> And I bumped MAXPHYS to 1MB.
>
> Some results.
> I got no panics, data was read correctly and system remained stable,
> which is good.
> But I observed reading process (dd bs=1m on avgfs) spending a lot of
> time sleeping on needsbuffer in getnewbuf.  needsbuffer value was
> VFS_BIO_NEED_ANY.
> Apparently there was some shortage of free buffers.
> Perhaps some limits/counts were incorrectly auto-tuned.

This is not surprising, since even 64K is 4 times too large to work
well.  Buffer sizes larger than BKVASIZE (16K) always cause
fragmentation of buffer kva.  Recovering from fragmentation always
takes a lot of CPU, and if you are unlucky it will also take a lot of
real time (stalling waiting for free buffer kva).  Buffer sizes larger
than BKVASIZE also reduce the number of available buffers significantly
below the number of buffers configured.  This mainly takes a lot of CPU
to reconstitute buffers.

BKVASIZE being less than MAXBSIZE is a hack to reduce the amount of kva
statically allocated for buffers, for systems that cannot support
enough kva to work right (mainly i386's).  It only works well when it
is not actually used (when all buffers have size <= BKVASIZE = 16K, as
would be enforced by reducing MAXBSIZE to BKVASIZE).
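To put rough numbers on the effect of buffer sizes larger than BKVASIZE
(a hypothetical, standalone calculation with made-up totals; the
kernel's real sizing of the buffer map is more involved):

    /*
     * Simplified illustration: buffer kva is sized assuming roughly
     * BKVASIZE of kva per buffer, so buffers larger than BKVASIZE use
     * up several buffers' worth of kva each.  Values are illustrative,
     * not read from a running kernel.
     */
    #include <stdio.h>

    #define BKVASIZE_NOMINAL  16384   /* nominal kva reserved per buffer */
    #define NBUF_NOMINAL      8192    /* configured number of buffers */
    #define BUFKVA            ((long)NBUF_NOMINAL * BKVASIZE_NOMINAL)

    int
    main(void)
    {
            long bsize;

            for (bsize = 16384; bsize <= 131072; bsize *= 2)
                    printf("buffer size %3ldK: at most %4ld buffers fit "
                        "in the buffer map\n",
                        bsize / 1024, BUFKVA / bsize);
            return (0);
    }

With these made-up totals, 64K buffers cut the number of buffers that
fit from 8192 to 2048, the factor-of-4 reduction mentioned above, on
top of the fragmentation cost.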
This hack and the complications to support it are bogus on systems
that support enough kva to work right.  nfs buffers larger than 16K
would exceed BKVASIZE.  This may have been why nfs buffer sizes of 32K
gave negative benefits.

Bruce