Date: Thu, 27 Sep 2007 06:06:29 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: "Rick C. Petty"
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, Rick C. Petty wrote:

> On Wed, Sep 26, 2007 at 05:59:24PM +1000, Bruce Evans wrote:
>> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>>
>> That's insignificantly more.  Even doubling the size wouldn't make much
>> difference.  I see differences of at most 25% going the other way and
>
> Some would say that 25% difference is significant.  Obviously you disagree.

No, 25% is significant, but getting as much as 25% takes intentional
mistuning, combined with no attempt to optimize the mistuned case and
with general-case bugs that hurt the mistuned case more.

>> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
>> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S

er, the -e 1024 line should say "fixed default".

>> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
>> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>>     (kernel fixes are not complete enough to handle this case;
>>     defaults and -e values which are < the cg size work best except
>>     possibly when the fixes are complete): 45MB/S
>> 16K blocks, 2K-frags -e 2K (broken default): 50MB/S
>> 16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
>> 16K blocks, 2K-frags -e 8K (best): 51.5MB/S
>> 16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

64K blocks, 8K-frags: -e barely matters; throughput is close to the
max at 52MB/S.

(I was able to create a perfectly contiguous 1GB file (modulo indirect
blocks, which were allocated as contiguously as possible) on a fs with
a cg size of almost 2GB.  A second file would not have been allocated
so well, since it would be started in the same cg as the directory
inode = the same cg as the first file.)

> Are you talking about throughputs now?  I was just talking about space.
> Time and space are usually mutually-exclusive optimizations.

These are all throughputs, starting with a new file system.
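For concreteness, the cg size in blocks that those -e values are being
compared against can be read straight out of the superblock.  This is
only a sketch using the standard field names from <ufs/ffs/fs.h>
(fs_fpg, fs_frag, and fs_maxbpg, the field that -e sets); it is not
the exact newfs arithmetic:

#include <sys/param.h>
#include <stdint.h>
#include <stdio.h>
#include <ufs/ufs/dinode.h>
#include <ufs/ffs/fs.h>

/*
 * Compare a file system's -e setting (stored as fs_maxbpg) with the
 * size of one cylinder group in full blocks.  fs_fpg is fragments per
 * cg and fs_frag is fragments per block, so fs_fpg / fs_frag is the
 * number of full blocks in a cg.
 */
static void
show_maxbpg(const struct fs *fs)
{
	int32_t blks_per_cg = fs->fs_fpg / fs->fs_frag;

	printf("cg size:        %d blocks (%jd bytes)\n", blks_per_cg,
	    (intmax_t)blks_per_cg * fs->fs_bsize);
	printf("-e (fs_maxbpg): %d blocks\n", fs->fs_maxbpg);
	if (fs->fs_maxbpg >= blks_per_cg)
		printf("-e is not below the cg size (roughly the \"try "
		    "too hard\" cases above)\n");
}

The settings that behaved best above are the ones where fs_maxbpg
stays below blks_per_cg.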
Since it's a new file system with defaults for most parameters, it has
the usual space/time tuning (-m 8 -o time), but normal space/time
tuning doesn't apply to huge files anyway since there are no normal
fragments.

>> ...
>>> size.  You should be able to create 2-4 CGs to span each of your 1TB
>>> drives without increasing the block size and thus minimum allocation unit.
>>
>> In theory it won't work.  From fs.h:
>> ...
>> Only offsets to the inode blocks, etc. are stored in the superblock.
>
> Yes, the offset to the cylinder group block and the offset to the inode
> block are in the superblock (struct fs).  It wouldn't be too difficult to
> tweak the ffs code to read in CGs larger than one block, by checking the
> difference between fs_iblkno and fs_cblkno.  I'm saying it's theoretically
> possible, although it will require tweaks in ffs code.  Again, I think it's
> worth investigating, especially if you believe there are performance
> penalties for having block sizes greater than the kernel buffer size.

But then it won't be binary compatible.  The performance penalties are
easier to fix (they should never have existed on 64-bit platforms).
(The offset arithmetic behind that fs_iblkno/fs_cblkno check is
sketched at the end of this mail.)

My main point here is that small cylinder groups alone are not a
problem for large files, provided they are not too small.  They cost a
few percent in the best cases.  In the worst cases, this loss is in
the noise.

Bruce
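For the fs_iblkno/fs_cblkno check mentioned above: the room reserved
for the cg block in each group is just the gap between those two
offsets, which are kept in fragments.  The following is only a sketch
of that arithmetic using the standard superblock fields; the actual
ffs changes (reading a cg with more than one buffer) are not shown:

#include <sys/param.h>
#include <stdint.h>
#include <stdio.h>
#include <ufs/ufs/dinode.h>
#include <ufs/ffs/fs.h>

/*
 * How much room a cylinder group block has before the inode blocks
 * start.  fs_cblkno and fs_iblkno are per-cg offsets in fragments,
 * so the gap times fs_fsize is the available space in bytes.  ffs
 * currently reads a cg with a single bread() of fs_cgsize bytes, so
 * fs_cgsize has to fit in fs_bsize.
 */
static void
show_cg_room(const struct fs *fs)
{
	int32_t cg_frags = fs->fs_iblkno - fs->fs_cblkno;
	intmax_t cg_room = (intmax_t)cg_frags * fs->fs_fsize;

	printf("room for the cg block: %d frags (%jd bytes)\n",
	    cg_frags, cg_room);
	printf("fs_cgsize: %d bytes, fs_bsize: %d bytes\n",
	    fs->fs_cgsize, fs->fs_bsize);
}

Allowing cgs larger than one block would then mean teaching the cg
read/write paths to use that gap instead of assuming fs_cgsize <=
fs_bsize, which is the tweak being discussed.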