Date: Thu, 27 Sep 2007 06:06:29 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: "Rick C. Petty"
Cc: freebsd-fs@freebsd.org
Subject: Re: Writing contigiously to UFS2?

On Wed, 26 Sep 2007, Rick C. Petty wrote:

> On Wed, Sep 26, 2007 at 05:59:24PM +1000, Bruce Evans wrote:
>> On Tue, 25 Sep 2007, Rick C. Petty wrote:
>>
>> That's insignificantly more.  Even doubling the size wouldn't make much
>> difference.  I see differences of at most 25% going the other way and
>
> Some would say that 25% difference is significant.  Obviously you disagree.

No, 25% is significant, but getting as much as 25% takes intentional
mistuning, combined with no attempt to optimize the mistuned case and
with general-case bugs that hurt the mistuned case more.

>> 4K blocks, 512-frags -e 512 (broken default): 40MB/S
>> 4K blocks, 512-frags -e 1024 (broken default): 44MB/S

er, the -e 1024 line should say "fixed default".

>> 4K blocks, 512-frags -e 2048 (best), kernel fixes: 47MB/S
>> 4K blocks, 512-frags -e 8192 (try too hard), kernel fixes
>>     (kernel fixes are not complete enough to handle this case;
>>     defaults and -e values which are < the cg size work best except
>>     possibly when the fixes are complete): 45MB/S
>> 16K blocks, 2K-frags -e 2K (broken default): 50MB/S
>> 16K blocks, 2K-frags -e 4K (fixed default): 50.5MB/S
>> 16K blocks, 2K-frags -e 8K (best): 51.5MB/S
>> 16K blocks, 2K-frags -e 64K (try too hard): < 51MB/S again

64K blocks, 8K-frags: -e barely matters; throughput is close to the
max at 52MB/S.

(I was able to create a perfectly contiguous 1GB file (modulo indirect
blocks, which were allocated as contiguously as possible) on a fs with
a cg size of almost 2GB.  A second file would not have been allocated
so well, since it would be started in the same cg as the directory
inode = the same cg as the first file.)

> Are you talking about throughputs now?  I was just talking about space.
> Time and space are usually mutually-exclusive optimizations.

These are all throughputs, starting with a new file system.
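For concreteness, the cg size in blocks that those -e values are being
compared against can be read straight out of the superblock.  This is
only a sketch using the standard field names from <ufs/ffs/fs.h>
(fs_fpg, fs_frag, and fs_maxbpg, the field that -e sets); it is not
the exact newfs arithmetic:

#include <sys/param.h>
#include <stdint.h>
#include <stdio.h>
#include <ufs/ufs/dinode.h>
#include <ufs/ffs/fs.h>

/*
 * Compare a file system's -e setting (stored as fs_maxbpg) with the
 * size of one cylinder group in full blocks.  fs_fpg is fragments per
 * cg and fs_frag is fragments per block, so fs_fpg / fs_frag is the
 * number of full blocks in a cg.
 */
static void
show_maxbpg(const struct fs *fs)
{
	int32_t blks_per_cg = fs->fs_fpg / fs->fs_frag;

	printf("cg size:        %d blocks (%jd bytes)\n", blks_per_cg,
	    (intmax_t)blks_per_cg * fs->fs_bsize);
	printf("-e (fs_maxbpg): %d blocks\n", fs->fs_maxbpg);
	if (fs->fs_maxbpg >= blks_per_cg)
		printf("-e is not below the cg size (roughly the \"try "
		    "too hard\" cases above)\n");
}

The settings that behaved best above are the ones where fs_maxbpg
stays below blks_per_cg.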
Since it's a new file system with defaults for most parameters, it has
the usual space/time tuning (-m 8 -o time), but normal space/time
tuning doesn't apply to huge files anyway since there are no normal
fragments.

>> ...
>>> size.  You should be able to create 2-4 CGs to span each of your 1TB
>>> drives without increasing the block size and thus minimum allocation unit.
>>
>> In theory it won't work.  From fs.h:
>> ...
>> Only offsets to the inode blocks, etc. are stored in the superblock.
>
> Yes, the offset to the cylinder group block and the offset to the inode
> block are in the superblock (struct fs).  It wouldn't be too difficult to
> tweak the ffs code to read in CGs larger than one block, by checking the
> difference between fs_iblkno and fs_cblkno.  I'm saying it's theoretically
> possible, although it will require tweaks in ffs code.  Again, I think it's
> worth investigating, especially if you believe there are performance
> penalties for having block sizes greater than the kernel buffer size.

But then it won't be binary compatible.  The performance penalties are
easier to fix (they should never have existed on 64-bit platforms).
(The offset arithmetic behind that fs_iblkno/fs_cblkno check is
sketched at the end of this mail.)

My main point here is that small cylinder groups alone are not a
problem for large files, provided they are not too small.  They cost a
few percent in the best cases.  In the worst cases, this loss is in
the noise.

Bruce
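For the fs_iblkno/fs_cblkno check mentioned above: the room reserved
for the cg block in each group is just the gap between those two
offsets, which are kept in fragments.  The following is only a sketch
of that arithmetic using the standard superblock fields; the actual
ffs changes (reading a cg with more than one buffer) are not shown:

#include <sys/param.h>
#include <stdint.h>
#include <stdio.h>
#include <ufs/ufs/dinode.h>
#include <ufs/ffs/fs.h>

/*
 * How much room a cylinder group block has before the inode blocks
 * start.  fs_cblkno and fs_iblkno are per-cg offsets in fragments,
 * so the gap times fs_fsize is the available space in bytes.  ffs
 * currently reads a cg with a single bread() of fs_cgsize bytes, so
 * fs_cgsize has to fit in fs_bsize.
 */
static void
show_cg_room(const struct fs *fs)
{
	int32_t cg_frags = fs->fs_iblkno - fs->fs_cblkno;
	intmax_t cg_room = (intmax_t)cg_frags * fs->fs_fsize;

	printf("room for the cg block: %d frags (%jd bytes)\n",
	    cg_frags, cg_room);
	printf("fs_cgsize: %d bytes, fs_bsize: %d bytes\n",
	    fs->fs_cgsize, fs->fs_bsize);
}

Allowing cgs larger than one block would then mean teaching the cg
read/write paths to use that gap instead of assuming fs_cgsize <=
fs_bsize, which is the tweak being discussed.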