From owner-freebsd-fs@FreeBSD.ORG Fri Sep 21 16:11:42 2007
Date: Sat, 22 Sep 2007 02:11:37 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Ivan Voras
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?
Message-ID: <20070921230757.Q43374@delplex.bde.org>
References: <46F3A64C.4090507@fluffles.net>

On Fri, 21 Sep 2007, Ivan Voras wrote:

> Fluffles wrote:
>
>> Even worse: data is being stored at weird locations, so that my energy
>> efficient NAS project becomes crippled. Even with the first 400GB of
>> data, it's storing that on the first 4 disks in my concat configuration,
>
>> In the past when testing geom_raid5 I've tried to tune newfs parameters
>> so that it would write contiguously but still there were regular 2-phase
>> writes which mean data was not written contiguously. I really dislike
>> this behavior.
>
> I agree, this is my least favorite aspect of UFS (maybe together with
> nonimplementation of extents), for various reasons. I feel it's time to
> start heavy lobbying for finishing FreeBSD's implementations of XFS and
> raiserfs :)

Why not improve the implementation of ffs?

Cylinder groups are fundamental to ffs, and I think having too many
too-small ones is fairly fundamental, but the allocation policy across
them can be anything, and large cylinder groups could be faked using
small ones.

The current (and very old?) allocation policy for extending files is to
consider allocating the block in a new cg when the current cg has more
than fs_maxbpg blocks allocated in it (the newfs and tunefs parameter
-e maxbpg; the default is 1/4 of the number of blocks in a cg, bpg).
Then preference is given to the next cg with more than the average
number of free blocks.  This seems to be buggy.  From ffs_blkpref_ufs1():

% 	if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) {
% 		if (lbn < NDADDR + NINDIR(fs)) {
% 			cg = ino_to_cg(fs, ip->i_number);
% 			return (cgbase(fs, cg) + fs->fs_frag);
% 		}

I think "indx" here is the index into an array of block pointers in the
inode or an indirect block.  So for extending large files it is always
an index into an indirect block, and it gets reset to 0 for each new
indirect block.  This makes its use in (indx % fs->fs_maxbpg == 0)
dubious.  The condition is satisfied whenever:

- indx == 0, i.e., always at the start of a new indirect block.  Not too
  bad, but not what we want if fs_maxbpg is much larger than the number
  of indexes in an indirect block.

- indx == a nonzero multiple of fs_maxbpg.  This condition is never
  satisfied if fs_maxbpg is larger than the number of indexes in an
  indirect block, which is the usual case for ffs2 (only 2K indexes in
  16K blocks, and fairly large cg's).  On an ffs1 fs that I have handy,
  maxbpg is 2K and the number of indexes is 4K, so this condition is
  satisfied once per indirect block (see the sketch below).
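To make the arithmetic concrete, here is a tiny userland sketch (not
kernel code).  The block size and the maxbpg default are the ones
discussed above; the bpg value is just an assumed round number, since
the real value depends on the cg layout chosen by newfs:

	#include <stdio.h>

	/*
	 * Userland sketch only: counts how often (indx % maxbpg == 0)
	 * can fire for nonzero indx inside one indirect block.
	 */
	int
	main(void)
	{
		int bsize = 16384;		/* default ffs2 block size */
		int nindir = bsize / 8;		/* 2K 64-bit pointers per indirect block */
		int bpg = 16384;		/* assumed data blocks per cg */
		int maxbpg = bpg / 4;		/* newfs -e default: bpg / 4 */
		int indx, hits = 0;

		/* indx restarts at 0 for every indirect block, so it stays < nindir. */
		for (indx = 1; indx < nindir; indx++)
			if (indx % maxbpg == 0)
				hits++;
		printf("nindir = %d, maxbpg = %d, nonzero hits = %d\n",
		    nindir, maxbpg, hits);
		return (0);
	}

With these numbers it prints "nonzero hits = 0", i.e., the modulus only
fires at indx == 0.  With the ffs1 example above (4K indexes, maxbpg =
2K) it would print 1.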
The (bap[indx - 1] == 0) condition causes a move to a new cg after every
hole.  This may help by leaving space to fill in the hole, but it is
wrong if the hole will never be filled in or is small.  This seems to be
just a vestige of code that implemented the old rotdelay pessimization.
Comments saying that we use fs_maxcontig near here are obviously
vestiges of the pessimization.

% 	/*
% 	 * Find a cylinder with greater than average number of
% 	 * unused data blocks.
% 	 */
% 	if (indx == 0 || bap[indx - 1] == 0)
% 		startcg =
% 		    ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg;

At the start of an indirect block, and after a hole, we don't know where
the previous block was, so we use the cg of the inode, advanced by the
estimate (lbn / fs->fs_maxbpg) of how far we have advanced from the
inode's cg.  I think this estimate is too primitive to work right even a
small fraction of the time.  Adjustment factors related to the number of
maxbpg's per block of indexes and the fullness of the disk seem to be
required.  Keeping track of the cg of the previous block would be better.

% 	else
% 		startcg = dtog(fs, bap[indx - 1]) + 1;

Now there is no problem finding the cg of the previous block.  Note that
we always add 1...

% 	startcg %= fs->fs_ncg;
% 	avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
% 	for (cg = startcg; cg < fs->fs_ncg; cg++)

... so the search gives maximal non-preference to the cg of the previous
block.  I think things would work much better if we considered the
current cg (the one containing the previous block), if any, first -- and
we actually know that cg.  This would be easy to try -- just don't add 1.
Also try not adding the bad estimate (lbn / fs->fs_maxbpg), so that the
search starts at the inode's cg in some cases -- then previous cg's will
be reconsidered, but hopefully the average limit will prevent them from
being used.

Note that in the calculation of avgbfree, division by ncg gives a
granularity of ncg, so there is an inertia of ncg blocks against moving
to the next cg.  A too-large ncg is a feature here.

BTW, I recently found the bug that broke the allocation policy in
FreeBSD's implementation of ext2fs.  I thought that the bug was missing
code/a too-simple implementation (one without a search like the above),
but it turned out to be just a bug.  The search wasn't set up right, so
the current cg was always preferred.  Always preferring the current cg
tends to give contiguous allocation of data blocks, and this works very
well for small file systems, but for large file systems the data blocks
end up too far away from inodes (since there is a limited number of
inodes per cg, and the per-cg inode and data block allocations fill up
at different rates).

Bruce
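P.S. A minimal, untested sketch of the "just don't add 1" change
suggested above, kept in the shape of the quoted ffs_blkpref_ufs1() code
(the search loop is elided, as in the quote); whether the
average-free-blocks limit alone is enough to push allocation out of a
nearly full cg would need testing:

	/*
	 * Untested sketch: start the search at the cg that actually holds
	 * the previous block (or at the inode's cg at the start of an
	 * indirect block / after a hole), instead of skipping past it,
	 * and let the average-free-blocks test decide whether to stay.
	 */
	if (indx == 0 || bap[indx - 1] == 0)
		startcg = ino_to_cg(fs, ip->i_number);	/* no lbn / fs_maxbpg estimate */
	else
		startcg = dtog(fs, bap[indx - 1]);	/* no "+ 1" */
	startcg %= fs->fs_ncg;
	avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
	for (cg = startcg; cg < fs->fs_ncg; cg++)
		...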