From owner-freebsd-fs@FreeBSD.ORG Fri Sep 21 16:11:42 2007
Date: Sat, 22 Sep 2007 02:11:37 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Ivan Voras
Cc: freebsd-fs@FreeBSD.org
Subject: Re: Writing contigiously to UFS2?
Message-ID: <20070921230757.Q43374@delplex.bde.org>
References: <46F3A64C.4090507@fluffles.net>

On Fri, 21 Sep 2007, Ivan Voras wrote:

> Fluffles wrote:
>
>> Even worse: data is being stored at weird locations, so that my energy
>> efficient NAS project becomes crippled. Even with the first 400GB of
>> data, it's storing that on the first 4 disks in my concat configuration,
>
>> In the past when testing geom_raid5 I've tried to tune newfs parameters
>> so that it would write contiguously but still there were regular 2-phase
>> writes which mean data was not written contiguously. I really dislike
>> this behavior.
>
> I agree, this is my least favorite aspect of UFS (maybe together with
> nonimplementation of extents), for various reasons. I feel it's time to
> start heavy lobbying for finishing FreeBSD's implementations of XFS and
> raiserfs :)

Why not improve the implementation of ffs?

Cylinder groups are fundamental to ffs, and I think having too many
too-small ones is fairly fundamental, but the allocation policy across
them can be anything, and large cylinder groups could be faked using
small ones.

The current (and very old?) allocation policy for extending files is to
consider allocating the block in a new cg when the current cg has more
than fs_maxbpg blocks allocated in it (the newfs and tunefs parameter
-e maxbpg; the default is 1/4 of the number of blocks in a cg, bpg).
Then preference is given to the next cg with more than the average
number of free blocks.  This seems to be buggy.  From ffs_blkpref_ufs1():

% 	if (indx % fs->fs_maxbpg == 0 || bap[indx - 1] == 0) {
% 		if (lbn < NDADDR + NINDIR(fs)) {
% 			cg = ino_to_cg(fs, ip->i_number);
% 			return (cgbase(fs, cg) + fs->fs_frag);
% 		}

I think "indx" here is the index into an array of block pointers in the
inode or an indirect block.  So for extending large files it is always
an index into an indirect block, and it gets reset to 0 for each new
indirect block.  This makes its use in (indx % fs->fs_maxbpg == 0)
dubious.  The condition is satisfied whenever:

- indx == 0, i.e., always at the start of a new indirect block.  Not too
  bad, but not what we want if fs_maxbpg is much larger than the number
  of indexes in an indirect block.

- indx == a nonzero multiple of fs_maxbpg.  This condition is never
  satisfied if fs_maxbpg is larger than the number of indexes in an
  indirect block, which is the usual case for ffs2 (only 2K indexes in
  16K blocks, and fairly large cg's).  On an ffs1 fs that I have handy,
  maxbpg is 2K and the number of indexes is 4K, so this condition is
  satisfied once per indirect block (see the sketch below).
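To make the arithmetic concrete, here is a tiny userland sketch (not
kernel code).  The block size and the maxbpg default are the ones
discussed above; the bpg value is just an assumed round number, since
the real value depends on the cg layout chosen by newfs:

	#include <stdio.h>

	/*
	 * Userland sketch only: counts how often (indx % maxbpg == 0)
	 * can fire for nonzero indx inside one indirect block.
	 */
	int
	main(void)
	{
		int bsize = 16384;		/* default ffs2 block size */
		int nindir = bsize / 8;		/* 2K 64-bit pointers per indirect block */
		int bpg = 16384;		/* assumed data blocks per cg */
		int maxbpg = bpg / 4;		/* newfs -e default: bpg / 4 */
		int indx, hits = 0;

		/* indx restarts at 0 for every indirect block, so it stays < nindir. */
		for (indx = 1; indx < nindir; indx++)
			if (indx % maxbpg == 0)
				hits++;
		printf("nindir = %d, maxbpg = %d, nonzero hits = %d\n",
		    nindir, maxbpg, hits);
		return (0);
	}

With these numbers it prints "nonzero hits = 0", i.e., the modulus only
fires at indx == 0.  With the ffs1 example above (4K indexes, maxbpg =
2K) it would print 1.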
The (bap[indx - 1] == 0) condition causes a move to a new cg after every
hole.  This may help by leaving space to fill in the hole, but it is
wrong if the hole will never be filled in or is small.  This seems to be
just a vestige of code that implemented the old rotdelay pessimization.
Comments saying that we use fs_maxcontig near here are obviously
vestiges of the pessimization.

% 	/*
% 	 * Find a cylinder with greater than average number of
% 	 * unused data blocks.
% 	 */
% 	if (indx == 0 || bap[indx - 1] == 0)
% 		startcg =
% 		    ino_to_cg(fs, ip->i_number) + lbn / fs->fs_maxbpg;

At the start of an indirect block, and after a hole, we don't know where
the previous block was, so we use the cg of the inode, advanced by the
estimate (lbn / fs->fs_maxbpg) of how far we have advanced from the
inode's cg.  I think this estimate is too primitive to work right even a
small fraction of the time.  Adjustment factors related to the number of
maxbpg's per block of indexes and the fullness of the disk seem to be
required.  Keeping track of the cg of the previous block would be better.

% 	else
% 		startcg = dtog(fs, bap[indx - 1]) + 1;

Now there is no problem finding the cg of the previous block.  Note that
we always add 1...

% 	startcg %= fs->fs_ncg;
% 	avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
% 	for (cg = startcg; cg < fs->fs_ncg; cg++)

... so the search gives maximal non-preference to the cg of the previous
block.  I think things would work much better if we considered the
current cg (the one containing the previous block), if any, first -- and
we actually know that cg.  This would be easy to try -- just don't add 1.
Also try not adding the bad estimate (lbn / fs->fs_maxbpg), so that the
search starts at the inode's cg in some cases -- then previous cg's will
be reconsidered, but hopefully the average limit will prevent them from
being used.

Note that in the calculation of avgbfree, division by ncg gives a
granularity of ncg, so there is an inertia of ncg blocks against moving
to the next cg.  A too-large ncg is a feature here.

BTW, I recently found the bug that broke the allocation policy in
FreeBSD's implementation of ext2fs.  I thought that the bug was missing
code/a too-simple implementation (one without a search like the above),
but it turned out to be just a bug.  The search wasn't set up right, so
the current cg was always preferred.  Always preferring the current cg
tends to give contiguous allocation of data blocks, and this works very
well for small file systems, but for large file systems the data blocks
end up too far away from inodes (since there is a limited number of
inodes per cg, and the per-cg inode and data block allocations fill up
at different rates).

Bruce
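P.S. A minimal, untested sketch of the "just don't add 1" change
suggested above, kept in the shape of the quoted ffs_blkpref_ufs1() code
(the search loop is elided, as in the quote); whether the
average-free-blocks limit alone is enough to push allocation out of a
nearly full cg would need testing:

	/*
	 * Untested sketch: start the search at the cg that actually holds
	 * the previous block (or at the inode's cg at the start of an
	 * indirect block / after a hole), instead of skipping past it,
	 * and let the average-free-blocks test decide whether to stay.
	 */
	if (indx == 0 || bap[indx - 1] == 0)
		startcg = ino_to_cg(fs, ip->i_number);	/* no lbn / fs_maxbpg estimate */
	else
		startcg = dtog(fs, bap[indx - 1]);	/* no "+ 1" */
	startcg %= fs->fs_ncg;
	avgbfree = fs->fs_cstotal.cs_nbfree / fs->fs_ncg;
	for (cg = startcg; cg < fs->fs_ncg; cg++)
		...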