From owner-freebsd-fs@FreeBSD.ORG Wed Sep 29 19:30:25 2010
From: John Baldwin
To: Brandon Gooch
Cc: freebsd-fs@freebsd.org
Date: Wed, 29 Sep 2010 15:30:13 -0400
Subject: Re: ext2fs now extremely slow
Message-Id: <201009291530.13434.jhb@freebsd.org>
References: <20100929031825.L683@besplex.bde.org> <201009290917.05269.jhb@freebsd.org>
List-Id: Filesystems

On Wednesday, September 29, 2010 2:50:16 pm Brandon Gooch wrote:
> On Wed, Sep 29, 2010 at 8:17
> AM, John Baldwin wrote:
> > On Wednesday, September 29, 2010 12:16:55 am Aditya Sarawgi wrote:
> >> On Wed, Sep 29, 2010 at 09:14:57AM +1000, Bruce Evans wrote:
> >> > On Wed, 29 Sep 2010, Bruce Evans wrote:
> >> >
> >> > > On Wed, 29 Sep 2010, Bruce Evans wrote:
> >> > >
> >> > >> For benchmarks on ext2fs:
> >> > >>
> >> > >> Under FreeBSD-~5.2 rerun today:
> >> > >> untar:     59.17 real
> >> > >> tar:       19.52 real
> >> > >>
> >> > >> Under -current run today:
> >> > >> untar:    101.16 real
> >> > >> tar:      172.03 real
> >> > >>
> >> > >> So, -current is 8.8 times slower for tar, but only 1.7 times slower for
> >> > >> untar.
> >> > >> ...
> >> > >> So it seems that only 1 block in every 8 is used, and there is a seek
> >> > >> after every block.  This asks for an 8-fold reduction in throughput,
> >> > >> and it seems to have got that and a bit more for reading although not
> >> > >> for writing.  Even (or especially) with perfect hardware, it must give
> >> > >> an 8-fold reduction.  And it is likely to give more, since it defeats
> >> > >> vfs clustering by making all runs of contiguous blocks have length 1.
> >> > >>
> >> > >> Simple sequential allocation should be used unless the allocation policy
> >> > >> and implementation are very good.
> >> > >
> >> > > This work a bit better after zapping the 8-fold way:
> >> > Things
> >> > > ...
> >> > > This gives an improvement of:
> >> > >
> >> > > untar:    101.16 real -> 63.46
> >> > > tar:      172.03 real -> 50.70
> >> > >
> >> > > Now -current is only 1.1 times slower for untar and 2.6 times slower for
> >> > > tar.
> >> > >
> >> > > There must be a problem with bpref for things to have been so bad.  There
> >> > > is some point to leaving a gap of 7 blocks for expansion, but the gap was
> >> > > left even between blocks in a single file.
> >> > > ...
> >> > > I haven't tried the bde_blkpref hack in the above.
> >> > > It should kill bpref
> >> > > completely so that there is no jump between lbn0 and lbn1, and break
> >> > > cylinder group based allocation even better.  Setting bde_blkpref to 1
> >> > > restores the bug that was present in ext2fs in FreeBSD between 1995 and
> >> > > 2010.  This bug gave sequential allocation starting at the beginning of
> >> > > the disk in almost all cases, so map searches were slow and early groups
> >> > > filled up before later groups were used at all.
> >> >
> >> > Tried this (patch repeated below), and it gave essentially the same
> >> > speed as old versions.
> >> >
> >> > The main problem seems to be that the `goal' variables aren't initialized.
> >> > After restoring bits verbatim from an old version, things seem to work as
> >> > expected:
> >> >
> >> > % Index: ext2_alloc.c
> >> > % ===================================================================
> >> > % RCS file: /home/ncvs/src/sys/fs/ext2fs/ext2_alloc.c,v
> >> > % retrieving revision 1.2
> >> > % diff -u -2 -r1.2 ext2_alloc.c
> >> > % --- ext2_alloc.c      1 Sep 2010 05:34:17 -0000       1.2
> >> > % +++ ext2_alloc.c      28 Sep 2010 21:08:42 -0000
> >> > % @@ -1,2 +1,5 @@
> >> > % +int bde_blkpref = 0;
> >> > % +int bde_alloc8 = 0;
> >> > % +
> >> > %  /*-
> >> > %   *  modified for Lites 1.1
> >> > % @@ -117,4 +120,8 @@
> >> > %                  ext2_alloccg);
> >> > %          if (bno > 0) {
> >> > % +                /* set next_alloc fields as done in block_getblk */
> >> > % +                ip->i_next_alloc_block = lbn;
> >> > % +                ip->i_next_alloc_goal = bno;
> >> > % +
> >> > %                  ip->i_blocks += btodb(fs->e2fs_bsize);
> >> > %                  ip->i_flag |= IN_CHANGE | IN_UPDATE;
> >> >
> >> > The only things that changed recently in this block were the 4 deleted
> >> > lines and 4 lines with tabs corrupted to spaces.  Perhaps an editing
> >> > error.
> >> >
> >> > % @@ -542,6 +549,12 @@
> >> > %             then set the goal to what we thought it should be
> >> > %          */
> >> > % +if (bde_blkpref == 0) {
> >> > %          if(ip->i_next_alloc_block == lbn && ip->i_next_alloc_goal != 0)
> >> > %                  return ip->i_next_alloc_goal;
> >> > % +} else if (bde_blkpref == 1) {
> >> > % +        if(ip->i_next_alloc_block == lbn)
> >> > % +                return ip->i_next_alloc_goal;
> >> > % +} else
> >> > % +        return 0;
> >> > %
> >> > %          /* now check whether we were provided with an array that basically
> >> >
> >> > Not needed now.
> >> >
> >> > % @@ -662,4 +675,5 @@
> >> > %           * block.
> >> > %           */
> >> > % +if (bde_alloc8 == 0) {
> >> > %          if (bpref)
> >> > %                  start = dtogd(fs, bpref) / NBBY;
> >> > % @@ -679,4 +693,5 @@
> >> > %                  }
> >> > %          }
> >> > % +}
> >> > %
> >> > %          bno = ext2_mapsearch(fs, bbp, bpref);
> >> >
> >> > The code to skip to the next 8-block boundary should be removed permanently.
> >> > After fixing the initialization, it doesn't generate holes inside files but
> >> > it still generates holes between files.  The holes are quite large with
> >> > 4K-blocks.
> >> >
> >> > Benchmark results with just the initialization of `goal' variables restored:
> >> >
> >> > %%%
> >> > ext2fs-1024-1024:
> >> > tarcp /f srcs:                  78.79 real   0.31 user   4.94 sys
> >> > tar cf /dev/zero srcs:          24.62 real   0.19 user   1.82 sys
> >> > ext2fs-1024-1024-as:
> >> > tarcp /f srcs:                  52.07 real   0.26 user   4.95 sys
> >> > tar cf /dev/zero srcs:          24.80 real   0.10 user   1.93 sys
> >> > ext2fs-4096-4096:
> >> > tarcp /f srcs:                  74.14 real   0.34 user   3.96 sys
> >> > tar cf /dev/zero srcs:          33.82 real   0.10 user   1.19 sys
> >> > ext2fs-4096-4096-as:
> >> > tarcp /f srcs:                  53.54 real   0.36 user   3.87 sys
> >> > tar cf /dev/zero srcs:          33.91 real   0.14 user   1.15 sys
> >> > %%%
> >> >
> >> > The much larger holes between the files are apparently responsible for the
> >> > decreased speed with 4K-blocks.  1K-blocks are really too small, so 4K-blocks
> >> > should be faster.
> >> >
> >> > Benchmark results with the fix and bde_alloc8 = 1.
> >> >
> >> > ext2fs-1024-1024:
> >> > tarcp /f srcs:                  71.60 real   0.15 user   2.04 sys
> >> > tar cf /dev/zero srcs:          22.34 real   0.05 user   0.79 sys
> >> > ext2fs-1024-1024-as:
> >> > tarcp /f srcs:                  46.03 real   0.14 user   2.02 sys
> >> > tar cf /dev/zero srcs:          21.97 real   0.05 user   0.80 sys
> >> > ext2fs-4096-4096:
> >> > tarcp /f srcs:                  59.66 real   0.13 user   1.63 sys
> >> > tar cf /dev/zero srcs:          19.88 real   0.07 user   0.46 sys
> >> > ext2fs-4096-4096-as:
> >> > tarcp /f srcs:                  37.30 real   0.12 user   1.60 sys
> >> > tar cf /dev/zero srcs:          19.93 real   0.05 user   0.49 sys
> >> >
> >> > Bruce
> >>
> >> Hi,
> >>
> >> I see what you are saying. The gap of 8 blocks between the files
> >> is due to the old preallocation, which used to allocate an additional
> >> 8 blocks in advance for a particular inode when allocating a block
> >> for it. The gap between blocks of the same file shouldn't be there
> >> either. Both of these cases should be removed. I will look into this
> >> during this week. The slowness is also due to the lack of preallocation
> >> in the new code.
> >
> > One of the GSoC students worked on a patch to add preallocation back to
> > ext2fs this summer.  Would you be interested in reviewing and/or testing
> > that patch?  (I've attached it).  Here is his original e-mail:
> >
> >
> > Hi all,
> >
> > Attached is a patch that implements a preallocation algorithm in
> > ext2fs.  I implemented this algorithm during FreeBSD SoC 2010.
> >
> > This patch implements the in-memory ext2/3 block preallocation algorithm
> > based on reservation windows.  It uses an RB-tree to index block
> > allocation requests and reserves a number of blocks for each file that
> > has requested a block.  When a file requests a block, it will
> > find a block to allocate to this file.
> > When it looks for the block to
> > allocate, it tries to pick a block that is in the same cylinder group
> > as the inode, is not inside another reservation window in the RB-tree,
> > and is followed by some contiguous free blocks.  It uses a data
> > structure to store this block's position and the length of the run of
> > contiguous free blocks, then inserts this data structure into the
> > RB-tree.  When the file requests a block again, it looks up the
> > corresponding data structure in the RB-tree.  If one is found, the next
> > free block is allocated to the file directly.  Otherwise, it searches
> > for a new block again.
> >
> > I have run some benchmarks to test this algorithm.  Please see the
> > results on the wiki page (http://wiki.freebsd.org/SOC2010ZhengLiu).
> > Performance is better when the number of threads is smaller than 4.
> > When the number of threads is greater than 4, performance improves
> > only a little.
> >
> > Please test it.
> >
> >
> > Thanks and best regards,
> >
> > lz
> >
>
> Wow, this is really awesome! What are the chances of this code being
> committed before a 9.0 release (assuming we have enough user testing)?

Good if it gets testing and review.  He also worked on read-only support
for ext4 (in a second patch).  Both patches were posted to this list (fs@)
several weeks ago.

-- 
John Baldwin
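
For anyone who wants a feel for the reservation-window idea in lz's mail before reading the patch, here is a minimal sketch in C.  It is an illustration only and not code from the GSoC patch: it swaps the RB-tree for a small fixed array with linear search, it models a single cylinder group as an array of flags, and every name in it (struct resv_win, struct cgroup, blk_alloc, WIN_LEN and so on) is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NBLOCKS 64  /* blocks in the toy cylinder group (hypothetical) */
#define MAX_WIN 8   /* maximum live reservation windows (hypothetical) */
#define WIN_LEN 8   /* blocks reserved per new window (hypothetical) */

/* One reservation window: a run of free blocks set aside for one inode. */
struct resv_win {
	uint32_t owner;  /* inode number holding the window */
	int32_t  start;  /* first block still reserved */
	int32_t  len;    /* blocks left in the window; 0 = slot unused */
};

struct cgroup {
	bool used[NBLOCKS];            /* true = block already allocated */
	struct resv_win win[MAX_WIN];  /* array stand-in for the RB-tree */
};

/* Return the live window owned by ino, or NULL if it has none. */
static struct resv_win *
win_lookup(struct cgroup *cg, uint32_t ino)
{
	for (size_t i = 0; i < MAX_WIN; i++)
		if (cg->win[i].len > 0 && cg->win[i].owner == ino)
			return (&cg->win[i]);
	return (NULL);
}

/* True if block b lies inside any live reservation window. */
static bool
blk_reserved(const struct cgroup *cg, int32_t b)
{
	for (size_t i = 0; i < MAX_WIN; i++) {
		const struct resv_win *w = &cg->win[i];

		if (w->len > 0 && b >= w->start && b < w->start + w->len)
			return (true);
	}
	return (false);
}

/* Reserve the first run of free, unreserved blocks (up to WIN_LEN) for ino. */
static struct resv_win *
win_create(struct cgroup *cg, uint32_t ino)
{
	struct resv_win *slot = NULL;

	for (size_t i = 0; i < MAX_WIN && slot == NULL; i++)
		if (cg->win[i].len == 0)
			slot = &cg->win[i];
	if (slot == NULL)
		return (NULL);

	for (int32_t b = 0; b < NBLOCKS; b++) {
		int32_t n = 0;

		/* Count contiguous usable blocks starting at b. */
		while (b + n < NBLOCKS && n < WIN_LEN &&
		    !cg->used[b + n] && !blk_reserved(cg, b + n))
			n++;
		if (n > 0) {
			slot->owner = ino;
			slot->start = b;
			slot->len = n;
			return (slot);
		}
	}
	return (NULL);
}

/* Allocate one block for ino; returns the block number, or -1 if none. */
static int32_t
blk_alloc(struct cgroup *cg, uint32_t ino)
{
	struct resv_win *w = win_lookup(cg, ino);
	int32_t b;

	if (w == NULL)
		w = win_create(cg, ino);
	if (w == NULL)
		return (-1);

	b = w->start;
	assert(!cg->used[b]);  /* windows never overlap allocated blocks */
	cg->used[b] = true;
	w->start++;
	w->len--;
	return (b);
}
```

Starting from an empty group, the first blk_alloc() for inode 1 returns block 0 and leaves a window over blocks 1-7; a following call for inode 2 returns block 8, and further calls for inodes 1 and 2 return 1 and 9.  Two files written in an interleaved fashion therefore still end up with contiguous runs, which is the property the 8-block skipping in the current allocator fails to give.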