Date:       Wed, 29 Sep 2010 03:51:58 +1000 (EST)
From:       Bruce Evans <brde@optusnet.com.au>
To:         fs@freebsd.org
Subject:    ext2fs now extremely slow
Message-ID: <20100929031825.L683@besplex.bde.org>

For benchmarks on ext2fs:

Under FreeBSD-~5.2 rerun today:
    untar:  59.17 real
    tar:    19.52 real

Under -current run today:
    untar: 101.16 real
    tar:   172.03 real

So, -current is 8.8 times slower for tar, but only 1.7 times slower for
untar.

FreeBSD-~5.2 is my version of the old FreeBSD-5.2-CURRENT, with significant
changes in ext2fs that make it a few percent faster (real) and a few percent
slower (sys) by using the BSD buffer cache instead of a private cache for
inodes.  I committed most of my changes to ext2fs, except for the ones that
made it slower.

More details: the untar benchmark copies about 400 MB of sources (a large
subset of /usr/src) to a freshly mkfs.ext2'd and mounted file system using
2 tars in a pipe (it ends up with 488828 1K-blocks used on an ext2fs with
4K-blocks).  The source is supposed to be cached, so that the untar is
almost from memory.  The tar benchmark unmounts the file system, remounts
it, and tars up its contents to /dev/zero.  This benchmark was originally
mainly for finding fs layout problems.  In fact, it was originally for
figuring out why ext2fs was faster than ffs in 1997 (*).  Since the tar part
is not much affected by caching, its results are much easier to reproduce
than those of the untar part.  Slowness in it normally means that the fs
layout is bad, and that shouldn't happen for a freshly laid out file system.

(*) This turned out to be because the ext2fs layout policy was completely
broken (essentially sequential, ignoring cylinder groups), but this was
actually an optimization for the relatively small file sets tested by the
benchmark (even smaller then), when combined with the lack of caching in my
disk drive -- the drive was very slow for even small seeks, and the broken
allocation policy accidentally avoided lots of small seeks, while ffs's
fancier policy tends to generate too many of them.
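In outline, the two steps look something like this (a sketch only: the
device, the mount point /f and the srcs tree here are placeholders, not the
actual benchmark script, which also varies the block/frag sizes and the
async option):

%%%
# Fresh file system, 4K blocks (frag size follows the block size);
# add -o async to the mount for the -as runs.
mkfs.ext2 -b 4096 /dev/ad4s5
mount -t ext2fs /dev/ad4s5 /f

# "untar": from the directory holding the (cached) srcs tree,
# copy it to /f using 2 tars in a pipe.
time sh -c 'tar cf - srcs | (cd /f && tar xf -)'

# "tar": unmount, remount, then tar the contents to /dev/zero.
umount /f
mount -t ext2fs /dev/ad4s5 /f
time sh -c 'cd /f && tar cf /dev/zero srcs'
%%%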
Rawer results, with all the relevant fs parameters:

FreeBSD-~5.2:
%%%
ext2fs-1024-1024:
tarcp /f srcs:              68.85 real         0.35 user         7.15 sys
tar cf /dev/zero srcs:      22.36 real         0.15 user         4.90 sys
ext2fs-1024-1024-as:
tarcp /f srcs:              46.00 real         0.27 user         6.23 sys
tar cf /dev/zero srcs:      22.89 real         0.08 user         4.94 sys
ext2fs-4096-4096:
tarcp /f srcs:              59.17 real         0.22 user         5.89 sys
tar cf /dev/zero srcs:      19.52 real         0.12 user         2.13 sys
ext2fs-4096-4096-as:
tarcp /f srcs:              37.73 real         0.22 user         4.94 sys
tar cf /dev/zero srcs:      19.40 real         0.19 user         2.05 sys
%%%

ext2fs-1024-1024 means ext2fs with 1024-byte blocks and 1024-byte frags, and
the -as suffix means an async mount, etc.  tarcp is 2 tars in a pipe (the
untar step).

FreeBSD-current:
%%%
ext2fs-1024-1024:
tarcp /f srcs:             130.18 real         0.26 user         6.39 sys
tar cf /dev/zero srcs:      73.90 real         0.15 user         2.30 sys
ext2fs-1024-1024-as:
tarcp /f srcs:              98.22 real         0.30 user         6.38 sys
tar cf /dev/zero srcs:      70.36 real         0.13 user         2.29 sys
ext2fs-4096-4096:
tarcp /f srcs:             101.16 real         0.33 user         5.04 sys
tar cf /dev/zero srcs:     172.03 real         0.13 user         1.26 sys
ext2fs-4096-4096-as:
tarcp /f srcs:              78.23 real         0.21 user         5.09 sys
tar cf /dev/zero srcs:     147.87 real         0.15 user         1.23 sys
%%%

The benchmark also prints the i/o counts using mount -v.  This is broken in
-current, so it is not easy to see if there are too many i/o's.

I guess the problem is mainly a bad layout policy, since the efficiency of
the tar step doesn't depend on much except the layout.  Testing under ~5.2
confirms this: for the file system left at the end of the above run, but
tarred up by ~5.2 after a reboot:

%%%
tar cf /dev/zero srcs:     151.88 real         0.14 user         2.30 sys
%%%

So -current is actually 1.03 times faster, not 8.8 times slower, for
tar :-/.

dumpe2fs seems to show a bizarre layout:

% Filesystem volume name:
% Last mounted on:
% Filesystem UUID:          a792ae57-2438-4e78-bad6-4ef939fde0df
% Filesystem magic number:  0xEF53
% Filesystem revision #:    1 (dynamic)
% Filesystem features:      filetype sparse_super
% Default mount options:    (none)
% Filesystem state:         not clean
% Errors behavior:          Continue
% Filesystem OS type:       unknown
% Inode count:              1531072
% Block count:              3058374
% Reserved block count:     152918
% Free blocks:              2888113
% Free inodes:              1498688
% First block:              0
% Block size:               4096
% Fragment size:            4096
% Blocks per group:         32768
% Fragments per group:      32768
% Inodes per group:         16288
% Inode blocks per group:   509
% Filesystem created:       Wed Sep 29 02:16:32 2010
% Last mount time:          n/a
% Last write time:          Wed Sep 29 03:15:24 2010
% Mount count:              0
% Maximum mount count:      28
% Last checked:             Wed Sep 29 02:16:32 2010
% Check interval:           15552000 (6 months)
% Next check after:         Mon Mar 28 03:16:32 2011
% Reserved blocks uid:      0 (user root)
% Reserved blocks gid:      0 (group wheel)
% First inode:              11
% Inode size:               128
% Default directory hash:   tea
% Directory Hash Seed:      036f029e-7924-4a73-91ec-730fd18e832d
%
%
% Group 0: (Blocks 0-32767)
%   Primary superblock at 0, Group descriptors at 1-1
%   Block bitmap at 2 (+2), Inode bitmap at 3 (+3)
%   Inode table at 4-512 (+4)
%   0 free blocks, 16277 free inodes, 2 directories
%   Free blocks:
%   Free inodes: 12-16288
% Group 1: (Blocks 32768-65535)
%   Backup superblock at 32768, Group descriptors at 32769-32769
%   Block bitmap at 32770 (+2), Inode bitmap at 32771 (+3)
%   Inode table at 32772-33280 (+4)
%   0 free blocks, 16288 free inodes, 0 directories
%   Free blocks:
%   Free inodes: 16289-32576
% Group 2: (Blocks 65536-98303)
%   Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1)
%   Inode table at 65538-66046 (+2)
%   32257 free blocks, 16288 free inodes, 0 directories
%   Free blocks: 66047-98303
%   Free inodes: 32577-48864
% Group 3: (Blocks 98304-131071)
%   Backup superblock at 98304, Group descriptors at 98305-98305
%   Block bitmap at 98306 (+2), Inode bitmap at 98307 (+3)
%   Inode table at 98308-98816 (+4)
%   6882 free blocks, 16288 free inodes, 0 directories
%   Free blocks: 123207, 123209-123215, 123217-123223, 123225-123231, 123233-123239, 123241-123247, ...

The last line was about 15000 characters long, and seems to have the
following pattern, except for the first free block:

    1 block used (123208)
    7 blocks free (123209-123215)
    1 block used (123216)
    7 blocks free (123217-123223)
    1 block used
    ...
    7 blocks free
    ...

So it seems that only 1 block in every 8 is used, and there is a seek after
every block.
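Rather than eyeballing the 15000-character line, something like the
following (the device name is a placeholder) can be used to histogram the
lengths of the free runs that dumpe2fs reports:

%%%
# Count the lengths of the free-block runs listed by dumpe2fs.  Only the
# indented per-group "Free blocks:" lists are wanted, not the superblock
# summary line; mostly-empty groups will of course contribute a few very
# long runs.
dumpe2fs /dev/ad4s5 2>/dev/null |
    sed -n 's/^  Free blocks: //p' |
    tr ', ' '\n\n' |
    awk -F- 'NF { len = (NF == 2 ? $2 - $1 + 1 : 1); count[len]++ }
        END { for (len in count) print count[len], "free runs of length", len }'
%%%

For a layout like the above, the groups holding the data should contribute
almost nothing but free runs of length 7, i.e., a single used block between
every pair of free runs.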
This asks for an 8-fold reduction in throughput, and it seems to have got
that and a bit more for reading, although not for writing.  Even (or
especially) with perfect hardware, it must give an 8-fold reduction.  And it
is likely to give more, since it defeats vfs clustering by making all runs
of contiguous blocks have length 1.  Simple sequential allocation should be
used unless the allocation policy and implementation are very good.

Bruce