Date:       Wed, 29 Sep 2010 03:51:58 +1000 (EST)
From:       Bruce Evans <brde@optusnet.com.au>
To:         fs@freebsd.org
Subject:    ext2fs now extremely slow
Message-ID: <20100929031825.L683@besplex.bde.org>

For benchmarks on ext2fs:

Under FreeBSD-~5.2 rerun today:
    untar:  59.17 real
    tar:    19.52 real

Under -current run today:
    untar: 101.16 real
    tar:   172.03 real

So, -current is 8.8 times slower for tar, but only 1.7 times slower for
untar.

FreeBSD-~5.2 is my version of the old FreeBSD-5.2-CURRENT, with significant
changes in ext2fs that make it a few percent faster (real) and a few percent
slower (sys) by using the BSD buffer cache instead of a private cache for
inodes.  I committed most of my changes to ext2fs, except for the ones that
made it slower.

More details: the untar benchmark copies about 400 MB of sources (a large
subset of /usr/src) to a freshly mkfs.ext2'd and mounted file system using
2 tars in a pipe (it ends up with 488828 1K-blocks used on an ext2fs with
4K-blocks).  The source is supposed to be cached, so that the untar is
almost from memory.  The tar benchmark unmounts the file system, remounts
it, and tars up its contents to /dev/zero.  This benchmark was originally
mainly for finding fs layout problems.  In fact, it was originally for
figuring out why ext2fs was faster than ffs in 1997 (*).  Since the tar part
is not much affected by caching, its results are much easier to reproduce
than those of the untar part.  Slowness in it normally means that the fs
layout is bad, and that shouldn't happen for a freshly laid out file system.

(*) This turned out to be because the ext2fs layout policy was completely
broken (essentially sequential, ignoring cylinder groups), but this was
actually an optimization for the relatively small file sets tested by the
benchmark (even smaller then), when combined with the lack of caching in my
disk drive -- the drive was very slow for even small seeks, and the broken
allocation policy accidentally avoided lots of small seeks, while ffs's
fancier policy tends to generate too many of them.
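In outline, the two steps look something like this (a sketch only: the
device, the mount point /f and the srcs tree here are placeholders, not the
actual benchmark script, which also varies the block/frag sizes and the
async option):

%%%
# Fresh file system, 4K blocks (frag size follows the block size);
# add -o async to the mount for the -as runs.
mkfs.ext2 -b 4096 /dev/ad4s5
mount -t ext2fs /dev/ad4s5 /f

# "untar": from the directory holding the (cached) srcs tree,
# copy it to /f using 2 tars in a pipe.
time sh -c 'tar cf - srcs | (cd /f && tar xf -)'

# "tar": unmount, remount, then tar the contents to /dev/zero.
umount /f
mount -t ext2fs /dev/ad4s5 /f
time sh -c 'cd /f && tar cf /dev/zero srcs'
%%%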
Rawer results, with all the relevant fs parameters:

FreeBSD-~5.2:
%%%
ext2fs-1024-1024:
tarcp /f srcs:              68.85 real         0.35 user         7.15 sys
tar cf /dev/zero srcs:      22.36 real         0.15 user         4.90 sys
ext2fs-1024-1024-as:
tarcp /f srcs:              46.00 real         0.27 user         6.23 sys
tar cf /dev/zero srcs:      22.89 real         0.08 user         4.94 sys
ext2fs-4096-4096:
tarcp /f srcs:              59.17 real         0.22 user         5.89 sys
tar cf /dev/zero srcs:      19.52 real         0.12 user         2.13 sys
ext2fs-4096-4096-as:
tarcp /f srcs:              37.73 real         0.22 user         4.94 sys
tar cf /dev/zero srcs:      19.40 real         0.19 user         2.05 sys
%%%

ext2fs-1024-1024 means ext2fs with 1024-byte blocks and 1024-byte frags, and
the -as suffix means an async mount, etc.  tarcp is 2 tars in a pipe (the
untar step).

FreeBSD-current:
%%%
ext2fs-1024-1024:
tarcp /f srcs:             130.18 real         0.26 user         6.39 sys
tar cf /dev/zero srcs:      73.90 real         0.15 user         2.30 sys
ext2fs-1024-1024-as:
tarcp /f srcs:              98.22 real         0.30 user         6.38 sys
tar cf /dev/zero srcs:      70.36 real         0.13 user         2.29 sys
ext2fs-4096-4096:
tarcp /f srcs:             101.16 real         0.33 user         5.04 sys
tar cf /dev/zero srcs:     172.03 real         0.13 user         1.26 sys
ext2fs-4096-4096-as:
tarcp /f srcs:              78.23 real         0.21 user         5.09 sys
tar cf /dev/zero srcs:     147.87 real         0.15 user         1.23 sys
%%%

The benchmark also prints the i/o counts using mount -v.  This is broken in
-current, so it is not easy to see if there are too many i/o's.

I guess the problem is mainly a bad layout policy, since the efficiency of
the tar step doesn't depend on much except the layout.  Testing under ~5.2
confirms this: for the file system left at the end of the above run, but
tarred up by ~5.2 after a reboot:

%%%
tar cf /dev/zero srcs:     151.88 real         0.14 user         2.30 sys
%%%

So -current is actually 1.03 times faster, not 8.8 times slower, for
tar :-/.

dumpe2fs seems to show a bizarre layout:

% Filesystem volume name:
% Last mounted on:
% Filesystem UUID:          a792ae57-2438-4e78-bad6-4ef939fde0df
% Filesystem magic number:  0xEF53
% Filesystem revision #:    1 (dynamic)
% Filesystem features:      filetype sparse_super
% Default mount options:    (none)
% Filesystem state:         not clean
% Errors behavior:          Continue
% Filesystem OS type:       unknown
% Inode count:              1531072
% Block count:              3058374
% Reserved block count:     152918
% Free blocks:              2888113
% Free inodes:              1498688
% First block:              0
% Block size:               4096
% Fragment size:            4096
% Blocks per group:         32768
% Fragments per group:      32768
% Inodes per group:         16288
% Inode blocks per group:   509
% Filesystem created:       Wed Sep 29 02:16:32 2010
% Last mount time:          n/a
% Last write time:          Wed Sep 29 03:15:24 2010
% Mount count:              0
% Maximum mount count:      28
% Last checked:             Wed Sep 29 02:16:32 2010
% Check interval:           15552000 (6 months)
% Next check after:         Mon Mar 28 03:16:32 2011
% Reserved blocks uid:      0 (user root)
% Reserved blocks gid:      0 (group wheel)
% First inode:              11
% Inode size:               128
% Default directory hash:   tea
% Directory Hash Seed:      036f029e-7924-4a73-91ec-730fd18e832d
%
%
% Group 0: (Blocks 0-32767)
%   Primary superblock at 0, Group descriptors at 1-1
%   Block bitmap at 2 (+2), Inode bitmap at 3 (+3)
%   Inode table at 4-512 (+4)
%   0 free blocks, 16277 free inodes, 2 directories
%   Free blocks:
%   Free inodes: 12-16288
% Group 1: (Blocks 32768-65535)
%   Backup superblock at 32768, Group descriptors at 32769-32769
%   Block bitmap at 32770 (+2), Inode bitmap at 32771 (+3)
%   Inode table at 32772-33280 (+4)
%   0 free blocks, 16288 free inodes, 0 directories
%   Free blocks:
%   Free inodes: 16289-32576
% Group 2: (Blocks 65536-98303)
%   Block bitmap at 65536 (+0), Inode bitmap at 65537 (+1)
%   Inode table at 65538-66046 (+2)
%   32257 free blocks, 16288 free inodes, 0 directories
%   Free blocks: 66047-98303
%   Free inodes: 32577-48864
% Group 3: (Blocks 98304-131071)
%   Backup superblock at 98304, Group descriptors at 98305-98305
%   Block bitmap at 98306 (+2), Inode bitmap at 98307 (+3)
%   Inode table at 98308-98816 (+4)
%   6882 free blocks, 16288 free inodes, 0 directories
%   Free blocks: 123207, 123209-123215, 123217-123223, 123225-123231, 123233-123239, 123241-123247, ...

The last line was about 15000 characters long, and seems to have the
following pattern, except for the first free block:

    1 block used (123208)
    7 blocks free (123209-123215)
    1 block used (123216)
    7 blocks free (123217-123223)
    1 block used
    ...
    7 blocks free
    ...

So it seems that only 1 block in every 8 is used, and there is a seek after
every block.
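Rather than eyeballing the 15000-character line, something like the
following (the device name is a placeholder) can be used to histogram the
lengths of the free runs that dumpe2fs reports:

%%%
# Count the lengths of the free-block runs listed by dumpe2fs.  Only the
# indented per-group "Free blocks:" lists are wanted, not the superblock
# summary line; mostly-empty groups will of course contribute a few very
# long runs.
dumpe2fs /dev/ad4s5 2>/dev/null |
    sed -n 's/^  Free blocks: //p' |
    tr ', ' '\n\n' |
    awk -F- 'NF { len = (NF == 2 ? $2 - $1 + 1 : 1); count[len]++ }
        END { for (len in count) print count[len], "free runs of length", len }'
%%%

For a layout like the above, the groups holding the data should contribute
almost nothing but free runs of length 7, i.e., a single used block between
every pair of free runs.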
This asks for an 8-fold reduction in throughput, and it seems to have got
that and a bit more for reading, although not for writing.  Even (or
especially) with perfect hardware, it must give an 8-fold reduction.  And it
is likely to give more, since it defeats vfs clustering by making all runs
of contiguous blocks have length 1.  Simple sequential allocation should be
used unless the allocation policy and implementation are very good.

Bruce