From owner-freebsd-performance@FreeBSD.ORG Sun Oct 18 04:44:44 2009
Date: Sat, 17 Oct 2009 21:40:38 PDT
From: Dieter
To: freebsd-performance@freebsd.org
In-reply-to: Your message of "Tue, 06 Oct 2009 18:03:16 +1100." <20091006174121.V25604@delplex.bde.org>
Message-Id: <200910180440.EAA21373@sopwith.solgatos.com>
Subject: Re: tuning FFS for large files Re: A specific example of a disk i/o problem

> > I found a clue!  The problem occurs with my big data partitions,
> > which are newfs-ed with options intended to improve things.
> >
> > Reading a large file from the normal ad4s5b partition only delays other
> > commands slightly, as expected.  Reading a large file from the tuned
> > ad4s11 partition yields the delay of minutes for other i/o.
> > ...
> > Here is the newfs command used for creating large data partitions:
> > newfs -e 57984 -b 65536 -f 8192 -g 67108864 -h 16 -i 67108864 -U -o time $partition
>
> Any block size above the default (16K) tends to thrash and fragment buffer
> cache virtual memory.  This is obviously a good pessimization with lots of
> small files, and apparently, as you have found, it is a good pessimization
> with a few large files too.  I think severe fragmentation can easily take
> several seconds to recover from.  The worst case for causing fragmentation
> is probably a mixture of small and large files.

Is there any way to avoid the "thrash and fragment buffer cache virtual
memory" problem other than keeping the block size at 16K or smaller?

> Some users fear fs consistency bugs with block sizes >= 16K, but I've never
> seen them cause any bugs except performance ones.

Yep, many TB of files on filesystems created with the above newfs command
and no corruption/consistency problems.

> > And they have way more inodes than needed.  (IIRC it doesn't actually
> > use -i 67108864)
>
> It has to have at least 1 inode per cg, and may as well have a full block
> of them, which gives a fairly large number of inodes especially if the
> block size is too large (64K), so the -i ratio is limited.

I converted a few filesystems to the default.  In addition to losing space,
the fsck time went through the roof.  So it's back to playing with newfs
options.  Larger block/frag sizes allow fewer cylinder groups (apparently
because each group's bookkeeping has to fit in a single filesystem block, so
bigger blocks mean bigger and therefore fewer groups), which reduces the
number of inodes more than the larger block size increases it.  From my
reading of the newfs man page, -c only allows making cylinder groups
smaller, not larger, and that appears to be the case in practice.
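To make the cylinder group / inode relationship concrete, here is a rough
sanity check using the numbers from the newfs runs below (it ignores the
couple of reserved inodes, so the totals are off by a hair):

    # total inodes ~= cylinder groups * inodes per group
    echo '2348 * 23552' | bc    # default layout: 55300096, i.e. ~55 million inodes
    echo '27 * 512' | bc        # -b 65536 -f 65536 layout: 13824 inodes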
default:

newfs -U /dev/ad14s4
/dev/ad14s4: 431252.6MB (883205320 sectors) block size 16384, fragment size 2048
        using 2348 cylinder groups of 183.72MB, 11758 blks, 23552 inodes.

Filesystem  1M-blocks Used  Avail Capacity iused    ifree %iused  Mounted on
/dev/ad14s4    417678    0 384263     0%       2 55300092    0%

fsck -fp: real 0m37.165s

Attempt to reduce number of inodes:

newfs -U -i 134217728 -g 134217728 -h 16 -e 261129 /dev/ad14s4
density reduced from 134217728 to 3676160
/dev/ad14s4: 431252.6MB (883205320 sectors) block size 16384, fragment size 2048
        using 1923 cylinder groups of 224.38MB, 14360 blks, 64 inodes.

Filesystem  1M-blocks Used  Avail Capacity iused  ifree %iused  Mounted on
/dev/ad14s4    431162    0 396669     0%       2 123068    0%

fsck -fp: real 0m32.687s

Bigger block size:

newfs -U -i 134217728 -g 134217728 -h 16 -e 261129 -b 65536 /dev/ad14s4
increasing fragment size from 2048 to block size / 8 (8192)
density reduced from 134217728 to 14860288
/dev/ad14s4: 431252.6MB (883205312 sectors) block size 65536, fragment size 8192
        using 119 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.

Filesystem  1M-blocks Used  Avail Capacity iused ifree %iused  Mounted on
/dev/ad14s4    431230    0 396731     0%       2 30460    0%

fsck -fp: real 0m3.144s

Bigger block size and bigger frag size:

newfs -U -i 134217728 -g 134217728 -h 16 -e 261129 -b 65536 -f 65536 /dev/ad14s4
density reduced from 134217728 to 66846720
/dev/ad14s4: 431252.6MB (883205248 sectors) block size 65536, fragment size 65536
        using 27 cylinder groups of 16320.56MB, 261129 blks, 512 inodes.

Filesystem  1M-blocks Used  Avail Capacity iused ifree %iused  Mounted on
/dev/ad14s4    431245    0 396745     0%       2 13820    0%

fsck -fp: real 0m0.369s

With -b 65536 -f 65536 I'm finally approaching a reasonable number of inodes
(even fewer would be better).  The fsck time varies by a factor of more than
100, and the results are roughly similar on filesystems with files in them.
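In case anyone wants to repeat the experiment, a minimal sketch of how the
comparison above could be scripted (device name and option sets are just the
examples from above; the usual warning applies, newfs destroys whatever is on
the partition):

    #!/bin/sh
    # re-create the filesystem with a few different layouts and time an
    # empty-filesystem fsck on each
    DEV=/dev/ad14s4                  # scratch partition, example only
    for opts in "" "-b 65536" "-b 65536 -f 65536"; do
        newfs -U -i 134217728 -g 134217728 -h 16 -e 261129 $opts $DEV
        time fsck -fp $DEV
    done

Running dumpfs(8) on the result shows what newfs actually settled on, which
is handy given how it silently adjusts -i and -f as seen above.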