From owner-freebsd-fs@FreeBSD.ORG  Tue May 29 09:41:44 2012
Return-Path: <owner-freebsd-fs@FreeBSD.ORG>
Delivered-To: freebsd-fs@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E3C1A106566C;
	Tue, 29 May 2012 09:41:44 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail02.syd.optusnet.com.au (mail02.syd.optusnet.com.au
	[211.29.132.183])
	by mx1.freebsd.org (Postfix) with ESMTP id 0A0A68FC08;
	Tue, 29 May 2012 09:41:43 +0000 (UTC)
Received: from c122-106-171-232.carlnfd1.nsw.optusnet.com.au
	(c122-106-171-232.carlnfd1.nsw.optusnet.com.au [122.106.171.232])
	by mail02.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	q4T9fctV003703
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Tue, 29 May 2012 19:41:40 +1000
Date: Tue, 29 May 2012 19:41:38 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@besplex.bde.org
To: Don Lewis <truckman@FreeBSD.org>
In-Reply-To: <201205290806.q4T86K8M007099@gw.catspoiler.org>
Message-ID: <20120529182711.Y1436@besplex.bde.org>
References: <201205290806.q4T86K8M007099@gw.catspoiler.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: freebsd-fs@FreeBSD.org, dougb@FreeBSD.org
Subject: Re: Millions of small files: best filesystem / best options
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Filesystems <freebsd-fs.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-fs>
List-Post: <mailto:freebsd-fs@freebsd.org>
List-Help: <mailto:freebsd-fs-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-fs>,
	<mailto:freebsd-fs-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 29 May 2012 09:41:45 -0000

On Tue, 29 May 2012, Don Lewis wrote:

> On 29 May, Bruce Evans wrote:
>> On Mon, 28 May 2012, Doug Barton wrote:
>>
>>> On 5/28/2012 10:01 AM, Alessio Focardi wrote:
>>>> So in my case I would have to use -b 4096 -f 512
>>>>
>>>> It's an improvement, but still is not ideal: still a big waste with 200 bytes files!
>>>
>>> Are all of the files exactly 200 bytes? If so that's likely the best you
>>> can do.
>>
>> It is easy to do better by using a file system that supports small block
>> sizes.  This might be slow, but it reduces the wastage.  Possible file
>> systems:
>
>> - it is easy to fix ffs to support a minimum block size of 512 (by
>>    reducing its gratuitous limit of MINBSIZE and fixing the few things
>>    that break:

I realized just after writing this that it doesn't save much space.
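
A rough sketch of why, assuming the space charged to a small file is just
its size rounded up to whole fragments (inode and directory overhead are
ignored here, and the helper names are made up for illustration).  With
-b 4096 -f 512 the fragment size is already 512, so a 512-byte block size
presumably rounds a 200-byte file up to the same 512 bytes:

```python
def allocated_bytes(file_size, fsize):
    """Data bytes used when file_size is rounded up to whole fragments.

    Assumption: only data fragments are counted; inodes and directory
    entries are ignored.
    """
    frags = -(-file_size // fsize)  # ceiling division
    return frags * fsize


def slack(file_size, fsize):
    """Bytes allocated but not used by the file's data."""
    return allocated_bytes(file_size, fsize) - file_size


if __name__ == "__main__":
    # Compare a few fragment sizes for the 200-byte files in this thread.
    for fsize in (512, 1024, 4096):
        print(fsize, allocated_bytes(200, fsize), slack(200, fsize))
```

Under that assumption the remaining waste is set by the 512-byte minimum
allocation unit, not by the block size.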

> That shouldn't be necessary, especially if you newfs with the "-o space"
> option to force the fragments for multiple files to be allocated out of
> the same block right from the start instead of waiting to do this once
> the filesystem starts getting full.

But this may pessimize the allocation even further.  Even without -o space,
IIRC ffs likes to fill in fragments.  It does this even on nearly empty
file systems.  This tends to give backwards seeks, which drive caches
might not handle very well (FreeBSD caches don't even attempt to cache
nearby blocks in other files, so for packed small files FreeBSD depends
on drive caches for the i/o performance to not be too bad).  For example,
according to my version of prtblknos:

---
% fs_bsize = 8192
% fs_fsize = 1024
% 4:	lbn 0 blkno 41
% 5:	lbn 0 blkno 42-45
% 6:	lbn 0 blkno 64-71
% 7:	lbn 0 blkno 46

4: is the inode number of ".".  Its data is allocated in the single blkno
41.  These blknos are in ffs allocation units (fragments of size fs_fsize
= 1024).  Note that 41 is not a multiple of 8.  It is the second fragment
of the ffs block consisting of fragments with blkno's 40-47.  Blkno 40
is the first fragment of this block.  It is allocated somewhere in "..".

After creating ".", I created a 4K file.  This has inode 5, and is
allocated in the 4 fragments after blkno 41.

Then I created an 8K file.  This has inode 6.  Since its size is >= the
block size, it is allocated in the full ffs block consisting of the 8
fragments with blkno's 64-71.

Then I created a 512 byte file.  This has inode 7.  ffs "seeks back" and
allocates it in the next free fragment (#46) in the full block 40-47.
---
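
The fragment arithmetic in the example above can be checked with a small
sketch.  It assumes that full blocks start at blknos that are multiples of
fs_bsize / fs_fsize (8 here), which is what the 40-47 and 64-71 ranges
suggest; the function names are hypothetical and are not part of
prtblknos or ffs:

```python
FS_BSIZE = 8192
FS_FSIZE = 1024
FRAGS_PER_BLOCK = FS_BSIZE // FS_FSIZE  # 8 fragments per full block


def containing_block(blkno):
    """Return (first, last) blkno of the full block that holds blkno.

    Assumption: full blocks are aligned to multiples of FRAGS_PER_BLOCK.
    """
    first = blkno - blkno % FRAGS_PER_BLOCK
    return first, first + FRAGS_PER_BLOCK - 1


def frag_index(blkno):
    """Zero-based position of blkno within its full block."""
    return blkno % FRAGS_PER_BLOCK


def frags_needed(size):
    """Number of fragments needed to hold size bytes."""
    return -(-size // FS_FSIZE)  # ceiling division


if __name__ == "__main__":
    print(containing_block(41), frag_index(41))  # "." at blkno 41
    print(frags_needed(4096), frags_needed(8192), frags_needed(512))
```

With these definitions blkno 41 has index 1 (the second fragment) in block
40-47, the 4K file needs 4 fragments, the 8K file needs a full block, and
the 512 byte file needs 1 fragment, which matches the listing.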

The backwards seeks are worst with mixtures of small and large files.
Then reading a small file typically results in the drive reading all
nearby blocks but FreeBSD only reading 1 of these.  Then reading a large
file causes the blocks near the small file to be discarded from the
drive's cache.  Then reading a small file causes a seek back to near
the first small file and the drive reading all nearby blocks again,
and FreeBSD only reading 1 of these again...

If ffs didn't seek back like this, then there would always be relatively
large gaps between small files and locality would be defeated in another
way.

Using a block size of 512 results in not really using fragments.  The
allocation problem is simpler.  Then, normally, no gaps are left between
related files, unless multiple processes are creating and deleting
related files concurrently, and backwards seeks are not needed to
read back files that were created sequentially, when the read order
is the same as the write order.
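
To make the comparison concrete, here is a toy count of backwards seeks
when files are read back in creation order.  The first list is the
starting blknos from the prtblknos output earlier in this message.  The
second is a made-up layout (not measured on any file system) in which each
file simply starts where the previous one ended, as a stand-in for
allocation without fragment fill-in:

```python
def backward_seeks(start_blknos):
    """Count reads whose starting blkno is lower than the previous read's."""
    return sum(
        1 for prev, cur in zip(start_blknos, start_blknos[1:]) if cur < prev
    )


def sequential_layout(first_blkno, lengths):
    """Hypothetical layout: each file starts right after the previous one."""
    starts = []
    blkno = first_blkno
    for length in lengths:
        starts.append(blkno)
        blkno += length
    return starts


# Starting blknos of inodes 4, 5, 6, 7 from the listing above.
observed = [41, 42, 64, 46]

# Same files (1, 4, 8 and 1 allocation units) packed back to back.
packed = sequential_layout(41, [1, 4, 8, 1])

if __name__ == "__main__":
    print(backward_seeks(observed), backward_seeks(packed))
```

This only counts direction changes; it says nothing about seek distance or
about what the drive's cache actually does.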

> I ran a Usenet server this way for quite a while with fairly good
> results, though the average file size was a bit bigger, about 2K or so.
> I found that if I didn't use "-o space" that space optimization wouldn't
> kick in soon enough and I'd tend to run out of full blocks that would be
> needed for larger files.  The biggest performance problem that I ran
> into was that as the directories shrank and grew, they would tend to get
> badly fragmented, causing lookups to get slow.  This was in the days
> before dirhash ...

Perhaps FreeBSD ffs now does the backwards seek space optimization more,
or I changed it to do so (the above is with my version).  I tried changing
my version to do the opposite (avoid filling in holes before the current
preferred block), but this gave worse results.  But I think you just saw
a side effect of an old pessimization in ffs block allocation that was
fixed about 10 years ago: ffs used to change the preferred block too
often (for every directory or something like that, so that directories
were allocated far away in another cylinder group).  The backwards seeks
shouldn't go so far back that they reach another cylinder group.  So they
will have to go forward more often, and start new blocks, and thus run
out of full blocks faster.  ffs still has too many cylinder groups, but
they are not so harmful provided the block preference doesn't switch
between them so often.

Bruce