Date: Tue, 29 May 2012 19:41:38 +1000 (EST)
From: Bruce Evans
To: Don Lewis
Cc: freebsd-fs@FreeBSD.org, dougb@FreeBSD.org
Subject: Re: Millions of small files: best filesystem / best options
Message-ID: <20120529182711.Y1436@besplex.bde.org>
In-Reply-To: <201205290806.q4T86K8M007099@gw.catspoiler.org>

On Tue, 29 May 2012, Don Lewis wrote:

> On 29 May, Bruce Evans wrote:
>> On Mon, 28 May 2012, Doug Barton wrote:
>>
>>> On 5/28/2012 10:01 AM, Alessio Focardi wrote:
>>>> So in my case I would have to use -b 4096 -f 512
>>>>
>>>> It's an improvement, but still is not ideal: still a big waste with 200 bytes files!
>>>
>>> Are all of the files exactly 200 bytes? If so that's likely the best you
>>> can do.
>>
>> It is easy to do better by using a file system that supports small block
>> sizes.
>> This might be slow, but it reduces the wastage. Possible file
>> systems:
>
>> - it is easy to fix ffs to support a minimum block size of 512 (by
>>   reducing its gratuitous limit of MINBSIZE and fixing the few things
>>   that break:

I realized just after writing this that it doesn't save much space.

> That shouldn't be necessary, especially if you newfs with the "-o space"
> option to force the fragments for multiple files to be allocated out of
> the same block right from the start instead of waiting to do this once
> the filesystem starts getting full.

But this may pessimize the allocation even further. Even without -o
space, IIRC ffs likes to fill in fragments. It does this even on nearly
empty file systems. This tends to give backwards seeks, which drive
caches might not handle very well (FreeBSD caches don't even attempt to
cache nearby blocks in other files, so for packed small files FreeBSD
depends on drive caches for the i/o performance to not be too bad).
For example, according to my version of prtblknos:

---
% fs_bsize = 8192
% fs_fsize = 1024
% 4: lbn 0 blkno 41
% 5: lbn 0 blkno 42-45
% 6: lbn 0 blkno 64-71
% 7: lbn 0 blkno 46

4: is the inode number of ".". Its data is allocated in the single
blkno 41. These blknos are in ffs allocation units (fragments of size
fs_fsize = 1024). Note that 41 is not a multiple of 8. It is the
second fragment of the ffs block consisting of fragments with blkno's
40-47. Blkno 40 is the first fragment of this block. It is allocated
somewhere in "..".

After creating ".", I created a 4K file. This has inode 5, and is
allocated in the 4 fragments after blkno 41.

Then I created an 8K file. This has inode 6. Since its size is >= the
block size, it is allocated in the full ffs block consisting of the 8
fragments with blkno's 64-71.

Then I created a 512 byte file. This has inode 7. ffs "seeks back"
and allocates it in the next free fragment (#46) in the full block
40-47.
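
The fragment arithmetic in this example can be sketched as follows. This
is illustration only, not ffs code: the names locate, frags_needed and
layout are made up here, and the blkno ranges are simply copied from the
prtblknos output quoted in this message.

```python
# Illustrative only: mirrors the numbers quoted in this message, not ffs source.
FS_BSIZE = 8192                          # fs_bsize from the prtblknos output
FS_FSIZE = 1024                          # fs_fsize from the prtblknos output
FRAGS_PER_BLOCK = FS_BSIZE // FS_FSIZE   # fragments in one full ffs block


def locate(blkno):
    """Return (first blkno of the containing full block, fragment index in it)."""
    frag = blkno % FRAGS_PER_BLOCK
    return (blkno - frag, frag)


def frags_needed(size):
    """Number of fragments needed to hold `size` bytes, rounding up."""
    return -(-size // FS_FSIZE)


# (inode, first blkno, last blkno) in creation order, copied from the output.
layout = [(4, 41, 41), (5, 42, 45), (6, 64, 71), (7, 46, 46)]

for inode, first, last in layout:
    block_start, frag = locate(first)
    block_end = block_start + FRAGS_PER_BLOCK - 1
    print(f"inode {inode}: blkno {first}-{last} is in block "
          f"{block_start}-{block_end}, starting at fragment {frag}")

# A file that starts before the end of the previously created file was
# placed "backwards" relative to creation order.
backward = [cur[0] for prev, cur in zip(layout, layout[1:]) if cur[1] < prev[2]]
print("inodes placed behind their predecessor:", backward)
```

With the four files above, only the last small file is flagged as placed
behind its predecessor, which is the backwards seek described here.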
---

The backwards seeks are worst with mixtures of small and large files.
Then reading of a small file typically results in the drive reading all
nearby blocks but FreeBSD only reading 1 of these. Then reading a large
file causes the blocks near the small file to be discarded from the
drive's cache. Then reading a small file causes a seek back to near the
first small file and the drive reading all nearby blocks again, and
FreeBSD only reading 1 of these again...

If ffs didn't seek back like this, then there would always be relatively
large gaps between small files and locality would be defeated in another
way.

Using a block size of 512 results in not really using fragments. The
allocation problem is simpler. Then, normally, no gaps are left between
related files, unless multiple processes are creating and deleting
related files concurrently, and backwards seeks are not needed to read
back files that were created sequentially, when the read order is the
same as the write order.

> I ran a Usenet server this way for quite a while with fairly good
> results, though the average file size was a bit bigger, about 2K or so.
> I found that if I didn't use "-o space" that space optimization wouldn't
> kick in soon enough and I'd tend to run out of full blocks that would be
> needed for larger files. The biggest performance problem that I ran
> into was that as the directories shrank and grew, they would tend to get
> badly fragmented, causing lookups to get slow. This was in the days
> before dirhash ...

Perhaps FreeBSD ffs now does the backwards seek space optimization more,
or I changed it to do so (the above is with my version). I tried
changing my version to do the opposite (avoid filling in holes before
the current preferred block), but this gave worse results.
But I think you just saw a side effect of an old pessimization in ffs
block allocation that was fixed about 10 years ago: ffs used to change
the preferred block too often (for every directory or something like
that, so that directories were allocated far away in another cylinder
group). The backwards seeks shouldn't go so far back that they reach
another cylinder group. So they will have to go forward more often, and
start new blocks, and thus run out of full blocks faster. ffs still has
too many cylinder groups, but they are not so harmful provided the block
preference doesn't switch between them so often.

Bruce