From: Devin Teske
To: Alessio Focardi
Cc: freebsd-fs@freebsd.org
Date: Mon, 28 May 2012 09:24:51 -0700
Subject: Re: Millions of small files: best filesystem / best options
Message-ID: <922B261C-4AB8-49A9-96CE-16C98B265604@fisglobal.com>
In-Reply-To: <2134924725.5040.1338211317460.JavaMail.root@zimbra.interconnessioni.it>
List-Id: Filesystems

On May 28, 2012, at 6:21 AM, Alessio Focardi wrote:

> Hi,
>
> I'm pretty new to BSD, but I do have some knowledge in Linux.
>
> I'm looking for some advice to efficiently pack millions of small files (200 bytes or less) over a freebsd fs.
>

This is something we've been doing (on FreeBSD) for almost 15 years now (starting with FreeBSD 2.1.5; now 8.1, and soon 8.3). We started with UFS1 and have been evaluating ZFS (we don't think SU+J is ready for production at this scale yet). We haven't used UFS2 yet but have no doubt that it's just as strong as UFS1.

> Those files will be stored in a hierarchical directory structure to limit the number of files for any directory and so (I hope!) speed up file lookups/deletion.
>

FreeBSD handles this wonderfully thanks to all the people that have put in time and effort over the years.

Ten years ago (circa FreeBSD 4.0-RELEASE), people at the company I now work for commonly:

- fiddled with the dirhash sysctl(8) MIB
- modified fsck(8) to make it more efficient
- modified tar(1) to handle high numbers of hard links without falling over
- modified du(1) in a similar fashion to tar above
- more; all in the name of doing what you're describing (but on steroids)

but all those patches eventually made their way back into FreeBSD, and we generally haven't had to worry about even tens of millions of JPEG-sized (~200KB) files on a RAID formatted in UFS (1 or 2) since, say, FreeBSD-6 (but someone in FS will be able to give a more accurate release when things really started to stabilize). Either way, 6, 7, 8, and 9 all had very stable filesystems w/respect to millions-of-small-files.

> I have to say that I'm looking at fbsd for my project because both UFS2 and ZFS have some flavour of "block suballocation" "tail packing" "variable record size", at least documentation says so.
>
> My hope is to waste as little space as possible, even sacrificing some speed: I can't use a full block for a single file, or I will end up wasting 99% of the space!
>

I wasn't aware that FreeBSD was unique in this respect, but yes, FreeBSD has a block size and a fragment size. While formatting a UFS filesystem you can specify these sizes with the "-b SIZE" and "-f SIZE" arguments to newfs(8), for example:

	newfs -b 16384 -f 2048 /dev/da0s1a

This will format a RAID (/dev/da0s1a) with a 16K block size but a 2K fragment size. A small file (anything from 1 byte up to 2K) will then use only a single 2K fragment of disk space rather than a full 16K block. This is the "block suballocation" you speak of. The above parameters are exactly what we use when formatting our RAIDs for storing millions of JPEG-sized (~200KB) files.

>
> Does anyone have experience in a similar situation and is willing to give some advice on which fs I should choose and how to tune it for this particular scenario?
>

Choose your hardware wisely. After you have chosen your hardware wisely, set it up even more wisely.

For example, we go through a multi-day burn-in process on RAIDs that have double-digit numbers of disks.

Be smart about allocating logical versus physical media in a way that reduces bottlenecks.

Go through any/all failure/recovery test procedures before putting data on the device if you don't already trust the hardware. Trust in the hardware is very important. If you don't trust your hardware's battery-backed DIMM for write-back cache (for example), then I have one very important recommendation when it comes to UFS: disable the SoftUpdates feature.

Disabling SoftUpdates on a UFS filesystem causes a huge performance impact, but it will allow you to sleep at night.
In 15 years, UFS has never barfed on us, except for maybe three memorable events that entire groups of individuals can recount with amazing clarity: debugging horked filesystems late into the night after SoftUpdates ate the kid's homework (leaving tens to hundreds of thousands of files in lost+found). We routinely use SoftUpdates on _other_ UFS filesystems (like system partitions including "/var" and "/usr"), but _never_ on the RAIDs housing those millions-of-little-files.

Others' mileage may vary.

>
> Thank you very much, appreciated!
>
>

No problem.

> ps
>
> I know that probably a database will fit better in this situation, but in my case I can't take that route :(
>

Not necessarily. A database has the immediate-and-clear downside that if one bit in the database changes, a backup tool like bacula has to back up the entire database again.

…and the database administrator is not necessarily the same person as the backup administrator (just sayin').

--
Devin
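
[Editorial sketch] The block/fragment arithmetic behind the newfs example above can be estimated with a few lines of sh. This is only a rough illustration under stated assumptions: file data is assumed to be rounded up to a whole number of fragments, inode and indirect-block overhead is ignored, and the helper names alloc_bytes and waste_percent are hypothetical. The 512-byte fragment size in the second call is likewise just an illustrative value, not a recommendation from the message.

```shell
#!/bin/sh
# Rough estimate of data space consumed by small files on UFS.
# Assumption: file data is rounded up to a whole number of fragments;
# inode and indirect-block overhead are ignored.
# alloc_bytes and waste_percent are hypothetical helper names.

# alloc_bytes FILE_SIZE FRAG_SIZE -> bytes of data space allocated
alloc_bytes() {
    size=$1
    frag=$2
    # Integer ceiling division: number of fragments needed.
    nfrags=$(( (size + frag - 1) / frag ))
    echo $(( nfrags * frag ))
}

# waste_percent FILE_SIZE FRAG_SIZE -> percent of allocated space left unused
waste_percent() {
    size=$1
    frag=$2
    alloc=$(alloc_bytes "$size" "$frag")
    if [ "$alloc" -eq 0 ]; then
        # An empty file needs no data fragments at all.
        echo 0
    else
        echo $(( (alloc - size) * 100 / alloc ))
    fi
}

# A 200-byte file with the 2K fragments from the newfs example:
echo "200 B file, 2048 B frags: $(alloc_bytes 200 2048) bytes, $(waste_percent 200 2048)% unused"
# The same file with an illustrative 512-byte fragment size:
echo "200 B file, 512 B frags: $(alloc_bytes 200 512) bytes, $(waste_percent 200 512)% unused"
```

Under these assumptions, the fragment size rather than the block size dominates the overhead for 200-byte files, so the "-f" value is the one worth experimenting with for that workload.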