From owner-freebsd-hackers@freebsd.org Wed Jul 15 22:06:22 2015
Date: Wed, 15 Jul 2015 15:06:21 -0700
From: John-Mark Gurney
To: Dieter BSD
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: format/newfs larger external consumer drives
Message-ID: <20150715220621.GP8523@funkthat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Operating-System: FreeBSD 9.1-PRERELEASE amd64
X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396
X-Files: The truth is out there
X-URL: http://resnet.uoregon.edu/~gurney_j/
X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html
X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE
X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger?
User-Agent: Mutt/1.5.21 (2010-09-15)

Dieter BSD wrote this message on Wed, Jul 15, 2015 at 10:37 -0700:
> [ freebsd-fs@ added ]
>
> >> If the average filesize will be large, use large block/frag sizes.
> >> I use 64 KiB / 8 KiB. And reduce the number of inodes. I reduce
> >> inodes as much as newfs allows and there are still way too many.
> >
> > Can you think of an algorithmic way to express this? I.e., you don't
> > want blocks to get *too* large as you risk greater losses in "partial
> > fragments", etc. Likewise, you don't want to run out of inodes.
>
> I look at df -i for existing filesystems with similar filesizes.
> My data filesystems usually get an entire disk (..., 2TB, 3TB, recently 5TB),
> and with 64/8 block/frag and as few inodes as newfs will allow,
> df still reports numbers like 97% full but only 0% or 1% of
> inodes used.
>
> density reduced from 67108864 to 14860288
> /dev/ada1: 4769307.0MB (9767541168 sectors) block size 65536, fragment size 8192
>         using 1315 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.
>         with soft updates
>
> I should take another look at increasing the size of cylinder groups.

Right now the cg is by default sized to fill a block... I don't believe
it can be made larger without a major overhaul of the code... The
default used to be even smaller than a full block, causing even more
cgs to be created, and you had to do trial and error to figure out how
to make a cg fill a full block...

> Newfs likes very small cylinder groups, which made sense 30 years ago,
> when disks were like 40 MB and file sizes were a lot smaller.
> IIRC, each cylinder group gets at least one block of inodes, and with
> file sizes of 1-20 GB I get way too many inodes.

This is partly because the default number of inodes is too large... The
current documented default is an inode for every 4 * frag_size bytes of
data space, which isn't correct!!!  This was changed to 2 in r228794 to
keep the number of inodes the same across the transition from 16k/2k to
32k/4k, but the documentation was not updated... It has now been updated
in r285615 and will be MFC'd...

On my dev server, where I have a few source trees checked out:

Filesystem     Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/ada0s2d   185G    122G     48G    72%    2.8M  9.5M   23%   /a

This fs has a non-standard config in that my frag size is 8k... If it
were standard, I'd have twice as many inodes... Increasing the frag
size both cuts the number of inodes in half and increases the cg size...

Standard:
/dev/ada0s2d: 192068.0MB (393355264 sectors) block size 32768, fragment size 4096
        using 307 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.

Non-standard:
/dev/ada0s2d: 192068.0MB (393355264 sectors) block size 32768, fragment size 8192
        using 166 cylinder groups of 1162.97MB, 37215 blks, 74496 inodes.

The other thing I didn't realize (and it would be useful for someone to
benchmark) is that many SSDs now use an 8k page size instead of the
previous 4k...

Maybe this needs to be more of a sliding scale based upon disk size?
Maybe go from 2 * frag to 4 * frag for filesystems larger than 1TB?

Though this is still something that a system admin needs to address;
it's impossible to make the defaults sane for all use cases... Some
people will only keep multi-GB files on their 5 TB fs, and so only need
a few thousand inodes, while others may keep many smaller files...

It'd be nice to put together a fs survey to see what sizes of
filesystems people have, and the distribution of file sizes... I'll try
to do that...
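To make the density arithmetic above concrete, here's a quick sketch.
This is NOT the real newfs(8) algorithm (newfs also rounds inodes up to
fill whole blocks per cylinder group); it only applies the one-inode-per-
(2 * frag_size)-bytes ratio discussed, using the 192068.0MB fs from the
outputs above:

```python
# Rough illustration of how UFS inode density relates to frag size.
# Not newfs(8) itself -- just the bytes-per-inode ratio from the thread.

def inodes_for(fs_bytes, frag_size, density_multiplier=2):
    """One inode per (density_multiplier * frag_size) bytes of data space."""
    bytes_per_inode = density_multiplier * frag_size
    return fs_bytes // bytes_per_inode

fs_bytes = 393355264 * 512  # 393355264 sectors of 512 bytes = ~192068 MB

# Standard 4k frags vs the non-standard 8k frags:
print(inodes_for(fs_bytes, 4096))  # 24584704
print(inodes_for(fs_bytes, 8192))  # 12292352 -- half as many inodes
```

Which lines up with the newfs outputs: 307 cg * 80256 inodes is about
24.6M, and 166 cg * 74496 is about 12.4M, the small overshoot being the
per-cg rounding this sketch ignores.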
> Yes, a larger frag size will waste some space in the last frag of a file,
> but having smaller block & frag sizes uses a lot of space to keep track
> of all those blocks and frags. And makes more work for fsck.

Yep...

> > "risk" of loss/cost of recovery (when the medium
> > *is* unceremoniously dismounted
>
> Some panics don't sync the disks. Sometimes disks just go into a coma.
> Soft updates is supposed to limit problems to those that fsck -p will
> automagically fix (assuming the disk's write cache is turned off).
> There is at least one case where it does not. See PR 166499 (from 2012,
> still not fixed).
>
> As long as I'm whining about unfixed filesystem PRs, see also
> bin/170676: newfs creates a filesystem that does not pass fsck.
> (also from 2012)
>
> > I am concerned with the fact that users can so easily/carelessly
> > "unplug" a USB device without the proper incantations beforehand. Of
> > course, *their* mistake is seen as a "product design flaw"! :-/
>
> Superglue the cable in place? :-)
>
> Perhaps print up something like "Unmount filesystem(s) before unplugging
> or powering off the external disk, or you might lose your data.",
> laminate it, and attach it to the cables?

The same problem exists on Windows... Their policy is to turn off write
buffering on pluggable thumb drives to help eliminate this... For UFS,
the sync flag should be passed to mount...

[...]

> Alternately, instead of panicing, could the filesystem just
> umount -f the offending filesystem? (And whine to log(9).)
>
> I am very tired of having an entire machine panic just because
> one disk decided to take a nap. This is not how you get 5 9s. :-(

There has been lots of work to make file systems not panic when the
underlying drives disappear, though clearly more work is needed...
Patches welcome! :)

-- 
John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."