Date:      Wed, 15 Jul 2015 15:06:21 -0700
From:      John-Mark Gurney <jmg@funkthat.com>
To:        Dieter BSD <dieterbsd@gmail.com>
Cc:        freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject:   Re: format/newfs larger external consumer drives
Message-ID:  <20150715220621.GP8523@funkthat.com>
In-Reply-To: <CAA3ZYrB7i-Cjfv0UX1mb_RPmJdnj2LQw0apDd6+0fhKkrhH+PQ@mail.gmail.com>
References:  <CAA3ZYrB7i-Cjfv0UX1mb_RPmJdnj2LQw0apDd6+0fhKkrhH+PQ@mail.gmail.com>

Dieter BSD wrote this message on Wed, Jul 15, 2015 at 10:37 -0700:
> [ freebsd-fs@ added ]
> 
> >> If the average filesize will be large, use large block/frag sizes.
> >> I use 64 KiB / 8 KiB.  And reduce the number of inodes.  I reduce
> >> inodes as much as newfs allows and there are still way too many.
> >
> > Can you think of an algorithmic way to express this?  I.e., you don't
> > want blocks to get *too* large as you risk greater losses in "partial
> > fragments", etc.  Likewise, you don't want to run out of inodes.
> 
> I look at df -i for existing filesystems with similar filesizes.
> My data filesystems usually get an entire disk (..., 2TB, 3TB, recently 5TB)
> and with 64/8 block/frag and as few inodes as newfs will allow
> df still reports numbers like 97% full but only using 0% or 1%
> of inodes.
> 
> density reduced from 67108864 to 14860288
> /dev/ada1: 4769307.0MB (9767541168 sectors) block size 65536, fragment size 8192
>         using 1315 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.
>         with soft updates
> 
> I should take another look at increasing the size of cylinder groups.
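
(For reference, the "density reduced" message in the quoted output is
newfs clamping an over-large -i, so it corresponds to an invocation
along the lines of:

        newfs -U -b 65536 -f 8192 -i 67108864 /dev/ada1

where -i is the number of bytes of data space per inode.)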

Right now the cg by default is made to fill a block...  I don't
believe it can be made larger without a major overhaul of the code...
The default used to be even smaller than a full block, causing even
more cgs to be created, and you had to do trial and error to figure
out how to make a cg fill a full block...
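
An easy way to see what geometry an existing fs ended up with is
dumpfs; its -m flag prints a newfs command line that would recreate
the filesystem (device name as in the quoted output):

        dumpfs -m /dev/ada1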

> Newfs likes very small cylinder groups, which made sense 30 years ago,
> when disks were like 40 MB and file sizes were a lot smaller.  IIRC, each
> cylinder group gets at least one block of inodes, and with file sizes
> of 1-20 GB I get way too many inodes.

This is partly because the default number of inodes is too large...
The currently documented default is an inode for every 4 * frag_size
bytes of data space, which isn't correct!!!  The real default was
changed to 2 * frag_size in r228794, to keep the number of inodes the
same across the transition from 16k/2k to 32k/4k block/frag sizes,
but the documentation was not updated...  It has now been fixed in
r285615 and will be MFC'd...
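
If you're scripting newfs and don't want to depend on which default it
implements, -i pins the density explicitly; a sketch, where the device
name is just an example and 8192 is 2 * frag_size for the standard
32k/4k geometry:

        # 2 * frag_size = 2 * 4096 = 8192 bytes of data space per inode
        newfs -U -i 8192 /dev/ada0s2d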

On my dev server where I have a few source trees checked out:
Filesystem      Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/ada0s2d    185G    122G     48G    72%    2.8M  9.5M   23%   /a

This fs has a non-standard config in that my frag size is 8k...  If it
were standard, I'd have twice as many inodes...  Increasing the frag
size cuts the number of inodes in half, but it also increases the cg
size...

Standard:
/dev/ada0s2d: 192068.0MB (393355264 sectors) block size 32768, fragment size 4096
        using 307 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.

Non-standard:
/dev/ada0s2d: 192068.0MB (393355264 sectors) block size 32768, fragment size 8192
        using 166 cylinder groups of 1162.97MB, 37215 blks, 74496 inodes.
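
As a sanity check, those inode counts match the real 2 * frag_size
density, modulo newfs rounding up to full inode blocks:

        standard:     626.09MB / (2 * 4096)  ~= 80140 per cg (newfs says 80256)
        non-standard: 1162.97MB / (2 * 8192) ~= 74430 per cg (newfs says 74496)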

The other thing I didn't realize (and which would be useful for
someone to benchmark) is that many SSDs now use an 8k page size
instead of the previous 4k...

Maybe this needs to be more of a sliding scale based upon disk size?
Maybe go from 2 * frag to 4 * frag for filesystems larger than 1TB?
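
To make that concrete, a sketch of the sliding scale in sh (the 1TB
threshold and 4x multiplier are just the guesses above, not anything
implemented):

        # fssize_bytes and fragsize would come from the device and fs
        # parameters; density is the bytes-of-data-space-per-inode result
        if [ "$fssize_bytes" -gt $((1 << 40)) ]; then
                density=$((4 * fragsize))
        else
                density=$((2 * fragsize))
        fi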

This is still something that a system admin needs to address, though;
it's impossible to make the defaults sane for all use cases...  Some
people will only keep multi-GB files on their 5 TB fs, and so only
need a few thousand inodes, but others may keep many smaller files...

It'd be nice to put together a fs survey to see what sizes of
filesystems people have, and the distribution of file sizes...

I'll try to do that...
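
For the file size side, something like this per mountpoint would be a
start (BSD stat(1); /data is just an example path; buckets are powers
of two):

        find /data -xdev -type f -exec stat -f %z {} + | \
            awk '{ b = 1; while (b < $1) b *= 2; h[b]++ } END { for (s in h) print s, h[s] }' | \
            sort -n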

> Yes, a larger frag size will waste some space in the last frag of a file,
> but having smaller block&frag sizes uses a lot of space to keep track of
> all those blocks and frags.  And makes more work for fsck.

Yep...
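
(Back of the envelope: the last frag of a file wastes half a frag on
average, so going from 4k to 8k frags costs ~2k more per file; a
million files is only ~2GB extra on a multi-TB fs.)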

> > "risk" of loss/cost of recovery (when the medium
> > *is* unceremoniously dismounted
> 
> Some panics don't sync the disks.  Sometimes disks just go into a coma.
> Soft updates is supposed to limit problems to those that fsck -p will
> automagically fix.  (assuming the disk's write cache is turned off)  There
> is at least one case where it does not.  See PR 166499 (from 2012,
> still not fixed).
> 
> As long as I'm whining about unfixed filesystem PRs, see also
> bin/170676: Newfs creates a filesystem that does not pass fsck.
> (also from 2012)
> 
> > I am concerned with the fact that users can so easily/carelessly "unplug"
> > a USB device without the proper incantations beforehand.  Of course, *their*
> > mistake is seen as a "product design flaw"!  :-/
> 
> Superglue the cable in place?  :-)
> 
> Perhaps print up something like "Unmount filesystem(s) before unplugging
> or powering off external disk, or you might lose your data.",
> laminate it and attach it to the cables?

The same problem exists on Windows...  They have a policy of turning
off write buffering on pluggable thumb drives to help eliminate
this...  For UFS, the sync flag should be provided to mount...
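
Something like this at mount time (device and mountpoint names are
only examples):

        mount -t ufs -o sync /dev/da0s1a /mnt/usb

        # and the quoted write-cache caveat maps to a sysctl for ada(4)
        # disks, though USB umass disks are a different story:
        sysctl kern.cam.ada.write_cache=0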

[...]

> Alternately, instead of panicking, could the filesystem just
> umount -f the offending filesystem?  (And whine to log(9).)
> 
> I am very tired of having an entire machine panic just because
> one disk decided to take a nap.  This is not how you get 5 9s.  :-(

There has been lots of work to try to make file systems not panic
when the underlying drives disappear, though clearly more work is
needed...  Patches welcome! :)

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."


