From owner-freebsd-hackers@freebsd.org Wed Jul 15 22:06:22 2015
Date: Wed, 15 Jul 2015 15:06:21 -0700
From: John-Mark Gurney
To: Dieter BSD
Cc: freebsd-hackers@freebsd.org, freebsd-fs@freebsd.org
Subject: Re: format/newfs larger external consumer drives
Message-ID: <20150715220621.GP8523@funkthat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
X-Operating-System: FreeBSD 9.1-PRERELEASE amd64
X-PGP-Fingerprint: 54BA 873B 6515 3F10 9E88 9322 9CB1 8F74 6D3F A396
X-Files: The truth is out there
X-URL: http://resnet.uoregon.edu/~gurney_j/
X-Resume: http://resnet.uoregon.edu/~gurney_j/resume.html
X-TipJar: bitcoin:13Qmb6AeTgQecazTWph4XasEsP7nGRbAPE
X-to-the-FBI-CIA-and-NSA: HI! HOW YA DOIN? can i haz chizburger?
User-Agent: Mutt/1.5.21 (2010-09-15)

Dieter BSD wrote this message on Wed, Jul 15, 2015 at 10:37 -0700:
> [ freebsd-fs@ added ]
>
> >> If the average filesize will be large, use large block/frag sizes.
> >> I use 64 KiB / 8 KiB. And reduce the number of inodes. I reduce
> >> inodes as much as newfs allows and there are still way too many.
> >
> > Can you think of an algorithmic way to express this? I.e., you don't
> > want blocks to get *too* large as you risk greater losses in "partial
> > fragments", etc. Likewise, you don't want to run out of inodes.
>
> I look at df -i for existing filesystems with similar filesizes.
> My data filesystems usually get an entire disk (..., 2TB, 3TB, recently 5TB),
> and with 64/8 block/frag and as few inodes as newfs will allow,
> df still reports numbers like 97% full but only 0% or 1% of
> inodes used.
>
> density reduced from 67108864 to 14860288
> /dev/ada1: 4769307.0MB (9767541168 sectors) block size 65536, fragment size 8192
>         using 1315 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.
>         with soft updates
>
> I should take another look at increasing the size of cylinder groups.

Right now the cg is by default sized to fill a block... I don't believe
it can be made larger without a major overhaul of the code... The
default used to be even smaller than a full block, causing even more
cgs to be created, and you had to do trial and error to figure out how
to make a cg fill a full block...

> Newfs likes very small cylinder groups, which made sense 30 years ago,
> when disks were like 40 MB and file sizes were a lot smaller.
> IIRC, each cylinder group gets at least one block of inodes, and with
> file sizes of 1-20 GB I get way too many inodes.

This is partly because the default number of inodes is too large... The
current documented default is an inode for every 4 * frag_size bytes of
data space, which isn't correct!!!  This was changed to 2 in r228794 to
keep the number of inodes the same across the transition from 16k/2k to
32k/4k, but the documentation was not updated... It has now been updated
in r285615 and will be MFC'd...

On my dev server, where I have a few source trees checked out:

Filesystem     Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/ada0s2d   185G    122G     48G    72%    2.8M  9.5M   23%   /a

This fs has a non-standard config in that my frag size is 8k... If it
were standard, I'd have twice as many inodes... Increasing the frag
size both cuts the number of inodes in half and increases the cg size...

Standard:
/dev/ada0s2d: 192068.0MB (393355264 sectors) block size 32768, fragment size 4096
        using 307 cylinder groups of 626.09MB, 20035 blks, 80256 inodes.

Non-standard:
/dev/ada0s2d: 192068.0MB (393355264 sectors) block size 32768, fragment size 8192
        using 166 cylinder groups of 1162.97MB, 37215 blks, 74496 inodes.

The other thing I didn't realize (and it would be useful for someone to
benchmark) is that many SSDs now use an 8k page size instead of the
previous 4k...

Maybe this needs to be more of a sliding scale based upon disk size?
Maybe go from 2 * frag to 4 * frag for filesystems larger than 1TB?

Though this is still something that a system admin needs to address;
it's impossible to make the defaults sane for all use cases... Some
people will only keep multi-GB files on their 5 TB fs, and so only need
a few thousand inodes, while others may keep many smaller files...

It'd be nice to put together a fs survey to see what sizes of
filesystems people have, and the distribution of file sizes... I'll try
to do that...
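To make the density arithmetic above concrete, here's a quick sketch.
This is NOT the real newfs(8) algorithm (newfs also rounds inodes up to
fill whole blocks per cylinder group); it only applies the one-inode-per-
(2 * frag_size)-bytes ratio discussed, using the 192068.0MB fs from the
outputs above:

```python
# Rough illustration of how UFS inode density relates to frag size.
# Not newfs(8) itself -- just the bytes-per-inode ratio from the thread.

def inodes_for(fs_bytes, frag_size, density_multiplier=2):
    """One inode per (density_multiplier * frag_size) bytes of data space."""
    bytes_per_inode = density_multiplier * frag_size
    return fs_bytes // bytes_per_inode

fs_bytes = 393355264 * 512  # 393355264 sectors of 512 bytes = ~192068 MB

# Standard 4k frags vs the non-standard 8k frags:
print(inodes_for(fs_bytes, 4096))  # 24584704
print(inodes_for(fs_bytes, 8192))  # 12292352 -- half as many inodes
```

Which lines up with the newfs outputs: 307 cg * 80256 inodes is about
24.6M, and 166 cg * 74496 is about 12.4M, the small overshoot being the
per-cg rounding this sketch ignores.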
> Yes, a larger frag size will waste some space in the last frag of a file,
> but having smaller block & frag sizes uses a lot of space to keep track
> of all those blocks and frags. And makes more work for fsck.

Yep...

> > "risk" of loss/cost of recovery (when the medium
> > *is* unceremoniously dismounted
>
> Some panics don't sync the disks. Sometimes disks just go into a coma.
> Soft updates is supposed to limit problems to those that fsck -p will
> automagically fix (assuming the disk's write cache is turned off).
> There is at least one case where it does not. See PR 166499 (from 2012,
> still not fixed).
>
> As long as I'm whining about unfixed filesystem PRs, see also
> bin/170676: newfs creates a filesystem that does not pass fsck.
> (also from 2012)
>
> > I am concerned with the fact that users can so easily/carelessly
> > "unplug" a USB device without the proper incantations beforehand. Of
> > course, *their* mistake is seen as a "product design flaw"! :-/
>
> Superglue the cable in place? :-)
>
> Perhaps print up something like "Unmount filesystem(s) before unplugging
> or powering off the external disk, or you might lose your data.",
> laminate it, and attach it to the cables?

The same problem exists on Windows... Their policy is to turn off write
buffering on pluggable thumb drives to help eliminate this... For UFS,
the sync flag should be passed to mount...

[...]

> Alternately, instead of panicing, could the filesystem just
> umount -f the offending filesystem? (And whine to log(9).)
>
> I am very tired of having an entire machine panic just because
> one disk decided to take a nap. This is not how you get 5 9s. :-(

There has been lots of work to make file systems not panic when the
underlying drives disappear, though clearly more work is needed...
Patches welcome! :)

-- 
John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."