Date:      Wed, 15 Jul 2015 14:15:17 -0700
From:      Don whY <Don.whY@gmx.com>
To:        freebsd-hackers@freebsd.org
Subject:   Re: format/newfs larger external consumer drives
Message-ID:  <55A6CD65.6060102@gmx.com>
In-Reply-To: <CAA3ZYrB7i-Cjfv0UX1mb_RPmJdnj2LQw0apDd6+0fhKkrhH+PQ@mail.gmail.com>
References:  <CAA3ZYrB7i-Cjfv0UX1mb_RPmJdnj2LQw0apDd6+0fhKkrhH+PQ@mail.gmail.com>

On 7/15/2015 10:37 AM, Dieter BSD wrote:
> [ freebsd-fs@ added ]
>
>>> If the average filesize will be large, use large block/frag sizes.
>>> I use 64 KiB / 8 KiB.  And reduce the number of inodes.  I reduce
>>> inodes as much as newfs allows and there are still way too many.
>>
>> Can you think of an algorithmic way to express this?  I.e., you don't
>> want blocks to get *too* large as you risk greater losses in "partial
>> fragments", etc.  Likewise, you don't want to run out of inodes.
>
> I look at df -i for existing filesystems with similar filesizes.

OK, that makes sense.  A developer could build an artificial dataset
and then see what it looks like; tweak the filesystem structure,
rebuild and see what *that* looks like, etc.  I guess with experience,
it would be relatively easy to select a good starting point for such
iterations...
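
(Just to make the knobs concrete -- the numbers below are placeholders
for that kind of experiment, not a recommendation.  -i is bytes-per-inode,
so a big value means few inodes; -c would be the knob for the
cylinder-group size discussed below:)

    # large block/frag, far fewer inodes than the default, soft updates
    newfs -U -b 65536 -f 8192 -i 4194304 /dev/ada1

    # once it's mounted and populated, compare space vs. inode consumption
    df -i /mnt/test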

> My data filesystems usually get an entire disk (..., 2TB, 3TB, recently 5TB)
> and with 64/8 block/frag and as few inodes as newfs will allow
> df still reports numbers like 97% full but only using 0% or 1%
> of inodes.
>
> density reduced from 67108864 to 14860288
> /dev/ada1: 4769307.0MB (9767541168 sectors) block size 65536, fragment size 8192
>          using 1315 cylinder groups of 3628.00MB, 58048 blks, 256 inodes.
>          with soft updates
>
> I should take another look at increasing the size of cylinder groups.
> Newfs likes very small cylinder groups, which made sense 30 years ago when
> disks were like 40 MB and file sizes were a lot smaller.  IIRC, each
> cylinder group gets at least one block of inodes, and with file sizes
> of 1-20 GB I get way too many inodes.
>
> Yes, a larger frag size will waste some space in the last frag of a file,
> but having smaller block&frag sizes uses a lot of space to keep track of
> all those blocks and frags.  And makes more work for fsck.

So, fsck's effort (and execution *time*) is based *mostly* on inodes?
The "volume" of data is largely unimportant?

>> "risk" of loss/cost of recovery (when the medium
>> *is* unceremoniously dismounted
>
> Some panics don't sync the disks.  Sometimes disks just go into a coma.
> Soft updates is supposed to limit problems to those that fsck -p will
> automagically fix.  (assuming the disk's write cache is turned off)  There
> is at least one case where it does not.  See PR 166499 (from 2012,
> still not fixed).
>
> As long as I'm whining about unfixed filesystem PRs, see also
> bin/170676: Newfs creates a filesystem that does not pass fsck.
> (also from 2012)
>
>> I am concerned with the fact that users can so easily/carelessly "unplug"
>> a USB device without the proper incantations beforehand.  Of course, *their*
>> mistake is seen as a "product design flaw"!  :-/
>
> Superglue the cable in place?  :-)

<frown>  You know *that* will go over like a lead balloon!

> Perhaps print up something like "Unmount filesystem(s) before unplugging
> or powering off external disk, or you might lose your data.",
> laminate it and attach it to the cables?

You'll still get folks who realize their mistake an ohnosecond too late.
And, of course, want to push the blame onto the device:  "Why can't it
make sure I don't do this?  (well, that's obvious!)  Or, at least
give me an option whereby I can plug it in within N seconds and everything
behaves as if I hadn't done that??"  (that, of course, opens even more
cans of worms)

[The intended "user" isn't expected to be a "hacker" -- or even computer
literate.  Hence "appliance" and not "computer system".]

>> The "demo app" that I'm working on is a sort of (low performance) NAS
>> built on SBC's and external drives.
>
> I assume that the drives *have* to be external?  Do they have to be
> usb?  Could they be e-sata?  E-sata is faster and avoids the various usb
> problems.  They used to sell external drives where the sata-to-usb bridge
> was in a separate little module box.  They had alternate modules with
> e-sata, firewire, etc.  The disk box had a standard internal ('L')
> sata connector, except a standard sata connector was too large to fit.
> So I took out my Swiss Army Knife and carved off some plastic from
> the connector on a standard sata cable so that it would fit.
> You could also put a standard sata drive into an enclosure (with
> a small fan) and use your choice of connection to the computer.

Two separate issues, here.  My "demo" -- and the sorts of apps that
other developers are likely to pursue.

In my case, there isn't any real physical room for an internal disk.
I'm using Dell FX160's -- they'll support a SATA laptop drive
and a SATA "memory module" (i.e., this connector is mounted *on*
the SBC and "points straight up").  Power supply is severely limited,
etc.  OTOH, lots of USB ports and the throughput needs are minimal
(I'm only using this to fetch ISO's, etc. when I need to reinstall
some software, drivers/documentation for odd bits of hardware, etc.
The point is to get rid of the piles of CD/DVD media that I've
accumulated over the years)

In the "developer" case, the hardware that they will typically have
available will be more along the lines of SoC's.  Early devices had
"slave/peripheral mode" USB controllers but the more recent offerings
include host support.

So, a developer wishing to integrate something like a disk drive
(magnetic/optical/SSD) into a product offering can usually just
hang it off a USB "connection" -- instead of having to include
a disk controller in his/her design.  This also frees the
developer from getting dragged into the "commodity markup"
mess -- folks know what disks are, what they cost, etc.  So, if
you offer a disk of a particular capacity in your product, they
immediately equate that to the commodity pricing they've seen
for similar "components":  "Wow, that's a helluva lot to pay for
a disk of that size!" or "Well, the disk is worth $X so he's charging
an awful lot for the rest of the device!"

>>> USB specific stuff: There is an off by 1 sector problem, which will
>>> bite you if you switch a drive between using the sata-usb bridge
>>> and connecting the drive directly to a sata controller.  I had to
>>
>> Ouch!  I've not seen that with PATA-USB bridges.  OTOH, I tend not
>> to pull a drive *from* an external enclosure but, rather, rely on
>> the external enclosures to provide portability.  E.g., easier to
>> move 500G of files from machineA to machineB by physically moving
>> the volume containing them!
>
> Apparently they vary, see the message from Warren.  Mine was missing
> the first sector, so I had to have the kernel hunt for the partitioning
> info.

<cringe> That must have been a painful discovery!  ("WTF???")

> The external drives I've seen do not have fans, and have little or
> no ventilation.  If the drive will be spinning for a while I worry
> about it overheating.

Correct.  I only use external drives for sporadic service.  E.g.,
spin up the drive, get/put what I want, then spin it down.  As a result, most of my
largish external drives are 5 or 6 years old and have very few total
hours...
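
(The "spin it down" part can be done by hand -- assuming the USB-SATA
bridge passes the command through, which not all of them do; da0 here is
just a stand-in:)

    # SCSI START STOP UNIT with the start bit cleared -- spins the
    # platters down until the next access
    camcontrol stop da0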

>> The "demo app" will try to use the large multi-TB drives of which I
>> have little long-term experience.  OTOH, the usage model is "fire it
>> up, pull off whichever files you need, then spin everything down"...
>> until the next time you might need to retrieve an ISO (a week later?)
>
> With this usage model it sounds like you could use a read-only mount.
> Would an optical drive work for this application?

No.  Optical media are too small.  The whole point was to get rid of
the spindles of CD/DVD media and put things in a more accessible form.

I have a job that runs on each box that methodically walks through
the contents of the attached media (which might vary from boot to boot)
verifying checksums (which are stored in an RDBMS on another box).
So, in theory, I have a bit of assurance as to whether or not a
particular file is still "intact" (like running an audit on a RAID
array) as well as the integrity of the medium as a whole.
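
(The verification pass itself is conceptually just a walk like the one
below, with the results compared against the stored tuples; the paths are
made up:)

    # emit "hash path" pairs in md5 -r's terse format for every file
    # on the attached media, to be checked against the database
    find /media -type f -exec md5 -r {} + > /tmp/current.md5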

At the same time, another job rsync's changes detected on the monitored
media to any "mirrors" that are also detected as being on-line (perhaps
on a different network node).
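
(Roughly an rsync of each monitored tree to whichever mirror happens to be
reachable; the host and paths below are illustrative only:)

    # -a preserves permissions/times so the mirror stays a faithful copy
    rsync -a /media/archive1/ mirrorhost:/media/archive1/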

The RDBMS lets me query for what I want and where it is likely to be
located -- before spinning up any media.  A simple
schema lets me store (ID, name, containerID, size, MD5, etc.) tuples
(so I can even find specific files inside tarballs, ISO's, etc.).

Lastly, the fact that the disks are external, "simple" filesystems
(e.g., not RAID/ZFS) means I can physically move a drive to another
machine and treat it like a "normal" drive -- without having to
worry about rebuilding arrays, etc.
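
(I.e., plug the drive into whatever box is handy and mount it -- read-only
if all I want is to pull files off.  Assuming it shows up as da0 with a
single UFS partition:)

    # no array metadata to rebuild -- just mount the partition
    mount -t ufs -o ro /dev/da0p1 /mnt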

[I initially tried doing this with NTFS volumes.  This would allow
the volumes to be conveniently mounted on Windows machines without
relying on the appliance's role in providing access.  But, "checking"
an NTFS volume of that size tethered via a USB interface takes
AGES!!  (And, I'm not sure how comfortable I am with the current
NTFS support)]

>>> If the drive disappears with filesystem(s) mounted, the kernel might
>>> very well panic.  There was a discussion of this problem recently.
>>> I thought that FUSE was suggested as a possible solution, but I
>>> can't find the discussion.  This problem is not limited to users
>>> disconnecting usb drives without unmounting them.  The problem
>>> happens all by itself with internal drives, as the drive, port
>>> multiplier, controller, or device driver decides to go out to lunch,
>>> and the kernel panics.  This happens *far* too often, and *kills*
>>> reliability.  We really need a solution for this.
>>
>> I think it's hard to back-port these sorts of things.  Much easier
>> to consider the possibility of failure when initially designing the
>> system, interfaces, etc.
>
> I wonder how hard it would be to create a FUSE version of FFS?
> Any thoughts from the filesystem wizards?
>
> Alternatively, instead of panicking, could the filesystem just
> umount -f the offending filesystem?  (And whine to log(9).)
>
> I am very tired of having an entire machine panic just because
> one disk decided to take a nap.  This is not how you get 5 9s.  :-(

Or, power glitches, firmware bugs, etc.


