From: Warner Losh <wlosh@bsdimp.com>
Date: Tue, 8 Dec 2015 10:07:42 -0700
To: Dag-Erling Smørgrav
Cc: "freebsd-hackers@freebsd.org"
Subject: Re: Fwd: DELETE support in the VOP_STRATEGY(9)?
In-Reply-To: <864mfssxgt.fsf@desk.des.no>
References: <201512052002.tB5K2ZEA026540@chez.mckusick.com>
 <86poyhqsdh.fsf@desk.des.no> <86fuzdqjwn.fsf@desk.des.no>
 <864mfssxgt.fsf@desk.des.no>
List-Id: Technical Discussions relating to FreeBSD

On Tue, Dec 8, 2015 at 9:42 AM, Dag-Erling Smørgrav wrote:
> Warner Losh writes:
> > Dag-Erling Smørgrav writes:
> > > Here are a few of our options for implementing FALLOC_FL_PUNCH_HOLE:
> > >
> > > a) create a filesystem-level hole in the disk image;
> > > b) perform a), then issue a BIO_DELETE for the blocks that were
> > >    released;
> > > c) perform a) or b), then zero the overspill if the requested range
> > >    is unaligned;
> > > d) zero the entire range;
> > > e) perform d) followed by either a) or b);
> > > f) nothing at all.
> >
> > I don't think f is an option.
> > Unless it is OK to have random contents after creating a file and
> > seeking some ways into it and writing a byte. When you punch a hole
> > in the file, you should get the same semantics as if you'd written up
> > to just before the hole originally, then skipped to the end of the
> > punched range and written the rest of the file.
>
> I didn't realize there was a spec, so I didn't know what the intended
> semantics were.

I am assuming the semantics are the same as those of the Linux operation
of the same name.

> > You are correct, though, that the decision to issue a BIO_DELETE is
> > between the filesystem and the storage device. This makes a-e
> > possible implementations, but some are stupider than others (which
> > ones depend on the situation).
>
> Each of them except f) is the optimal solution for at least one of the
> 36 cases I outlined, or 18 if you ignore the zvol and device points on
> the first axis.

True.

> > > Discuss the advantages and drawbacks of each option I listed above
> > > for each of the 36 points in the space defined by the following
> > > axes:
> > > [...]
> > > If you think the answer is the same in all cases, you are deluded.
> >
> > That's why these decisions are left to the stack.
>
> Define "stack". Do you mean the entire food chain from the hardware to
> the POSIX filesystem API? By design, no element in the stack has any
> knowledge of any other element, beyond the names and dimensions of its
> immediate consumers and suppliers (I find "producer" ambiguous).

Also true. The stack makes the best choice it can at each level and
passes the rest on down. There are administrative overrides, however,
for the default actions (like sending down a BIO_DELETE).

> > The only semantic that is required by the punch hole operation is
> > that the filesystem return 0's on reads to that range. What the
> > filesystem does to ensure this is up to the filesystem.
> That's easy to say, but each option has advantages and disadvantages
> depending on information which is not necessarily available where it
> is needed. A filesystem-level hole results in fragmentation, which can
> have a huge performance impact on electromechanical storage but is
> negligible on solid-state storage.

It may result in fragmentation. UFS has techniques to cope, however. For
ZFS it doesn't matter: ZFS is log-structured, so writing zeroes would
also produce a non-local copy.

> But the filesystem does not know whether the underlying storage is
> electromechanical or solid-state, nor does it know whether the user
> cares much about seek times (unless we introduce the heuristic "avoid
> creating holes unless the file already has them, in which case the
> userland probably does not care").

Actually, the filesystem does know, or at least has some knowledge of
what is supported and what isn't. BIO_DELETE support is a strong
indicator of flash or some other log-structured device.

> Then again, either the filesystem or the underlying storage *or both*
> may have copy-on-write semantics, in which case zeroing is worse than
> creating a hole.

It may have that implementation. That's what administrative controls are
for.

> BTW, writing zeroes to NAND flash does not require erasing the block.
> I don't know whether SSDs take advantage of that to avoid
> unnecessarily reallocating or erasing a block, nor whether they
> automatically release and erase blocks that end up being completely
> zeroed.

It turns out this hasn't been true for four or five generations of NAND.
You cannot program the pages in a block twice: that's not a supported
operation, and it cannot possibly work with multi-level cell technology.
Flash drives never exploit this fact because it only works with first-
or second-generation single-bit-per-cell technology. These days, NAND
must be written from first page to last.
It is also strongly advised that the dwell time in the erased state be
as short as possible, that the writing be done as quickly as possible,
and that active countermeasures be taken when you can't write a block
within a maximum amount of time (such as garbage-collecting data forward
to meet the timing constraints). You can violate these guidelines from
time to time, but then storage retention cannot be guaranteed to meet
vendor specs. The firmware in recent generations of planar NAND makes
certain assumptions to maximize the life of the data, and if you violate
them you'll get bit errors: NAND cells are best thought of as tiny
capacitors which lose charge over time, so giving the data time to decay
before the neighboring cells are programmed (which disturbs the earlier
cells in a predictable way) corrupts it. And the new 3-D NAND brings a
whole new set of constraints, different from those that plagued the old
planar arrangements.

SSDs have a log structure under the covers anyway. As you write data, it
gets laid down into a log. When you write new data over old, a note is
made that the old data is obsolete, but it isn't erased until either the
entire erase block that held it is invalidated, or the data is garbage-
collected forward to a new page of NAND in a new erase block. A
BIO_DELETE turns into some flavor of TRIM or DELETE operation under the
covers. The drive's FTL (Flash Translation Layer) then uses this to
invalidate blocks in its internal tables. Maybe this will actually erase
the data, maybe not, but that operation is decoupled from the TRIM
completing. The invalidation occurs both in some kind of 'used' mask and
in the LBA-to-physical mapping, so that when requests for the LBA come
in later, all 0's can be returned (on newer drives complying with the
latest standards).

SD cards are a whole other domain of hacks and optimizations.
Since I don't have direct experience making them or writing their
firmware, I can't comment on what those are, other than to say that
telling the drive that contents are no longer needed, or that you are
about to write multiple blocks, can help it out a lot.

As technologies evolve, we should take advantage of them. The old
assumptions, like "writing 0's to NAND can be done without erasing", may
prove to be unwise. This is why we let the drive tell us about its
support for TRIM-like technologies and let the filesystem decide when
the best time to do that operation is. Sometimes these new technologies
can map to existing facilities; other times they can't.

Warner