Date:      Tue, 8 Dec 2015 08:43:33 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Fwd: DELETE support in the VOP_STRATEGY(9)?
Message-ID:  <CANCZdfqHoduhdCss0b6=UsBPAxfRZv4hF8vyuUVLBdP5gYUduQ@mail.gmail.com>
In-Reply-To: <CANCZdfo=NfKy51%2B64-F_v%2BDh2wkrFYP4gXe=X9RWSSao49gO9g@mail.gmail.com>
References:  <CAH7qZftSVAYPmxNCQy=VVRj79AW7z9ade-0iogv2COfo2x%2Ba2Q@mail.gmail.com> <201512052002.tB5K2ZEA026540@chez.mckusick.com> <CAH7qZfs6ksE%2BQTMFFLYxY0PNE4hzn=D5skzQ91=gGK2xvndkfw@mail.gmail.com> <86poyhqsdh.fsf@desk.des.no> <CAH7qZftVj9m_yob=AbAQA7fh8yG-VLgM7H0skW3eX_S%2Bv75E-g@mail.gmail.com> <86fuzdqjwn.fsf@desk.des.no> <CANCZdfo=NfKy51%2B64-F_v%2BDh2wkrFYP4gXe=X9RWSSao49gO9g@mail.gmail.com>

[ forgot to cc hackers ]
---------- Forwarded message ----------
From: Warner Losh <imp@bsdimp.com>
Date: Tue, Dec 8, 2015 at 8:41 AM
Subject: Re: DELETE support in the VOP_STRATEGY(9)?
To: Dag-Erling Smørgrav <des@des.no>


On Tue, Dec 8, 2015 at 4:06 AM, Dag-Erling Smørgrav <des@des.no> wrote:

> Maxim Sobolev <sobomax@FreeBSD.org> writes:
> > Dag-Erling Smørgrav <des@des.no> writes:
> > > 1) why did you take this off the list?
> > There was a complain from list admin about this being off-topic.
>
> Yes, and Eitan moved the discussion to hackers@.  It should have stayed
> there.
>
> > > 2) why did you even bother to cc: me if you were going to competely
> > > ignore everything I said anyway?
> > I did not really ignore it, it just that I did not have much to reply
> > at that point.  [...] Basically I don't think your concerns wrt DELETE
> > reliability/gurantees have much to do with this particular feature.
> > The reason being that BIO_DELETE essentially tells the storage layer
> > that whichever code "owns" the block in question (e.g.  ZFS or UFS)
> > has moved it into the free pool and will NEVER ever want to read its
> > value back again (until it's written into again).
>
> No, it means that the contents of that block are no longer important and
> that the lower layers *may* reclaim it.  It does not mean that nobody
> will ever try to read the block, nor does it guarantee that the block
> will actually be reclaimed or zeroed.  We cannot rely on the lower
> layers to ensure that reading from a previously deleted block never
> returns data that may have belonged to a different file.
>
> BTW, I've encountered CF cards (including the SanDisk card in my home
> router) that freeze if issued a TRIM command.  Furthermore, many CF, MMC
> and SD cards, especially those marketed for use in digital cameras,
> perform wear leveling "automagically" based on their own understanding
> of the filesystem layout, and will therefore work poorly with anything
> other than FAT (Kingston call it "optimized recording performance" in
> their marketing literature).


While these issues are relevant to BIO_DELETE, they aren't so relevant to
punching a hole in a file in a filesystem. The filesystem is the one that
gets to decide whether and when to issue a BIO_DELETE (just as the lower
layers get to decide what to do with it). A properly written filesystem
will not issue a BIO_DELETE and then assume it will read back 0's. The
whole point of the punch hole is to allow the filesystem to return the
blocks to its free store. If that also happens to cause a BIO_DELETE to go
down, that's no different than deleting the file and having a BIO_DELETE
go down for the resulting freed blocks.


>
> > Technically speaking on 100% correctly working os/hardware attempt to
> > read block after it's been successfully BIO_DELETE'd could produce
> > exception of some sort without any ill effects.
>
> If that were the case, it would never be safe to do
>
> # dd if=/dev/da0 of=/dev/da1 bs=4096 conv=sparse
>
> which I'm sure you'll agree is not acceptable.
>

BIO_DELETE doesn't invalidate the LBA range, just its contents. The LBAs
are still required to be readable afterwards. This matches how the various
standards dictate what the contents will be after whatever BIO_DELETE
turns into. Maxim is simply wrong about this point, for this and many
other reasons.


> > [...] in this particular case of VOP_ALLOCATE(FALLOC_FL_PUNCH_HOLE), a
> > filesystem in question is responsible for making sure the range that
> > has been punched through reads 0, whether by making real logical hole
> > in the file and/or by padding it with zeroes as needed.
>
> Is it really?
>
> Here are a few of our options for implementing FALLOC_FL_PUNCH_HOLE:
>
> a) create a filesystem-level hole in the disk image;
> b) perform a), then issue a BIO_DELETE for the blocks that were
>    released;
> c) perform a) or b), then zero the overspill if the requested range is
>    unaligned;
> d) zero the entire range;
> e) perform d) followed by either a) or b);
> f) nothing at all.
>

I don't think f is an option, unless it is OK to have random contents after
creating a file, seeking some way into it, and writing a byte. When you
punch a hole in the file, you should get the same semantics as if you'd
originally written up to just before the hole, then skipped to the end of
the punched range and written the rest of the file. In Unix, the contents
of the skipped range are well defined to be 0's. It is undefined how those
zeros are backed by the filesystem, or how much storage they take up. A
punch hole operation is a stronger statement about the contents after the
fact than a BIO_DELETE operation.
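
As a small illustration of the Unix semantics described above, here is a
sketch in Python (not anything from the FreeBSD tree; the function name is
made up for the example). It seeks past the end of a fresh file, writes one
byte, and reads back the range that was skipped:

```python
import tempfile


def read_gap_after_seek(gap: int) -> bytes:
    """Write one byte at offset `gap` in a fresh file; return the bytes before it."""
    with tempfile.TemporaryFile() as f:
        f.seek(gap)         # skip ahead without writing anything
        f.write(b"x")       # the file is now gap + 1 bytes long
        f.seek(0)
        return f.read(gap)  # the skipped range, which was never written


gap_bytes = read_gap_after_seek(8192)
print(gap_bytes == bytes(8192))
```

The skipped range reads back as zeros. Whether the filesystem backs it with
a real hole or with allocated blocks of zeros is not visible through this
interface, which is the point: only the contents are defined.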

You are correct, though, that the decision to issue a BIO_DELETE is between
the filesystem and the storage device. This makes a-e possible
implementations, but some are stupider than others (which ones depends on
the situation). Based on the characteristics of both, the filesystem may
return the blocks to its free store w/o doing anything further (if it
frees them up at all). It could issue a BIO_DELETE on those blocks, if
that is its policy. The device driver for the lower layers may return an
error on the BIO_DELETE request or execute it faithfully. The filesystem
cannot rely on BIO_DELETE being a faster way to write zeros to the LBAs,
though. If it wants zeros, it has to write zeros. FreeBSD doesn't provide
a way for the filesystem to know that the device implements BIO_DELETE as
a guaranteed range of zeros after the operation completes, even if the
device reports that information to FreeBSD today as part of its IDENTIFY
or INQUIRY data.
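
To make the division of labor concrete, here is a toy model in Python. It
is purely illustrative: ToyDevice, ToyFile and the delete modes are
invented for this sketch and do not correspond to any FreeBSD interface.
The device treats a delete as a hint that it may ignore, honor by zeroing,
or reject, and the LBA stays readable in every case. The file returns
zeros for a punched block from its own block map, never from what the
device happens to hold.

```python
class ToyDevice:
    """Block device whose delete() is only a hint (hypothetical model)."""

    def __init__(self, nblocks, block_size=512, delete_mode="ignore"):
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(nblocks)]
        self.delete_mode = delete_mode  # "ignore", "zero", or "error"

    def write(self, lba, data):
        assert len(data) == self.block_size
        self.blocks[lba] = bytes(data)

    def read(self, lba):
        # The LBA stays readable after a delete; only its contents are unspecified.
        return self.blocks[lba]

    def delete(self, lba):
        if self.delete_mode == "error":
            raise OSError("delete not supported")
        if self.delete_mode == "zero":
            self.blocks[lba] = bytes(self.block_size)
        # "ignore": the stale data simply stays where it is


class ToyFile:
    """A single file mapping logical block numbers to device LBAs."""

    def __init__(self, dev):
        self.dev = dev
        self.block_map = {}  # logical block -> LBA; a missing entry is a hole
        self.free = list(range(len(dev.blocks)))

    def write_block(self, n, data):
        if n not in self.block_map:
            self.block_map[n] = self.free.pop(0)
        self.dev.write(self.block_map[n], data)

    def read_block(self, n):
        lba = self.block_map.get(n)
        if lba is None:
            # Zeros come from filesystem metadata, not from the device.
            return bytes(self.dev.block_size)
        return self.dev.read(lba)

    def punch_hole(self, n, issue_delete=True):
        lba = self.block_map.pop(n, None)
        if lba is None:
            return  # already a hole
        self.free.append(lba)  # option a): return the block to the free store
        if issue_delete:       # option b): additionally hint the device
            try:
                self.dev.delete(lba)
            except OSError:
                pass  # the hint failed; correctness never depended on it


def punch_and_read(mode):
    """Return (what the file reads, what the raw LBA holds) after a punch."""
    dev = ToyDevice(4, delete_mode=mode)
    f = ToyFile(dev)
    f.write_block(0, b"\xaa" * 512)
    f.punch_hole(0)
    return f.read_block(0), dev.read(0)
```

In all three modes the file reads zeros for the punched block. The raw LBA
still holds the old data under "ignore" and "error", which is why a
filesystem that wants zeros on the media has to write them itself.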


> Now, consider the case of the guest OS in a VM issuing TRIM commands to
> the emulated storage controller, which the hypervisor translates into a
> FALLOC_FL_PUNCH_HOLE request for the corresponding range in the disk
> image.  Discuss the advantages and drawbacks of each option I listed
> above for each of the 36 points in the space defined by the following
> axes:
>
> - The disk image is:
>   - a preallocated file on a filesystem (or an md(4) device backed by a
>     preallocated file)
>   - a dynamically allocated file on a filesystem (or an md(4) device
>     backed by an unallocated file)
>   - a zvol
>   - a device
> - The underlying storage's preferred block size is:
>   - small (e.g. 4 kB sectors on an AF drive)
>   - medium (e.g. 64 kB stripes on a RAID)
>   - large (e.g. 1 MB erase blocks on an SSD)
> - The physical storage is:
>   - volatile
>   - solid-state
>   - electromechanical
>
> If you think the answer is the same in all cases, you are deluded.


That's why these decisions are left to the stack. The only semantic that
is required by the punch hole operation is that the filesystem return 0's
on reads to that range. What the filesystem does to ensure this is up to
the filesystem.

As for md translating a BIO_DELETE into a PUNCH_HOLE, that's
an acceptable thing for it to do (assuming we have a punch hole
API). It is a stronger guarantee than is required by the BIO_DELETE
API. However, PUNCH_HOLE should be implemented such that it
is no slower than writes of zeros, and may be faster. Since md is
doing writes of zeros today, this sounds like a possible win for
those filesystems that implement the punch hole operation
more efficiently than writing a block of zeros. And it may also allow
the storage stack the chance to do an optimization that isn't
present today.
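
A rough sketch of that policy in Python, under stated assumptions:
discard_range and the punch_hole callable are hypothetical names for this
example, and the callable merely stands in for whatever punch hole
primitive might exist. The backing store tries the punch first and falls
back to writing zeros, so the caller sees zeros in the range either way.

```python
import io


def discard_range(backing, offset, length, punch_hole=None, chunk=65536):
    """Make [offset, offset + length) of `backing` read as zeros.

    `punch_hole` is a hypothetical callable(backing, offset, length). If it
    is absent or raises OSError, fall back to writing zeros.
    """
    if punch_hole is not None:
        try:
            punch_hole(backing, offset, length)
            return "punched"
        except OSError:
            pass  # not supported here; take the slow path below
    backing.seek(offset)
    zeros = bytes(chunk)
    remaining = length
    while remaining > 0:
        n = min(remaining, chunk)
        backing.write(zeros[:n])
        remaining -= n
    return "zeroed"


image = io.BytesIO(b"\xff" * 4096)
how = discard_range(image, 1024, 2048)
print(how, image.getvalue()[1024:3072] == bytes(2048))
```

The fallback is the worst case, so a punch hole that is no slower than
writing zeros can only help.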

Warner


