Date:      Tue, 08 Dec 2015 12:06:32 +0100
From:      Dag-Erling Smørgrav <des@des.no>
To:        Maxim Sobolev <sobomax@FreeBSD.org>
Cc:        Kirk McKusick <mckusick@mckusick.com>, Pawel Jakub Dawidek <pjd@freebsd.org>, Warner Losh <imp@bsdimp.com>, freebsd-hackers@freebsd.org
Subject:   Re: DELETE support in the VOP_STRATEGY(9)?
Message-ID:  <86fuzdqjwn.fsf@desk.des.no>
In-Reply-To: <CAH7qZftVj9m_yob=AbAQA7fh8yG-VLgM7H0skW3eX_S%2Bv75E-g@mail.gmail.com> (Maxim Sobolev's message of "Tue, 8 Dec 2015 00:53:48 -0800")
References:  <CAH7qZftSVAYPmxNCQy=VVRj79AW7z9ade-0iogv2COfo2x%2Ba2Q@mail.gmail.com> <201512052002.tB5K2ZEA026540@chez.mckusick.com> <CAH7qZfs6ksE%2BQTMFFLYxY0PNE4hzn=D5skzQ91=gGK2xvndkfw@mail.gmail.com> <86poyhqsdh.fsf@desk.des.no> <CAH7qZftVj9m_yob=AbAQA7fh8yG-VLgM7H0skW3eX_S%2Bv75E-g@mail.gmail.com>

Maxim Sobolev <sobomax@FreeBSD.org> writes:
> Dag-Erling Smørgrav <des@des.no> writes:
> > 1) why did you take this off the list?
> There was a complaint from the list admin about this being off-topic.

Yes, and Eitan moved the discussion to hackers@.  It should have stayed
there.

> > 2) why did you even bother to cc: me if you were going to completely
> > ignore everything I said anyway?
> I did not really ignore it, it's just that I did not have much to reply
> at that point.  [...] Basically I don't think your concerns wrt DELETE
> reliability/guarantees have much to do with this particular feature.
> The reason being that BIO_DELETE essentially tells the storage layer
> that whichever code "owns" the block in question (e.g.  ZFS or UFS)
> has moved it into the free pool and will NEVER ever want to read its
> value back again (until it's written into again).

No, it means that the contents of that block are no longer important and
that the lower layers *may* reclaim it.  It does not mean that nobody
will ever try to read the block, nor does it guarantee that the block
will actually be reclaimed or zeroed.  We cannot rely on the lower
layers to ensure that reading from a previously deleted block never
returns data that may have belonged to a different file.
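
To illustrate the distinction, here is a toy model (not FreeBSD code; the
ToyDisk class and its "reclaims" flag are hypothetical) of a device that
treats delete as a hint.  A read after a delete always succeeds, and what
it returns depends entirely on what the device chose to do:

```python
class ToyDisk:
    """Toy block store in which delete() is only a hint (hypothetical model)."""

    def __init__(self, nblocks, block_size=512, reclaims=True):
        self.block_size = block_size
        self.reclaims = reclaims  # whether this device acts on delete hints
        self.blocks = [bytes(block_size) for _ in range(nblocks)]

    def write(self, lba, data):
        if len(data) != self.block_size:
            raise ValueError("data must be exactly one block")
        self.blocks[lba] = bytes(data)

    def delete(self, lba):
        # Advisory: the device may reclaim the block or ignore the hint.
        if self.reclaims:
            self.blocks[lba] = bytes(self.block_size)

    def read(self, lba):
        # Reading a deleted block never fails; the content is whatever
        # the device happens to hold (zeros or stale data).
        return self.blocks[lba]
```

A caller that needs the range to read back as zeros has to write the
zeros itself; in this model it cannot infer them from a successful delete.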

BTW, I've encountered CF cards (including the SanDisk card in my home
router) that freeze if issued a TRIM command.  Furthermore, many CF, MMC
and SD cards, especially those marketed for use in digital cameras,
perform wear leveling "automagically" based on their own understanding
of the filesystem layout, and will therefore work poorly with anything
other than FAT (Kingston call it "optimized recording performance" in
their marketing literature).

> Technically speaking, on 100% correctly working OS/hardware, an attempt
> to read a block after it's been successfully BIO_DELETE'd could produce
> an exception of some sort without any ill effects.

If that were the case, it would never be safe to do

# dd if=/dev/da0 of=/dev/da1 bs=4096 conv=sparse

which I'm sure you'll agree is not acceptable.
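
To spell out why: a block-level copy reads every block of the source,
including blocks that the filesystem on it has freed, because it knows
nothing about block ownership.  A minimal sketch of the copy loop,
assuming ordinary seekable binary file objects (sparse_copy is an
illustrative name, not dd's actual implementation):

```python
import io


def sparse_copy(src, dst, block_size=4096):
    """Copy src to dst, seeking over all-zero blocks instead of writing them.

    Returns the number of blocks that were skipped.
    """
    zero = bytes(block_size)
    skipped = 0
    while True:
        buf = src.read(block_size)  # every block is read, free or not
        if not buf:
            break
        if buf == zero[:len(buf)]:
            # Leave a hole in the output rather than writing zeros.
            dst.seek(len(buf), io.SEEK_CUR)
            skipped += 1
        else:
            dst.write(buf)
    # Extend the output if the copy ended on a skipped block.
    dst.truncate(dst.tell())
    return skipped
```

If src.read() could fail on a deleted block, the loop would abort
partway through the device.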

> [...] in this particular case of VOP_ALLOCATE(FALLOC_FL_PUNCH_HOLE), the
> filesystem in question is responsible for making sure the range that
> has been punched through reads 0, whether by making a real logical hole
> in the file and/or by padding it with zeroes as needed.

Is it really?

Here are a few of our options for implementing FALLOC_FL_PUNCH_HOLE:

a) create a filesystem-level hole in the disk image;
b) perform a), then issue a BIO_DELETE for the blocks that were
   released;
c) perform a) or b), then zero the overspill if the requested range is
   unaligned;
d) zero the entire range;
e) perform d) followed by either a) or b);
f) nothing at all.
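
For option c), the "overspill" is whatever part of the requested range
does not cover a whole block.  A rough sketch of the arithmetic, assuming
byte offsets and a single fixed block size (split_punch_range is a
hypothetical helper, not an existing kernel or libc function):

```python
def split_punch_range(offset, length, block_size):
    """Split [offset, offset + length) into (head, middle, tail).

    middle is the block-aligned part that can be released as a hole;
    head and tail are the unaligned overspill that must be zeroed
    explicitly.  Each part is a (start, end) tuple or None.
    """
    end = offset + length
    first = -(-offset // block_size) * block_size  # round offset up
    last = (end // block_size) * block_size        # round end down
    if first >= last:
        # No whole block lies inside the range: all of it is overspill.
        return (offset, end), None, None
    head = (offset, first) if offset < first else None
    tail = (last, end) if last < end else None
    return head, (first, last), tail
```

For example, split_punch_range(1000, 10000, 4096) gives a head of
(1000, 4096), a middle of (4096, 8192) and a tail of (8192, 11000).  With
a 1 MB block size the same request has no middle at all, which is one
reason the preferred block size matters.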

Now, consider the case of the guest OS in a VM issuing TRIM commands to
the emulated storage controller, which the hypervisor translates into a
FALLOC_FL_PUNCH_HOLE request for the corresponding range in the disk
image.  Discuss the advantages and drawbacks of each option I listed
above for each of the 36 points in the space defined by the following
axes:

- The disk image is:
  - a preallocated file on a filesystem (or an md(4) device backed by a
    preallocated file)
  - a dynamically allocated file on a filesystem (or an md(4) device
    backed by an unallocated file)
  - a zvol
  - a device
- The underlying storage's preferred block size is:
  - small (e.g. 4 kB sectors on an AF drive)
  - medium (e.g. 64 kB stripes on a RAID)
  - large (e.g. 1 MB erase blocks on an SSD)
- The physical storage is:
  - volatile
  - solid-state
  - electromechanical
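
For anyone who wants to tabulate the answers, the space can be
enumerated mechanically.  The labels below are shorthand for the axes
listed above and have no other significance:

```python
from itertools import product

IMAGE = ["preallocated file", "dynamically allocated file", "zvol", "device"]
BLOCK = ["small (4 kB)", "medium (64 kB)", "large (1 MB)"]
MEDIA = ["volatile", "solid-state", "electromechanical"]
OPTIONS = ["a", "b", "c", "d", "e", "f"]

# 4 x 3 x 3 = 36 configurations, each to be weighed against 6 options.
points = list(product(IMAGE, BLOCK, MEDIA))
cases = list(product(OPTIONS, points))
print(len(points), len(cases))  # prints "36 216"
```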

If you think the answer is the same in all cases, you are deluded.

DES
--
Dag-Erling Smørgrav - des@des.no


