Date:      Sun, 20 Sep 2020 19:40:16 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: RFC: copy_file_range(3)
Message-ID:  <CAOtMX2jHMRD0Hno03f2dqjJToR152u8d-_40GM_%2BBvNPkN_smA@mail.gmail.com>
In-Reply-To: <YTBPR01MB3966C1D4D10BE836B37955F5DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
References:  <CAOtMX2iFZZpoj%2Bap21rrju4hJoip6ZoyxEiCB8852NeH7DAN0Q@mail.gmail.com> <YTBPR01MB39666188FC89399B0D632FE8DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gMYdcx0CUC1Mky3ETFr1JkBbYzn17i11axSW=HRTL7OA@mail.gmail.com> <YTBPR01MB3966C1D4D10BE836B37955F5DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>

On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:

> Alan Somers wrote:
> >On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> >>Alan Somers wrote:
> >>>copy_file_range(2) is nifty, but it has a few sharp edges:
> >>>1) Certain file systems don't support it, necessitating a
> >>>write/read-based fallback
> >>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
> >>>3) It's slightly tricky to both efficiently deal with holes and also
> >>>promptly respond to signals
> >>>
> >>>These problems aren't terribly hard, but it seems to me like most
> >>>applications that use copy_file_range would share the exact same
> >>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
> >>>install(8).  Those three could benefit from sharing a userland wrapper
> >>>that handles the above problems.
> >>>
> >>>Should we add such a wrapper to libc?  If so, what should it be called,
> >>>and should it be public or just private to /usr/src?
> >>There has been a discussion on src-committers which I suggested should
> >>be taken to a public mailing list.
> >>
> >>The basic question is...
> >>Whether or not the copy_file_range(2) syscall should be compatible with
> >>the Linux one.
> >>When I did the syscall, I tried to make it Linux-compatible, arguing that
> >>Linux is now a de-facto standard.
> >>The Linux syscall only works on regular files, which is why Alan's patch
> >>for cp required a "fallback to the old way" for VCHR files like /dev/null.
> >>
> >>He is considering a wrapper in libc to provide FreeBSD-specific
> >>semantics, which I have no problem with, so long as the naming and man
> >>page make it clear that it is not compatible with the Linux syscall.
> >>(Personally, I'd prefer a wrapper in libc to making the actual syscall
> >>non-Linux compatible, but that is just mho.)
> >>
> >>Hopefully this helps clarify what Alan is asking, rick
> >>
> >>I don't think the two questions are equivalent.  I think that
> >>copy_file_range(2) ought to work on character devices.  Separately, even
> >>if it does, I think a userland wrapper would still be useful.  It would
> >>still be able to handle sparse files more efficiently than the
> >>kernel-based vn_generic_copy_file_range.
> I saw this also stated in your #2 above, but wonder why you think a wrapper
> would handle holes more efficiently.
> vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
> just like a wrapper would and retains them as far as possible. It also
> looks for blocks of all zero bytes for file systems that do not support
> SEEK_DATA/SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for
> these in the output file.
> --> The only cases that I am aware of where the holes are not retained are:
>      - When the min holesize for the output file is larger than that of the
>        input file.
>      - When the hole straddles the byte range specified for the syscall.
>        (Or when the hole straddles two copy_file_range(2) syscalls, if you
>         prefer.)
>
> If you are copying the entire file and do not care how long the syscall
> takes (which also implies how long it will take for a termination signal
> like <ctrl>C to be handled), the most efficient usage is to specify
> a "len" argument equal to UINT64_MAX.
> --> This will usually copy the whole file in one gulp, although it is not
>        guaranteed to copy everything, given the Linux semantics definition
>        of it (an NFSv4.2 server can simply choose to copy less, for
>        example).
>        --> This allows the kernel to use whatever block size works
>              efficiently and does not require an allocation of a large
>              userspace buffer for the data, nor that the data be copied
>              to/from userspace.
>
> The problems with doing the whole file in one gulp are:
> - A large file can take quite a while and any signal won't be processed
>   until the gulp is done.
>   --> If you wrote a program that allocated a 100Gbyte buffer and then
>         copied a file using read(2)/write(2) with a size of 100Gbytes in a
>         loop, you'd end up with the same result.
> - As kib@ noted, if the input file never reports EOF (as /dev/zero does),
>       then the "one gulp" wouldn't end until storage is exhausted on the
>       output file(s) device and <ctrl>C wouldn't stop it (since it is one
>       big syscall).
>      --> As such, I suggested that, if the syscall is extended to allow
>            VCHR, the "len" argument be clipped at "K Mbytes" for that
>            case, to avoid filling the storage device before being able to
>            <ctrl>C out of it.
> I suppose the answer for #3 is...
> - smaller "len" allows for quicker response to signals
> but
> - smaller "len" results in less efficient use of the syscall.
>
> Your patch for "cp" seemed fine, but used a small "len" and, as such,
> made the use of copy_file_range(2) less efficient.
>
> All I see the wrapper doing is handling the VCHR case (if the syscall
> remains as it is now and returns EINVAL to be compatible with Linux) and
> making some rather arbitrary choice w.r.t. how big "len" should be.
> --> Choosing an appropriate "len" might better be left to the specific use
>       case, I think?
>
> In summary, it's mostly whether VCHR gets handled by the syscall or a wrapper?
>

1) In order to quickly respond to a signal, a program must use a modest len
with copy_file_range.
2) If a hole is larger than len, vn_generic_copy_file_range will truncate
the output file to the middle of the hole.  Then, in the next invocation,
it will truncate the file again to a larger size.
3) The result is a file that is not as sparse as the original.

For example, on UFS:
$ truncate -s 1g sparsefile
$ cp sparsefile sparsefile2
$ du -sh sparsefile*
 96K sparsefile
 32M sparsefile2

My idea for a userland wrapper would solve this problem by using
SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and using
copy_file_range for everything else with a modest len.  Alternatively, we
could eliminate the need for the wrapper by enabling copy_file_range for
every file system, and making vn_generic_copy_file_range interruptible, so
copy_file_range can be called with a large len without penalizing signal
handling performance.
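
To make the wrapper idea concrete, here is a rough sketch of what it might
look like.  Nothing here is a proposed API: sparse_copy, copy_path,
fallback_chunk and the copy_interrupted flag are hypothetical names.  It
assumes the output is a freshly created, empty regular file, that the caller's
signal handler sets the flag, and that a plain pread/pwrite loop is an
acceptable fallback where copy_file_range is unsupported.

```c
#define _GNU_SOURCE /* Linux: copy_file_range, SEEK_DATA; harmless on FreeBSD */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag; the caller's SIGINT handler would set it to 1. */
static volatile sig_atomic_t copy_interrupted;

/* read/write fallback for one chunk when copy_file_range is unsupported. */
ssize_t
fallback_chunk(int infd, off_t *inoff, int outfd, off_t *outoff, size_t len)
{
    char buf[65536];
    ssize_t nr, nw, done;

    if (len > sizeof(buf))
        len = sizeof(buf);
    if ((nr = pread(infd, buf, len, *inoff)) <= 0)
        return (nr);
    for (done = 0; done < nr; done += nw) {
        nw = pwrite(outfd, buf + done, (size_t)(nr - done), *outoff + done);
        if (nw <= 0)
            return (-1);
    }
    *inoff += nr;
    *outoff += nr;
    return (nr);
}

/*
 * Copy infd to outfd (assumed empty), skipping each hole with one lseek
 * and copying data in pieces of at most `chunk` bytes so that a pending
 * signal is noticed between pieces.  Returns 0, or -1 with errno set.
 */
int
sparse_copy(int infd, int outfd, size_t chunk)
{
    off_t end, pos, data, hole, inoff, outoff, left;
    size_t want;
    ssize_t n;

    assert(chunk > 0);
    /* Size the output first; holes are then left simply by not writing. */
    if ((end = lseek(infd, 0, SEEK_END)) == -1 || ftruncate(outfd, end) == -1)
        return (-1);
    for (pos = 0; pos < end; pos = hole) {
        if ((data = lseek(infd, pos, SEEK_DATA)) == -1) {
            if (errno == ENXIO)
                break;      /* only a trailing hole remains */
            data = pos;     /* no hole information: treat the rest as data */
            hole = end;
        } else if ((hole = lseek(infd, data, SEEK_HOLE)) == -1)
            hole = end;
        for (inoff = outoff = data; inoff < hole; ) {
            if (copy_interrupted) {
                errno = EINTR;
                return (-1);
            }
            left = hole - inoff;
            want = left < (off_t)chunk ? (size_t)left : chunk;
            n = copy_file_range(infd, &inoff, outfd, &outoff, want, 0);
            if (n == -1 &&
                (errno == EINVAL || errno == ENOSYS || errno == EXDEV))
                n = fallback_chunk(infd, &inoff, outfd, &outoff, want);
            if (n == -1)
                return (-1);
            if (n == 0)
                return (0); /* input shrank while we were copying */
        }
    }
    return (0);
}

/* Convenience entry point: open both paths and run sparse_copy(). */
int
copy_path(const char *src, const char *dst, size_t chunk)
{
    int infd, outfd, rc;

    if ((infd = open(src, O_RDONLY)) == -1)
        return (-1);
    if ((outfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644)) == -1) {
        close(infd);
        return (-1);
    }
    rc = sparse_copy(infd, outfd, chunk);
    close(outfd);
    close(infd);
    return (rc);
}
```

Because the hole boundaries come from lseek rather than from len, the chunk
size only affects how often the flag is checked, not how sparse the output is.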

-Alan


