Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 28 Sep 2020 01:43:18 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Chris Stephan <chris.stephan@live.com>, Alan Somers <asomers@freebsd.org>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: RFC: copy_file_range(3)
Message-ID:  <YTBPR01MB396664A1DAE4A2742C22385ADD350@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <YTBPR01MB3966320580FC5D659F0911D1DD370@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
References:  <CAOtMX2iFZZpoj%2Bap21rrju4hJoip6ZoyxEiCB8852NeH7DAN0Q@mail.gmail.com> <YTBPR01MB39666188FC89399B0D632FE8DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gMYdcx0CUC1Mky3ETFr1JkBbYzn17i11axSW=HRTL7OA@mail.gmail.com> <YTBPR01MB3966C1D4D10BE836B37955F5DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <CAOtMX2jHMRD0Hno03f2dqjJToR152u8d-_40GM_%2BBvNPkN_smA@mail.gmail.com>, <YTBPR01MB396622BAC24ECA15F5421678DD3A0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <SN6PR02MB5487E40F82CC231B5E63A7E89B370@SN6PR02MB5487.namprd02.prod.outlook.com>, <YTBPR01MB3966320580FC5D659F0911D1DD370@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>

next in thread | previous in thread | raw e-mail | index | archive | help
I have implemented the 1second timeout and put it up
on phabricator as D26571.
If anyone wishes to review, please do so.
There are also D26569 and D26570, which are a couple
of fixes I came up with during testing.

These patches do not address the "should it be Linux
compatible", which is being discussed on another thread.

Thanks for the suggestion w.r.t using a time limit, rick

________________________________________
From: owner-freebsd-hackers@freebsd.org <owner-freebsd-hackers@freebsd.org>=
 on behalf of Rick Macklem <rmacklem@uoguelph.ca>
Sent: Saturday, September 26, 2020 7:22 PM
To: Chris Stephan; Alan Somers
Cc: FreeBSD Hackers
Subject: Re: RFC: copy_file_range(3)

Chris Stephan wrote:
> New to the list and Late to the discussion. I am thinking increasing the =
Len could
> cause possible degradation of performance when used on slower or legacy
> systems. On the other hand just picking a len and cutting it off at a har=
d max
> seems crude even with a tunable. Admittedly my naive opinion in this matt=
er
> ponders, could there be a sysctl tunable that just sets an estimate on ti=
meframe
> instead of size. As you said 100M is roughly a sec on modem hardware. IOP=
S are
> already tracked. The inverse of the avg IOPS for the filesystem in questi=
on could
> be used against this tunable to derive the estimated size limit of the ne=
xt
> read/write. This would allow the max len within the syscall to both honor=
 a
> timeframe before a signal would be handled and maximize efficiency across=
 a
> large breadth of systems varying in performance. I=92m sure it is more co=
mplicated
> than I suggest... just tossing my 2c in.
Yes. Using time will work for the generic copy case and I think that's a go=
od idea.
Then we can leave the file system specific cases up to the implementors.
(For NFSv4.2, once you send the RPC to the server, the client has no contro=
l over
 how long it takes to reply. The current sysctl that sets a size is still a=
bout all the
 NFSv4.2 code can do.)

Thanks for the suggestion, rick

Chris

Sent from FreeBSD
________________________________
From: owner-freebsd-hackers@freebsd.org <owner-freebsd-hackers@freebsd.org>=
 on behalf of Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, September 20, 2020 11:28:21 PM
To: Alan Somers <asomers@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: RFC: copy_file_range(3)

[I have only indented your most recent comments]
Alan Somers wrote:
On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem <rmacklem@uoguelph.ca<mailto:r=
macklem@uoguelph.ca>> wrote:
Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem <rmacklem@uoguelph.ca<mailto:=
rmacklem@uoguelph.ca><mailto:rmacklem@uoguelph.ca<mailto:rmacklem@uoguelph.=
ca>>> wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read bas=
ed
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
>>>install(8).  Those three could benefit from sharing a userland wrapper t=
hat
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc?  If so, what should it be called, =
and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch =
for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics=
,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall no=
n-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>>I don't think the two questions are equivalent.  I think that copy_file_r=
ange(2) >>ought to work on character devices.  Separately, even it does, I =
think a userland >>wrapper would still be useful.  It would still be able t=
o handle sparse files more >>efficiently than the kernel-based vn_generic_c=
opy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also look=
s
for blocks of all zero bytes for file systems that do not support SEEK_DATA=
/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
the output file.
--> The only cases that I am aware of where the holes are not retained are:
     - When the min holesize for the output file is larger than that of the
       input file.
     - When the hole straddles the byte range specified for the syscall.
       (Or when the hole straddles two copy_file_range(2) syscalls, if you
        prefer.)

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal
like <ctrl>C to be handled), the most efficient usage is to specify
a "len" argument equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
       guaranteed to copy everything, given the Linux semantics definition
       of it (an NFSv4.2 server can simply choose to copy less, for example=
).
       --> This allows the kernel to use whatever block size works efficien=
tly
             and does not require an allocation of a large userspace buffer=
 for
             the date, nor that the data be copied to/from userspace.

The problem with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed unt=
il
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a =
loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
      then the "one gulp" wouldn't end until storage is exhausted on the
      output file(s) device and <crtl>C wouldn't stop it (since it is one b=
ig
      syscall).
     --> As such, I suggested that, if the syscall is extended to allow VCH=
R,
           that the "len" argument be clipped at "K Mbytes" for that case t=
o
           avoid filling the storage device before being able to <ctrl>C ou=
t
           of it, for this case.
I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper dong is handling the VCHR case (if the syscall remain=
s
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?

In summary, it's mostly whether VCHR gets handled by the syscall or a
wrapper?

> 1) In order to quickly respond to a signal, a program must use a modest l=
en with > copy_file_range
Does this matter? Or put another way, is a 1-2sec delay in response to <crt=
l>C
an issue for "cp".
When kib@ reviewed the syscall, he did not see the delay in signal handling
a significant problem, noting that it is no different than a large process =
core
dumping.

> 2) If a hole is larger than len, that will cause vn_generic_copy_file_ran=
ge to
> truncate the output file to the middle of the hole.  Then, in the next in=
vocation,
> truncate it again to a larger size.
> 3) The result is a file that is not as sparse as the original.
>
> For example, on UFS:
> $ truncate -s 1g sparsefile
> $ cp sparsefile sparsefile2
> $ du -sh sparsefile*
>  96K sparsefile
> 32M sparsefile2
If you care about maintaining sparseness, a "len" of 100Mbytes or more woul=
d
be a reasonable choice. Since "cp" has never maintained sparseness, I didn'=
t
suggest such a size when I reviewed your patch for "cp".
--> I/O subsystem performance varies widely, but I think 100Mbytes will lim=
it
      the delay in signal handling to about 1sec. Isn't that quick enough?

> My idea for a userland wrapper would solve this problem by using
> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_ra=
nge for
> everything else with a modest len.  Alternatively, we could eliminate the=
 need for
> the wrapper by enabling copy_file_range for every file system, and making
> vn_generic_copy_file_range interruptible, so copy_file_range can be calle=
d with
> large len without penalizing signal handling performance.
The problem with doing this is it largely defeats the purpose of copy_file_=
range().
1 - What about file systems that do not support SEEK_DATA/SEEK_HOLE.
     (All NFS mounts except NFSv4.2 ones against servers that support the
      NFSv4.2 Seek operation are in this category.)
2 - For NFSv4.2 with servers that support Seek, the copy of an entire file
     can be done via a few (or only one) RPC if you make "len" large and
     don't use Seek.
     If you combine using Seek with len =3D=3D2Mbytes, then you do a lot mo=
re RPCs
     with associated overheads and RPC RTT delays. You still avoid moving a=
ll
     the data across the wire, but you do lose a lot of the performance adv=
antage.

I could have made copy_file_range(2) a lot simpler if the generic code didn=
't
try and maintain holes, but I wanted it to work well for file systems that =
did
not support SEEK_DATA/SEEK_HOLE.

I'd suggest you try patching "cp" to use a 100Mbyte "len" for copy_file_ran=
ge()
and test that.
You should fine the sparseness is mostly maintained and that you can <crtl>=
C
out of a large file copy without undue delay.
Then try it over NFS mounts (both v4.2 and v3) for the same large sparse fi=
le.

You can also code up a patched "cp" using SEEK_DATA/SEEK_HOLE and see
how they compare.

rick


-Alan
_______________________________________________
freebsd-hackers@freebsd.org mailing list
https://nam11.safelinks.protection.outlook.com/?url=3Dhttps%3A%2F%2Flists.f=
reebsd.org%2Fmailman%2Flistinfo%2Ffreebsd-hackers&amp;data=3D02%7C01%7C%7C2=
7ea5166cf99415d3bba08d85de6d259%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%=
7C637362593231297450&amp;sdata=3DSfm9MxjQ6MVHgG%2Fw3sghn0hebSFjZo%2FSaUyZ9H=
Pyws8%3D&amp;reserved=3D0
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
_______________________________________________
freebsd-hackers@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YTBPR01MB396664A1DAE4A2742C22385ADD350>