Date:      Sat, 26 Sep 2020 23:22:30 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Chris Stephan <chris.stephan@live.com>, Alan Somers <asomers@freebsd.org>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: RFC: copy_file_range(3)
Message-ID:  <YTBPR01MB3966320580FC5D659F0911D1DD370@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <SN6PR02MB5487E40F82CC231B5E63A7E89B370@SN6PR02MB5487.namprd02.prod.outlook.com>
References:  <CAOtMX2iFZZpoj%2Bap21rrju4hJoip6ZoyxEiCB8852NeH7DAN0Q@mail.gmail.com> <YTBPR01MB39666188FC89399B0D632FE8DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gMYdcx0CUC1Mky3ETFr1JkBbYzn17i11axSW=HRTL7OA@mail.gmail.com> <YTBPR01MB3966C1D4D10BE836B37955F5DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <CAOtMX2jHMRD0Hno03f2dqjJToR152u8d-_40GM_%2BBvNPkN_smA@mail.gmail.com>, <YTBPR01MB396622BAC24ECA15F5421678DD3A0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <SN6PR02MB5487E40F82CC231B5E63A7E89B370@SN6PR02MB5487.namprd02.prod.outlook.com>

Chris Stephan wrote:
> New to the list and late to the discussion. I am thinking increasing the len could
> cause possible degradation of performance when used on slower or legacy
> systems. On the other hand just picking a len and cutting it off at a hard max
> seems crude even with a tunable. Admittedly my naive opinion in this matter
> ponders, could there be a sysctl tunable that just sets an estimate on timeframe
> instead of size. As you said 100M is roughly a sec on modern hardware. IOPS are
> already tracked. The inverse of the avg IOPS for the filesystem in question could
> be used against this tunable to derive the estimated size limit of the next
> read/write. This would allow the max len within the syscall to both honor a
> timeframe before a signal would be handled and maximize efficiency across a
> large breadth of systems varying in performance. I'm sure it is more complicated
> than I suggest... just tossing my 2c in.
Yes. Using time will work for the generic copy case and I think that's a good idea.
Then we can leave the file system specific cases up to the implementors.
(For NFSv4.2, once you send the RPC to the server, the client has no control over
 how long it takes to reply. The current sysctl that sets a size is still about all the
 NFSv4.2 code can do.)

Thanks for the suggestion, rick
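
The time-based limit suggested above can be sketched as a small helper. This
is only an illustration under stated assumptions: len_for_time_budget() and
its parameters are hypothetical names, not an existing sysctl or kernel
interface, and it takes a throughput estimate in bytes/sec rather than the
IOPS-derived figure mentioned in the suggestion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing FreeBSD interface): derive a per-call
 * "len" from a time budget and a throughput estimate.
 *   budget_ms       - how long one call may run before signals are handled
 *   bytes_per_sec   - measured or estimated throughput for the file system
 *   min_len/max_len - clamps so a bad estimate cannot give a silly value
 */
static size_t
len_for_time_budget(uint64_t budget_ms, uint64_t bytes_per_sec,
    size_t min_len, size_t max_len)
{
	uint64_t len;

	assert(min_len <= max_len);
	/* Guard against overflow in bytes_per_sec * budget_ms. */
	if (budget_ms != 0 && bytes_per_sec > UINT64_MAX / budget_ms)
		return (max_len);
	len = bytes_per_sec * budget_ms / 1000;
	if (len < min_len)
		return (min_len);
	if (len > max_len)
		return (max_len);
	return ((size_t)len);
}
```

With a 1 sec budget and roughly 100 Mbytes/sec of throughput this gives a len
of about 100 Mbytes, which matches the rule of thumb quoted above; a slower
device gets a proportionally smaller len. As noted for NFSv4.2, a file system
that cannot bound the duration of a call would still need a size-based limit.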

Chris

Sent from FreeBSD
________________________________
From: owner-freebsd-hackers@freebsd.org <owner-freebsd-hackers@freebsd.org> on behalf of Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, September 20, 2020 11:28:21 PM
To: Alan Somers <asomers@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: RFC: copy_file_range(3)

[I have only indented your most recent comments]
Alan Somers wrote:
On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read based
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
>>>install(8).  Those three could benefit from sharing a userland wrapper that
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc?  If so, what should it be called, and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>>I don't think the two questions are equivalent.  I think that copy_file_range(2)
>>ought to work on character devices.  Separately, even if it does, I think a userland
>>wrapper would still be useful.  It would still be able to handle sparse files more
>>efficiently than the kernel-based vn_generic_copy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also looks
for blocks of all zero bytes for file systems that do not support SEEK_DATA/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
the output file.
--> The only cases that I am aware of where the holes are not retained are:
     - When the min holesize for the output file is larger than that of the
       input file.
     - When the hole straddles the byte range specified for the syscall.
       (Or when the hole straddles two copy_file_range(2) syscalls, if you
        prefer.)

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal
like <ctrl>C to be handled), the most efficient usage is to specify
a "len" argument equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
       guaranteed to copy everything, given the Linux semantics definition
       of it (an NFSv4.2 server can simply choose to copy less, for example).
       --> This allows the kernel to use whatever block size works efficiently
             and does not require an allocation of a large userspace buffer for
             the data, nor that the data be copied to/from userspace.
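
The "one gulp" usage described above might look like the following sketch.
copy_fd() and copy_path() are hypothetical helper names, not part of any
libc. The chunk argument is the per-call len; SSIZE_MAX is used here as the
"as large as possible" value, on the assumption that len has to fit in the
ssize_t return value. The _GNU_SOURCE define is only needed on Linux/glibc.
The loop is there because a short count is legal.

```c
#define _GNU_SOURCE		/* exposes copy_file_range(2) on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <unistd.h>

/*
 * Copy from infd to outfd, starting at each descriptor's current offset,
 * until EOF on the input. Returns the total bytes copied, or -1 with
 * errno set by copy_file_range(2).
 */
static off_t
copy_fd(int infd, int outfd, size_t chunk)
{
	off_t total = 0;
	ssize_t n;

	assert(chunk > 0 && chunk <= SSIZE_MAX);
	for (;;) {
		/* NULL offsets: use and advance the descriptors' own offsets. */
		n = copy_file_range(infd, NULL, outfd, NULL, chunk, 0);
		if (n < 0)
			return (-1);
		if (n == 0)
			return (total);		/* EOF on input */
		total += n;			/* short count: go around again */
	}
}

/* Convenience wrapper: copy the file at "src" to a new file at "dst". */
static off_t
copy_path(const char *src, const char *dst, size_t chunk)
{
	int infd, outfd, serrno;
	off_t total;

	if ((infd = open(src, O_RDONLY)) < 0)
		return (-1);
	if ((outfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
		serrno = errno;
		close(infd);
		errno = serrno;
		return (-1);
	}
	total = copy_fd(infd, outfd, chunk);
	serrno = errno;			/* keep the copy's errno across close() */
	close(infd);
	close(outfd);
	errno = serrno;
	return (total);
}
```

Calling copy_path(src, dst, SSIZE_MAX) asks for the whole file per call, while
copy_path(src, dst, 100 * 1024 * 1024) bounds each call at the cost of more
syscalls. The sketch has no read(2)/write(2) fallback, so it fails with the
syscall's errno (EINVAL for a VCHR input, for example) where
copy_file_range(2) is not supported.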

The problems with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed until
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
      then the "one gulp" wouldn't end until storage is exhausted on the
      output file(s) device and <ctrl>C wouldn't stop it (since it is one big
      syscall).
     --> As such, I suggested that, if the syscall is extended to allow VCHR,
           that the "len" argument be clipped at "K Mbytes" for that case to
           avoid filling the storage device before being able to <ctrl>C out
           of it, for this case.
I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper doing is handling the VCHR case (if the syscall remains
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?

In summary, it's mostly whether VCHR gets handled by the syscall or a
wrapper?

> 1) In order to quickly respond to a signal, a program must use a modest len with
> copy_file_range
Does this matter? Or put another way, is a 1-2sec delay in response to <ctrl>C
an issue for "cp"?
When kib@ reviewed the syscall, he did not see the delay in signal handling as
a significant problem, noting that it is no different than a large process core
dumping.

> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> truncate the output file to the middle of the hole.  Then, in the next invocation,
> truncate it again to a larger size.
> 3) The result is a file that is not as sparse as the original.
>
> For example, on UFS:
> $ truncate -s 1g sparsefile
> $ cp sparsefile sparsefile2
> $ du -sh sparsefile*
>  96K sparsefile
> 32M sparsefile2
If you care about maintaining sparseness, a "len" of 100Mbytes or more would
be a reasonable choice. Since "cp" has never maintained sparseness, I didn't
suggest such a size when I reviewed your patch for "cp".
--> I/O subsystem performance varies widely, but I think 100Mbytes will limit
      the delay in signal handling to about 1sec. Isn't that quick enough?
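
The truncate/du comparison in the quoted example can also be done
programmatically, which may be handy when testing different "len" values.
The following is a minimal sketch; make_sparse() and sparseness() are
hypothetical helper names, and it assumes st_blocks is counted in 512-byte
units, as is conventional on FreeBSD and Linux.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Rough equivalent of "truncate -s size path": a file that is one big hole. */
static int
make_sparse(const char *path, off_t size)
{
	int fd, error;

	if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
		return (-1);
	error = ftruncate(fd, size);
	close(fd);
	return (error);
}

/*
 * Report a file's logical size and the bytes actually allocated to it,
 * i.e. the two numbers that "ls -l" and "du" would show.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int
sparseness(const char *path, off_t *logical, off_t *allocated)
{
	struct stat sb;

	assert(logical != NULL && allocated != NULL);
	if (stat(path, &sb) != 0)
		return (-1);
	*logical = sb.st_size;
	*allocated = (off_t)sb.st_blocks * 512;	/* assumes 512-byte units */
	return (0);
}
```

Comparing the allocated figure for the source and the copy shows how much
sparseness a given "len" preserved, without parsing du output.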

> My idea for a userland wrapper would solve this problem by using
> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
> everything else with a modest len.  Alternatively, we could eliminate the need for
> the wrapper by enabling copy_file_range for every file system, and making
> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
> large len without penalizing signal handling performance.
The problem with doing this is it largely defeats the purpose of copy_file_range().
1 - What about file systems that do not support SEEK_DATA/SEEK_HOLE?
     (All NFS mounts except NFSv4.2 ones against servers that support the
      NFSv4.2 Seek operation are in this category.)
2 - For NFSv4.2 with servers that support Seek, the copy of an entire file
     can be done via a few (or only one) RPC if you make "len" large and
     don't use Seek.
     If you combine using Seek with len == 2Mbytes, then you do a lot more RPCs
     with associated overheads and RPC RTT delays. You still avoid moving all
     the data across the wire, but you do lose a lot of the performance advantage.

I could have made copy_file_range(2) a lot simpler if the generic code didn't
try and maintain holes, but I wanted it to work well for file systems that did
not support SEEK_DATA/SEEK_HOLE.

I'd suggest you try patching "cp" to use a 100Mbyte "len" for copy_file_range()
and test that.
You should find the sparseness is mostly maintained and that you can <ctrl>C
out of a large file copy without undue delay.
Then try it over NFS mounts (both v4.2 and v3) for the same large sparse file.

You can also code up a patched "cp" using SEEK_DATA/SEEK_HOLE and see
how they compare.
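
For the SEEK_DATA/SEEK_HOLE variant, the core of such a patch would be a walk
over the data extents of the input file. The sketch below only enumerates the
extents; walk_data_extents(), extent_cb, struct extent_stats and
sum_extents() are hypothetical names, and a real wrapper would copy each
extent (and extend the output file over the holes) where this one just calls
a callback. The _GNU_SOURCE define is only needed on Linux/glibc.

```c
#define _GNU_SOURCE		/* exposes SEEK_DATA/SEEK_HOLE on Linux/glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Callback invoked once per data extent [start, end). */
typedef int (*extent_cb)(off_t start, off_t end, void *arg);

/*
 * Walk the data extents of fd using SEEK_DATA/SEEK_HOLE.
 * Returns 0 on success, -1 on an lseek(2) error (errno set), or the
 * callback's value if it returns non-zero.
 */
static int
walk_data_extents(int fd, extent_cb cb, void *arg)
{
	off_t data = 0, hole;
	int error;

	assert(cb != NULL);
	for (;;) {
		data = lseek(fd, data, SEEK_DATA);
		if (data < 0) {
			/* ENXIO: no data at or after this offset, so done. */
			return (errno == ENXIO ? 0 : -1);
		}
		/* Every file has an implicit hole at EOF, so this ends there. */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole < 0)
			return (-1);
		if ((error = cb(data, hole, arg)) != 0)
			return (error);
		data = hole;
	}
}

/* Example callback: total up the data bytes and count the extents. */
struct extent_stats {
	off_t bytes;	/* total bytes found in data extents */
	int count;	/* number of data extents seen */
};

static int
sum_extents(off_t start, off_t end, void *arg)
{
	struct extent_stats *st = arg;

	st->bytes += end - start;
	st->count++;
	return (0);
}
```

On a file system without SEEK_DATA/SEEK_HOLE support the first lseek(2) either
fails or reports the whole file as one data extent, which is the case the
generic kernel code handles by looking for blocks of zero bytes.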

rick


-Alan
_______________________________________________
freebsd-hackers@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"


