Date: Fri, 2 Oct 2020 15:47:37 +0000
From: Rick Macklem <rmacklem@uoguelph.ca>
To: Chris Stephan <chris.stephan@live.com>, Alan Somers <asomers@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: RFC: copy_file_range(3)
Message-ID: <YTBPR01MB3966DBB6148A31A486F39C8BDD310@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <YTBPR01MB396664A1DAE4A2742C22385ADD350@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
References: <CAOtMX2iFZZpoj%2Bap21rrju4hJoip6ZoyxEiCB8852NeH7DAN0Q@mail.gmail.com> <YTBPR01MB39666188FC89399B0D632FE8DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gMYdcx0CUC1Mky3ETFr1JkBbYzn17i11axSW=HRTL7OA@mail.gmail.com> <YTBPR01MB3966C1D4D10BE836B37955F5DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <CAOtMX2jHMRD0Hno03f2dqjJToR152u8d-_40GM_%2BBvNPkN_smA@mail.gmail.com>, <YTBPR01MB396622BAC24ECA15F5421678DD3A0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <SN6PR02MB5487E40F82CC231B5E63A7E89B370@SN6PR02MB5487.namprd02.prod.outlook.com>, <YTBPR01MB3966320580FC5D659F0911D1DD370@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>, <YTBPR01MB396664A1DAE4A2742C22385ADD350@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
[stuff snipped]
Rick Macklem wrote:
>Chris Stephan wrote:
>> New to the list and late to the discussion. I am thinking increasing the len could
>> cause possible degradation of performance when used on slower or legacy
>> systems. On the other hand just picking a len and cutting it off at a hard max
>> seems crude even with a tunable. Admittedly my naive opinion in this matter
>> ponders, could there be a sysctl tunable that just sets an estimate on timeframe
>> instead of size. As you said 100M is roughly a sec on modern hardware. IOPS are
>> already tracked. The inverse of the avg IOPS for the filesystem in question could
>> be used against this tunable to derive the estimated size limit of the next
>> read/write. This would allow the max len within the syscall to both honor a
>> timeframe before a signal would be handled and maximize efficiency across a
>> large breadth of systems varying in performance. I'm sure it is more complicated
>> than I suggest... just tossing my 2c in.
>Yes. Using time will work for the generic copy case and I think that's a good idea.
>Then we can leave the file system specific cases up to the implementors.
>(For NFSv4.2, once you send the RPC to the server, the client has no control over
> how long it takes to reply. The current sysctl that sets a size is still about all the
> NFSv4.2 code can do.)
When I looked at a wireshark packet trace, it turned out that the Copy RPC
happened quickly and it was the subsequent Commit RPC that could take
1sec or more.
As such, setting a time limit on Copy was not useful.
Testing shows that 16Mbytes/Copy is small enough to keep the Commit RPC
well below 1sec even on really slow server hardware (Pentium 4 with IDE disk).
There was also no appreciable performance improvement for Copy sizes
greater than 16Mbytes for the testing I did.
As such, I think the vfs.nfs.maxcopyrange sysctl with a default of 16Mbytes
is all that can be done for NFSv4.2.

For local file systems, a patch that detects pending signals is in progress.

rick

Thanks for the suggestion, rick

Chris

Sent from FreeBSD
________________________________
From: owner-freebsd-hackers@freebsd.org <owner-freebsd-hackers@freebsd.org> on behalf of Rick Macklem <rmacklem@uoguelph.ca>
Sent: Sunday, September 20, 2020 11:28:21 PM
To: Alan Somers <asomers@freebsd.org>
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject: Re: RFC: copy_file_range(3)

[I have only indented your most recent comments]
Alan Somers wrote:
On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem <rmacklem@uoguelph.ca> wrote:
Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read based
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions. In particular, I'm thinking about cp(1), dd(1), and
>>>install(8). Those three could benefit from sharing a userland wrapper that
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc? If so, what should it be called, and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>>I don't think the two questions are equivalent. I think that copy_file_range(2)
>>ought to work on character devices. Separately, even if it does, I think a userland
>>wrapper would still be useful. It would still be able to handle sparse files more
>>efficiently than the kernel-based vn_generic_copy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also looks
for blocks of all zero bytes for file systems that do not support SEEK_DATA/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
the output file.
--> The only cases that I am aware of where the holes are not retained are:
      - When the min holesize for the output file is larger than that of the
        input file.
      - When the hole straddles the byte range specified for the syscall.
        (Or when the hole straddles two copy_file_range(2) syscalls, if you
         prefer.)

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal
like <ctrl>C to be handled), the most efficient usage is to specify
a "len" argument equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
      guaranteed to copy everything, given the Linux semantics definition
      of it (an NFSv4.2 server can simply choose to copy less, for example).
      --> This allows the kernel to use whatever block size works efficiently
            and does not require an allocation of a large userspace buffer for
            the data, nor that the data be copied to/from userspace.

The problems with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed until
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
  then the "one gulp" wouldn't end until storage is exhausted on the
  output file(s) device and <ctrl>C wouldn't stop it (since it is one big
  syscall).
  --> As such, I suggested that, if the syscall is extended to allow VCHR,
        the "len" argument be clipped at "K Mbytes" for that case to
        avoid filling the storage device before being able to <ctrl>C out
        of it.
I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper doing is handling the VCHR case (if the syscall remains
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?

In summary, it's mostly whether VCHR gets handled by the syscall or a
wrapper?

> 1) In order to quickly respond to a signal, a program must use a modest len with
> copy_file_range
Does this matter? Or put another way, is a 1-2sec delay in response to <ctrl>C
an issue for "cp"?
When kib@ reviewed the syscall, he did not see the delay in signal handling
as a significant problem, noting that it is no different than a large process core
dumping.

> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> truncate the output file to the middle of the hole. Then, in the next invocation,
> truncate it again to a larger size.
> 3) The result is a file that is not as sparse as the original.
>
> For example, on UFS:
> $ truncate -s 1g sparsefile
> $ cp sparsefile sparsefile2
> $ du -sh sparsefile*
>  96K sparsefile
>  32M sparsefile2
If you care about maintaining sparseness, a "len" of 100Mbytes or more would
be a reasonable choice. Since "cp" has never maintained sparseness, I didn't
suggest such a size when I reviewed your patch for "cp".
--> I/O subsystem performance varies widely, but I think 100Mbytes will limit
      the delay in signal handling to about 1sec. Isn't that quick enough?

> My idea for a userland wrapper would solve this problem by using
> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
> everything else with a modest len. Alternatively, we could eliminate the need for
> the wrapper by enabling copy_file_range for every file system, and making
> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
> large len without penalizing signal handling performance.
The problem with doing this is it largely defeats the purpose of copy_file_range().
1 - What about file systems that do not support SEEK_DATA/SEEK_HOLE?
     (All NFS mounts except NFSv4.2 ones against servers that support the
      NFSv4.2 Seek operation are in this category.)
2 - For NFSv4.2 with servers that support Seek, the copy of an entire file
     can be done via a few (or only one) RPC if you make "len" large and
     don't use Seek.
     If you combine using Seek with len == 2Mbytes, then you do a lot more RPCs
     with associated overheads and RPC RTT delays. You still avoid moving all
     the data across the wire, but you do lose a lot of the performance advantage.

I could have made copy_file_range(2) a lot simpler if the generic code didn't
try and maintain holes, but I wanted it to work well for file systems that did
not support SEEK_DATA/SEEK_HOLE.

I'd suggest you try patching "cp" to use a 100Mbyte "len" for copy_file_range()
and test that.
You should find the sparseness is mostly maintained and that you can <ctrl>C
out of a large file copy without undue delay.
Then try it over NFS mounts (both v4.2 and v3) for the same large sparse file.

You can also code up a patched "cp" using SEEK_DATA/SEEK_HOLE and see
how they compare.

rick


-Alan
_______________________________________________
freebsd-hackers@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
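
[Illustration, not part of the original message] The experiment suggested
above (calling copy_file_range(2) with a 100Mbyte "len" in a loop, so that a
signal can be acted on between calls) might look roughly like the sketch below.
The names copy_chunked, copy_fallback and COPY_CHUNK are hypothetical, and the
set of errno values that triggers the read(2)/write(2) fallback is an
assumption: the thread only mentions EINVAL for the VCHR case. This is not
the actual "cp" patch discussed here.

```c
#define _GNU_SOURCE	/* exposes copy_file_range(2) on glibc; unused on FreeBSD */
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical chunk size: the 100Mbyte figure discussed in the thread. */
#define COPY_CHUNK	((size_t)100 * 1024 * 1024)

/* read(2)/write(2) fallback for files the syscall rejects (assumed cases). */
off_t
copy_fallback(int infd, int outfd)
{
	char buf[65536];
	off_t total = 0;
	ssize_t n, w, done;

	while ((n = read(infd, buf, sizeof(buf))) > 0) {
		/* write(2) may be short, so loop until the buffer is drained. */
		for (done = 0; done < n; done += w) {
			w = write(outfd, buf + done, (size_t)(n - done));
			if (w < 0)
				return (-1);
		}
		total += n;
	}
	return (n < 0 ? -1 : total);
}

/*
 * Copy from the current offset of infd to EOF, at most "chunk" bytes per
 * syscall.  Each call returns to userland after one chunk, which is where
 * a pending signal gets handled.  Returns bytes copied, or -1 with errno set.
 */
off_t
copy_chunked(int infd, int outfd, size_t chunk)
{
	off_t total = 0;
	ssize_t n;

	for (;;) {
		n = copy_file_range(infd, NULL, outfd, NULL, chunk, 0);
		if (n == 0)
			return (total);		/* EOF on the input file */
		if (n < 0) {
			/* Only fall back if nothing has been copied yet. */
			if (total == 0 && (errno == EINVAL || errno == ENOSYS ||
			    errno == EXDEV || errno == EOPNOTSUPP))
				return (copy_fallback(infd, outfd));
			return (-1);
		}
		total += n;		/* a short copy is legal; just keep going */
	}
}
```

A caller would open both files and call copy_chunked(infd, outfd, COPY_CHUNK).
A smaller chunk gives quicker signal response at the cost of more syscalls
(and, over NFSv4.2, more RPCs), which is the trade-off described above.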
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YTBPR01MB3966DBB6148A31A486F39C8BDD310>