Date:      Fri, 25 Sep 2020 10:54:53 -0600
From:      Alan Somers <asomers@freebsd.org>
To:        Rick Macklem <rmacklem@uoguelph.ca>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
Subject:   Re: RFC: copy_file_range(3)
Message-ID:  <CAOtMX2i+aHxn_84GtTpngdvQkKi4qqNgNyZAtg5UAjxfO-ANJA@mail.gmail.com>
In-Reply-To: <YTBPR01MB3966F7C6C8C067DC4C4E5524DD360@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>
References:  <CAOtMX2iFZZpoj+ap21rrju4hJoip6ZoyxEiCB8852NeH7DAN0Q@mail.gmail.com> <YTBPR01MB39666188FC89399B0D632FE8DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gMYdcx0CUC1Mky3ETFr1JkBbYzn17i11axSW=HRTL7OA@mail.gmail.com> <YTBPR01MB3966C1D4D10BE836B37955F5DD3D0@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2jHMRD0Hno03f2dqjJToR152u8d-_40GM_+BvNPkN_smA@mail.gmail.com> <YTBPR01MB3966BA18F43F7B6353171E67DD380@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <YTBPR01MB39666626FF10803E5D4EF3D2DD380@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM> <CAOtMX2gSc8EF-GCeiDhq3zmQzSXicb2haT_RzvG4XosgrH0Ugg@mail.gmail.com> <YTBPR01MB3966F7C6C8C067DC4C4E5524DD360@YTBPR01MB3966.CANPRD01.PROD.OUTLOOK.COM>

On Fri, Sep 25, 2020 at 10:26 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:

> [The indentation seems to be a bit messed up, so I'll skip to near the
> end...]
> On Wed, Sep 23, 2020 at 9:08 AM Rick Macklem <rmacklem@uoguelph.ca> wrote:
> Rick Macklem wrote:
> >Alan Somers wrote:
> >[lots of stuff snipped]
> >>1) In order to quickly respond to a signal, a program must use a
> >>modest len with copy_file_range
> >For the programs you have mentioned, I think the only signal handling
> >would be termination (<ctrl>C or SIGTERM, if you prefer).
> >I'm not sure what a reasonable response time for this is.
> >I'd like to hear comments from others:
> >- 1sec, less than 1sec, a few seconds, ...
> >
> >> 2) If a hole is larger than len, that will cause
> >> vn_generic_copy_file_range to truncate the output file to the middle
> >> of the hole.  Then, in the next invocation, truncate it again to a
> >> larger size.
> >> 3) The result is a file that is not as sparse as the original.
> >Yes. So, the trick is to use the largest "len" you can live with,
> >given how long you are willing to wait for signal processing.
> >
> >> For example, on UFS:
> >> $ truncate -s 1g sparsefile
> >Not a very interesting sparse file. I wrote a little program to create
> >one.
> >> $ cp sparsefile sparsefile2
> >> $ du -sh sparsefile*
> >>  96K sparsefile
> >>  32M sparsefile2
> Btw, this happens because, at least for UFS (not sure about other file
> systems), if you grow a file's size via VOP_SETATTR() of size, it
> allocates a block at the new EOF, even though no data has been written
> there.
> --> This results in one block being allocated at the end of the range
>     used for a copy_file_range() call, if that file offset is within a
>     hole.
>     --> The larger the "len" argument, the less frequently it will occur.
>
> >>
> >> My idea for a userland wrapper would solve this problem by using
> >> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use
> >> copy_file_range for everything else with a modest len.  Alternatively,
> >> we could eliminate the need for the wrapper by enabling
> >> copy_file_range for every file system, and making
> >> vn_generic_copy_file_range interruptible, so copy_file_range can be
> >> called with large len without penalizing signal handling performance.
> >
> >Well, I ran some quick benchmarks using the attached programs, plus
> >"cp" both before and with your copy_file_range() patch.
> >copya - Does what I think your plan is above, with a limit of 2Mbytes
> >for "len".
> >copyb - Just uses copy_file_range() with 128Mbytes for "len".
> >
> >I first created the sparse file with createsparse.c. It is admittedly
> >a worst case, creating alternating holes and data blocks of the
> >minimum size supported by the file system. (I ran it on a UFS file
> >system created with defaults, so the minimum hole size is 32Kbytes.)
> >The file is 1Gbyte in size with an Allocation size of 524576 ("ls -ls").
> >
> >I then ran copya, copyb, old-cp and new-cp. For NFS, I redid the mount
> >before each copy to avoid data caching in the client.
> >Here's what I got:
> >                 Elapsed time   #RPCs                  Allocation size ("ls -ls" on server)
> >NFSv4.2
> >copya            39.7sec        16384copy+32768seek    524576
> >copyb            10.2sec        104copy                524576
> When I ran the tests I had vfs.nfs.maxcopyrange set to 128Mbytes on the
> server. However, it was still the default of 10Mbytes on the client,
> so this test run used 10Mbytes per Copy. (I wondered why it did 104
> Copies!)
> With both set to 128Mbytes I got:
> copyb              10.0sec        8copy                  524576
> >old-cp           21.9sec        16384read+16384write   1048864
> >new-cp           10.5sec        1024copy               524576
> >
> >NFSv4.1
> >copya            21.8sec        16384read+16384write   1048864
> >copyb            21.0sec        16384read+16384write   1048864
> >old-cp           21.8sec        16384read+16384write   1048864
> >new-cp           21.4sec        16384read+16384write   1048864
> >
> >Local on the UFS file system
> >copya            9.2sec         n/a                    524576
> This turns out to be just variability in the test. I get 7.9sec->9.2sec
> for runs of all three of copya, copyb and new-cp for UFS.
> I think it is caching related, since I wasn't unmounting/remounting the
> UFS file system between test runs.
> >copyb            8.0sec         n/a                    524576
> >old-cp           15.9sec        n/a                    1048864
> >new-cp           7.9sec         n/a                    524576
> >
> >So, for an NFSv4.2 mount, using SEEK_DATA/SEEK_HOLE is definitely
> >a performance hit, due to all the RPC rtts.
> >Your patched "cp" does fine, although a larger "len" reduces the
> >RPC count against the server.
> >All variants using copy_file_range() retain the holes.
> >
> >For NFSv4.1, it (not surprisingly) doesn't matter, since only NFSv4.2
> >supports SEEK_DATA/SEEK_HOLE and VOP_COPY_FILE_RANGE().
> >
> >For UFS, everything using copy_file_range() works pretty well and
> >retains the holes.
>
> >Although "copya" is guaranteed to retain the holes, it does run
> >noticeably slower than the others. Not sure why? Do the extra
> >SEEK_DATA/SEEK_HOLE syscalls cost that much?
> Ignore this. It was just variability in the test runs.
>
> >The limitation of not using SEEK_DATA/SEEK_HOLE is that you will not
> >retain holes that straddle the byte range copied by two subsequent
> >copy_file_range(2) calls.
> This statement is misleading. These holes are partially retained, but
> there will be a block allocated (at least for UFS) at the boundary, due
> to the property of growing a file via VOP_SETATTR(size) noted above.
>
> >--> This can be minimized by using a large "len", but that large "len"
> >      results in slower response to signal handling.
> I'm going to play with "len" to-day and come up with some numbers
> w.r.t. signal handling response time vs the copy_file_range() "len"
> argument.
>
> >I've attached the little programs, so you can play with them.
> >(Maybe try different sparse schemes/sizes? It might be fun to
> > make the holes/blocks some random multiple of hole size up
> > to a limit?)
> >
> >rick
> >ps: In case he isn't reading hackers these days, I've added kib@
> >      as a cc. He might know why UFS is 15% slower when SEEK_HOLE
> >      SEEK_DATA is used.
> Alan Somers wrote:
> > So it sounds like your main point is that for file systems with
> > special support, copy_file_range(2) is more efficient for many sparse
> > files than SEEK_HOLE/SEEK_DATA.
> Well, for NFSv4.2 this is true. Who knows w.r.t. others in the future.
>
> >  Sure, I buy that.  And secondarily, you don't see any reason not to
> > increase the len argument in commands like cp up to somewhere around
> > 128 MB, delaying signal handling for about 1 second on a typical
> > desktop (maybe set it lower on embedded arches).
> When I did some testing on my hardware (laptops with slow spinning
> disks), I got up to about a 2sec delay for 128Mbytes and up to about a
> 1sec delay for 64Mbytes. I got a post that suggested that 1sec should
> be the target and haven't heard differently from anyone else.
>
> Currently, there is a sysctl for NFS that clips the size of a
> copy_file_range(), so that RPC response is reasonable (1sec or less).
> Maybe that sysctl should be replaced with a generic one for
> copy_file_range(), with a default of 64->128Mbytes. (I might make NFS
> use 1/2 of the sysctl value, since the RPC response time shouldn't
> exceed 1sec.)
> Does this sound reasonable?
>
> >  And you think it's fine to allow copy_file_range on devfs, as long
> > as the len argument is clipped at some finite value.  If we make all
> > of those changes, are there any other reasons why the write/read
> > fallback path would be needed?
> I'm on the fence w.r.t. this one. I understand why you would prefer a
> call that worked for special files, but I also like the idea that it is
> "Linux compatible".
>

Here's another datapoint: the iSCSI protocol includes server-side copies
via the EXTENDED COPY command.  And it looks like ctl(4) already supports
that command.  Wouldn't it be great if iscsi(4) also supported it?  But it
can't without a syscall like copy_file_range(2) to use it.


>
> I'd like to hear feedback from others on this.
> Maybe I'll try asking this question separately on freebsd-current@ and
> see if I can get others to respond.
>
> rick
>
> -Alan
>


