From: Alan Somers
Date: Wed, 23 Sep 2020 11:52:43 -0600
Subject: Re: copy_file_range(3)
To: Rick Macklem
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>, Konstantin Belousov

On Wed, Sep 23, 2020 at 9:08 AM Rick Macklem wrote:

> Rick Macklem wrote:
> >Alan Somers wrote:
> >[lots of stuff snipped]
> >>1) In order to quickly respond to a signal, a program must use a modest
> >>len with copy_file_range.
> >For the programs you have mentioned, I think the only signal handling would
> >be termination (^C or SIGTERM if you prefer).
> >I'm not sure what is a reasonable response time for this.
> >I'd like to hear comments from others?
> >- 1sec, less than 1sec, a few seconds, ...
> >
> >> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> >> truncate the output file to the middle of the hole. Then, in the next invocation,
> >> truncate it again to a larger size.
> >> 3) The result is a file that is not as sparse as the original.
> >Yes. So, the trick is to use the largest "len" you can live with, given how long
> >you are willing to wait for signal processing.
> >
> >> For example, on UFS:
> >> $ truncate -s 1g sparsefile
> >Not a very interesting sparse file. I wrote a little program to create one.
> >> $ cp sparsefile sparsefile2
> >> $ du -sh sparsefile*
> >> 96K sparsefile
> >> 32M sparsefile2
> Btw, this happens because, at least for UFS (not sure about other file
> systems), if you grow a file's size via VOP_SETATTR() of size, it allocates a
> block at the new EOF, even though no data has been written there.
> --> This results in one block being allocated at the end of the range used
>     for a copy_file_range() call, if that file offset is within a hole.
> --> The larger the "len" argument, the less frequently it will occur.
>
> >> My idea for a userland wrapper would solve this problem by using
> >> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use
> >> copy_file_range for everything else with a modest len. Alternatively, we
> >> could eliminate the need for the wrapper by enabling copy_file_range for
> >> every file system, and making vn_generic_copy_file_range interruptible,
> >> so copy_file_range can be called with large len without penalizing
> >> signal handling performance.
> >
> >Well, I ran some quick benchmarks using the attached programs, plus "cp"
> >both before and with your copy_file_range() patch.
> >copya - Does what I think your plan is above, with a limit of 2Mbytes for "len".
> >copyb - Just uses copy_file_range() with 128Mbytes for "len".
> >
> >I first created the sparse file with createsparse.c.
> >It is admittedly a worst case, creating alternating holes and data blocks
> >of the minimum size supported by the file system. (I ran it on a UFS file
> >system created with defaults, so the minimum hole size is 32Kbytes.)
> >The file is 1Gbyte in size with an Allocation size of 524576 ("ls -ls").
> >
> >I then ran copya, copyb, old-cp and new-cp. For NFS, I redid the mount
> >before each copy to avoid data caching in the client.
> >Here's what I got:
> >          Elapsed time   #RPCs                  Allocation size ("ls -ls" on server)
> >NFSv4.2
> >copya     39.7sec        16384copy+32768seek    524576
> >copyb     10.2sec        104copy                524576
> When I ran the tests I had vfs.nfs.maxcopyrange set to 128Mbytes on the
> server. However it was still the default of 10Mbytes on the client,
> so this test run used 10Mbytes per Copy. (I wondered why it did 104
> Copies?)
> With both set to 128Mbytes I got:
> copyb     10.0sec        8copy                  524576
> >old-cp    21.9sec        16384read+16384write   1048864
> >new-cp    10.5sec        1024copy               524576
> >
> >NFSv4.1
> >copya     21.8sec        16384read+16384write   1048864
> >copyb     21.0sec        16384read+16384write   1048864
> >old-cp    21.8sec        16384read+16384write   1048864
> >new-cp    21.4sec        16384read+16384write   1048864
> >
> >Local on the UFS file system
> >copya     9.2sec         n/a                    524576
> This turns out to be just variability in the test. I get 7.9sec->9.2sec
> for runs of all three of copya, copyb and new-cp for UFS.
> I think it is caching related, since I wasn't unmounting/remounting the
> UFS file system between test runs.
> >copyb     8.0sec         n/a                    524576
> >old-cp    15.9sec        n/a                    1048864
> >new-cp    7.9sec         n/a                    524576
> >
> >So, for a NFSv4.2 mount, using SEEK_DATA/SEEK_HOLE is definitely
> >a performance hit, due to all the RPC rtts.
> >Your patched "cp" does fine, although a larger "len" reduces the
> >RPC count against the server.
> >All variants using copy_file_range() retain the holes.
> >
> >For NFSv4.1, it (not surprisingly) doesn't matter, since only NFSv4.2
> >supports SEEK_DATA/SEEK_HOLE and VOP_COPY_FILE_RANGE().
> >
> >For UFS, everything using copy_file_range() works pretty well and
> >retains the holes.
> >Although "copya" is guaranteed to retain the holes, it does run noticeably
> >slower than the others. Not sure why? Does the extra SEEK_DATA/SEEK_HOLE
> >syscalls cost that much?
> Ignore this. It was just variability in the test runs.
>
> >The limitation of not using SEEK_DATA/SEEK_HOLE is that you will not
> >retain holes that straddle the byte range copied by two subsequent
> >copy_file_range(2) calls.
> This statement is misleading. These holes are partially retained, but there
> will be a block allocated (at least for UFS) at the boundary, due to the
> property of growing a file via VOP_SETATTR(size) as noted above.
>
> >--> This can be minimized by using a large "len", but that large "len"
> >    results in slower response to signal handling.
> I'm going to play with "len" today and come up with some numbers
> w.r.t. signal handling response time vs the copy_file_range() "len"
> argument.
>
> >I've attached the little programs, so you can play with them.
> >(Maybe try different sparse schemes/sizes? It might be fun to
> >make the holes/blocks some random multiple of hole size up
> >to a limit?)
> >
> >rick
> >ps: In case he isn't reading hackers these days, I've added kib@
> >    as a cc. He might know why UFS is 15% slower when SEEK_HOLE/
> >    SEEK_DATA is used.

So it sounds like your main point is that for file systems with special
support, copy_file_range(2) is more efficient for many sparse files than
SEEK_HOLE/SEEK_DATA. Sure, I buy that. And secondarily, you don't see any
reason not to increase the len argument in commands like cp up to somewhere
around 128 MB, delaying signal handling for about 1 second on a typical
desktop (maybe set it lower on embedded arches).
And you think it's fine to allow copy_file_range on devfs, as long as the
len argument is clipped at some finite value. If we make all of those
changes, are there any other reasons why the write/read fallback path would
be needed?

-Alan