From: Alan Somers
Date: Wed, 23 Sep 2020 11:52:43 -0600
Subject: Re: copy_file_range(3)
To: Rick Macklem
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>, Konstantin Belousov

On Wed, Sep 23, 2020 at 9:08 AM Rick Macklem wrote:

> Rick Macklem wrote:
> >Alan Somers wrote:
> >[lots of stuff snipped]
> >>1) In order to quickly respond to a signal, a program must use a modest
> >>len with copy_file_range.
> >For the programs you have mentioned, I think the only signal handling would
> >be termination (^C or SIGTERM if you prefer).
> >I'm not sure what is a reasonable response time for this.
> >I'd like to hear comments from others?
> >- 1sec, less than 1sec, a few seconds, ...
> >
> >> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> >> truncate the output file to the middle of the hole. Then, in the next invocation,
> >> truncate it again to a larger size.
> >> 3) The result is a file that is not as sparse as the original.
> >Yes. So, the trick is to use the largest "len" you can live with, given how long
> >you are willing to wait for signal processing.
> >
> >> For example, on UFS:
> >> $ truncate -s 1g sparsefile
> >Not a very interesting sparse file. I wrote a little program to create one.
> >> $ cp sparsefile sparsefile2
> >> $ du -sh sparsefile*
> >> 96K sparsefile
> >> 32M sparsefile2
> Btw, this happens because, at least for UFS (not sure about other file
> systems), if you grow a file's size via VOP_SETATTR() of size, it allocates a
> block at the new EOF, even though no data has been written there.
> --> This results in one block being allocated at the end of the range used
>     for a copy_file_range() call, if that file offset is within a hole.
> --> The larger the "len" argument, the less frequently it will occur.
>
> >> My idea for a userland wrapper would solve this problem by using
> >> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use
> >> copy_file_range for everything else with a modest len. Alternatively, we
> >> could eliminate the need for the wrapper by enabling copy_file_range for
> >> every file system, and making vn_generic_copy_file_range interruptible,
> >> so copy_file_range can be called with large len without penalizing
> >> signal handling performance.
> >
> >Well, I ran some quick benchmarks using the attached programs, plus "cp"
> >both before and with your copy_file_range() patch.
> >copya - Does what I think your plan is above, with a limit of 2Mbytes for "len".
> >copyb - Just uses copy_file_range() with 128Mbytes for "len".
> >
> >I first created the sparse file with createsparse.c.
> >It is admittedly a worst case, creating alternating holes and data blocks
> >of the minimum size supported by the file system. (I ran it on a UFS file
> >system created with defaults, so the minimum hole size is 32Kbytes.)
> >The file is 1Gbyte in size with an Allocation size of 524576 ("ls -ls").
> >
> >I then ran copya, copyb, old-cp and new-cp. For NFS, I redid the mount
> >before each copy to avoid data caching in the client.
> >Here's what I got:
> >          Elapsed time   #RPCs                  Allocation size ("ls -ls" on server)
> >NFSv4.2
> >copya     39.7sec        16384copy+32768seek    524576
> >copyb     10.2sec        104copy                524576
> When I ran the tests I had vfs.nfs.maxcopyrange set to 128Mbytes on the
> server. However it was still the default of 10Mbytes on the client,
> so this test run used 10Mbytes per Copy. (I wondered why it did 104
> Copies?)
> With both set to 128Mbytes I got:
> copyb     10.0sec        8copy                  524576
> >old-cp    21.9sec        16384read+16384write   1048864
> >new-cp    10.5sec        1024copy               524576
> >
> >NFSv4.1
> >copya     21.8sec        16384read+16384write   1048864
> >copyb     21.0sec        16384read+16384write   1048864
> >old-cp    21.8sec        16384read+16384write   1048864
> >new-cp    21.4sec        16384read+16384write   1048864
> >
> >Local on the UFS file system
> >copya     9.2sec         n/a                    524576
> This turns out to be just variability in the test. I get 7.9sec->9.2sec
> for runs of all three of copya, copyb and new-cp for UFS.
> I think it is caching related, since I wasn't unmounting/remounting the
> UFS file system between test runs.
> >copyb     8.0sec         n/a                    524576
> >old-cp    15.9sec        n/a                    1048864
> >new-cp    7.9sec         n/a                    524576
> >
> >So, for a NFSv4.2 mount, using SEEK_DATA/SEEK_HOLE is definitely
> >a performance hit, due to all the RPC rtts.
> >Your patched "cp" does fine, although a larger "len" reduces the
> >RPC count against the server.
> >All variants using copy_file_range() retain the holes.
> >
> >For NFSv4.1, it (not surprisingly) doesn't matter, since only NFSv4.2
> >supports SEEK_DATA/SEEK_HOLE and VOP_COPY_FILE_RANGE().
> >
> >For UFS, everything using copy_file_range() works pretty well and
> >retains the holes.
> >Although "copya" is guaranteed to retain the holes, it does run noticeably
> >slower than the others. Not sure why? Does the extra SEEK_DATA/SEEK_HOLE
> >syscalls cost that much?
> Ignore this. It was just variability in the test runs.
>
> >The limitation of not using SEEK_DATA/SEEK_HOLE is that you will not
> >retain holes that straddle the byte range copied by two subsequent
> >copy_file_range(2) calls.
> This statement is misleading. These holes are partially retained, but there
> will be a block allocated (at least for UFS) at the boundary, due to the
> property of growing a file via VOP_SETATTR(size) as noted above.
>
> >--> This can be minimized by using a large "len", but that large "len"
> >    results in slower response to signal handling.
> I'm going to play with "len" today and come up with some numbers
> w.r.t. signal handling response time vs the copy_file_range() "len"
> argument.
>
> >I've attached the little programs, so you can play with them.
> >(Maybe try different sparse schemes/sizes? It might be fun to
> >make the holes/blocks some random multiple of hole size up
> >to a limit?)
> >
> >rick
> >ps: In case he isn't reading hackers these days, I've added kib@
> >    as a cc. He might know why UFS is 15% slower when SEEK_HOLE/
> >    SEEK_DATA is used.

So it sounds like your main point is that for file systems with special
support, copy_file_range(2) is more efficient for many sparse files than
SEEK_HOLE/SEEK_DATA. Sure, I buy that. And secondarily, you don't see any
reason not to increase the len argument in commands like cp up to somewhere
around 128 MB, delaying signal handling for about 1 second on a typical
desktop (maybe set it lower on embedded arches).
And you think it's fine to allow copy_file_range on devfs, as long as the
len argument is clipped at some finite value. If we make all of those
changes, are there any other reasons why the write/read fallback path would
be needed?

-Alan