From: Alan Somers
Date: Sun, 20 Sep 2020 19:40:16 -0600
Subject: Re: RFC: copy_file_range(3)
To: Rick Macklem
Cc: FreeBSD Hackers

On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem wrote:

> Alan Somers wrote:
> >On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem wrote:
> >>Alan Somers wrote:
> >>>copy_file_range(2) is nifty, but it has a few sharp edges:
> >>>1) Certain file systems don't support it, necessitating a write/read based
> >>>fallback
> >>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
> >>>3) It's
> >>>slightly tricky to both efficiently deal with holes and also
> >>>promptly respond to signals
> >>>
> >>>These problems aren't terribly hard, but it seems to me like most
> >>>applications that use copy_file_range would share the exact same
> >>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
> >>>install(8).  Those three could benefit from sharing a userland wrapper that
> >>>handles the above problems.
> >>>
> >>>Should we add such a wrapper to libc?  If so, what should it be called, and
> >>>should it be public or just private to /usr/src ?
> >>There has been a discussion on src-committers which I suggested should
> >>be taken to a public mailing list.
> >>
> >>The basic question is...
> >>Whether or not the copy_file_range(2) syscall should be compatible with
> >>the Linux one.
> >>When I did the syscall, I tried to make it Linux-compatible, arguing that
> >>Linux is now a de-facto standard.
> >>The Linux syscall only works on regular files, which is why Alan's patch for
> >>cp required a "fallback to the old way" for VCHR files like /dev/null.
> >>
> >>He is considering a wrapper in libc to provide FreeBSD specific semantics,
> >>which I have no problem with, so long as the naming and man page make
> >>it clear that it is not compatible with the Linux syscall.
> >>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
> >> compatible, but that is just mho.)
> >>
> >>Hopefully this helps clarify what Alan is asking, rick
> >>
> >>I don't think the two questions are equivalent.  I think that copy_file_range(2)
> >>ought to work on character devices.  Separately, even if it does, I think a userland
> >>wrapper would still be useful.  It would still be able to handle sparse files more
> >>efficiently than the kernel-based vn_generic_copy_file_range.
> I saw this also stated in your #2 above, but wonder why you think a wrapper
> would handle holes more efficiently.
> vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
> just like a wrapper would and retains them as far as possible. It also looks
> for blocks of all zero bytes for file systems that do not support SEEK_DATA/
> SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
> the output file.
> --> The only cases that I am aware of where the holes are not retained are:
>   - When the min holesize for the output file is larger than that of the
>     input file.
>   - When the hole straddles the byte range specified for the syscall.
>     (Or when the hole straddles two copy_file_range(2) syscalls, if you
>     prefer.)
>
> If you are copying the entire file and do not care how long the syscall
> takes (which also implies how long it will take for a termination signal
> like Ctrl-C to be handled), the most efficient usage is to specify
> a "len" argument equal to UINT64_MAX.
> --> This will usually copy the whole file in one gulp, although it is not
>     guaranteed to copy everything, given the Linux semantics definition
>     of it (an NFSv4.2 server can simply choose to copy less, for example).
> --> This allows the kernel to use whatever block size works efficiently
>     and does not require an allocation of a large userspace buffer for
>     the data, nor that the data be copied to/from userspace.
>
> The problems with doing the whole file in one gulp are:
> - A large file can take quite a while and any signal won't be processed until
>   the gulp is done.
>   --> If you wrote a program that allocated a 100Gbyte buffer and then
>       copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
>       you'd end up with the same result.
> - As kib@ noted, if the input file never reports EOF (as /dev/zero does),
>   then the "one gulp" wouldn't end until storage is exhausted on the
>   output file(s) device and Ctrl-C wouldn't stop it (since it is one big
>   syscall).
>   --> As such, I suggested that, if the syscall is extended to allow VCHR,
>       that the "len" argument be clipped at "K Mbytes" for that case to
>       avoid filling the storage device before being able to Ctrl-C out
>       of it, for this case.
> I suppose the answer for #3 is...
> - smaller "len" allows for quicker response to signals
> but
> - smaller "len" results in less efficient use of the syscall.
>
> Your patch for "cp" seemed fine, but used a small "len" and, as such,
> made the use of copy_file_range(2) less efficient.
>
> All I see the wrapper doing is handling the VCHR case (if the syscall remains
> as it is now and returns EINVAL to be compatible with Linux) and making
> some rather arbitrary choice w.r.t. how big "len" should be.
> --> Choosing an appropriate "len" might better be left to the specific use
>     case, I think?
>
> In summary, it's mostly whether VCHR gets handled by the syscall or a
> wrapper?
>

1) In order to quickly respond to a signal, a program must use a modest len
with copy_file_range.
2) If a hole is larger than len, vn_generic_copy_file_range will truncate
the output file to the middle of the hole.  Then, in the next invocation,
it will truncate it again to a larger size.
3) The result is a file that is not as sparse as the original.

For example, on UFS:
$ truncate -s 1g sparsefile
$ cp sparsefile sparsefile2
$ du -sh sparsefile*
 96K sparsefile
 32M sparsefile2

My idea for a userland wrapper would solve this problem by using
SEEK_HOLE/SEEK_DATA to skip over holes in their entirety, and using
copy_file_range with a modest len for everything else.  Alternatively, we
could eliminate the need for the wrapper by enabling copy_file_range for
every file system, and making vn_generic_copy_file_range interruptible, so
copy_file_range can be called with a large len without penalizing signal
handling performance.

-Alan
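
For concreteness, the wrapper idea described above might look roughly like
the following.  This is only a sketch under stated assumptions, not a
proposed libc interface: the name sparse_copy(), its signature, and the
"chunk" parameter are all made up for illustration.  It assumes both
descriptors refer to regular files, that the output file starts out empty,
and that the input file system supports SEEK_DATA/SEEK_HOLE.  It does not
implement the read/write fallback or the VCHR case discussed in the thread.

```c
#define _GNU_SOURCE	/* Linux/glibc only: exposes copy_file_range and SEEK_DATA */
#include <sys/types.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical wrapper (name and signature are assumptions, not an
 * existing libc interface).  Copies all of infd to the same offsets in
 * outfd, which is assumed to be an empty regular file.  Holes are skipped
 * with SEEK_DATA/SEEK_HOLE instead of being copied, and each
 * copy_file_range(2) call moves at most "chunk" bytes, so a pending
 * signal's handler gets to run between calls.
 * Returns 0 on success or -1 with errno set.
 */
int
sparse_copy(int infd, int outfd, size_t chunk)
{
	off_t end, data, hole, inoff, outoff;
	size_t len;
	ssize_t n;

	assert(chunk > 0);
	if ((end = lseek(infd, 0, SEEK_END)) == -1)
		return (-1);
	for (hole = 0; hole < end; ) {
		/* Find the next data region at or after the last hole. */
		if ((data = lseek(infd, hole, SEEK_DATA)) == -1) {
			if (errno == ENXIO)
				break;		/* only a trailing hole is left */
			return (-1);
		}
		if ((hole = lseek(infd, data, SEEK_HOLE)) == -1)
			return (-1);
		/* Copy [data, hole) in pieces of at most "chunk" bytes. */
		while (data < hole) {
			len = chunk;
			if ((unsigned long long)(hole - data) < len)
				len = (size_t)(hole - data);
			inoff = outoff = data;
			n = copy_file_range(infd, &inoff, outfd, &outoff, len, 0);
			if (n == -1)
				return (-1);	/* caller may fall back to read/write */
			if (n == 0)		/* input shrank underneath us */
				return (ftruncate(outfd, data));
			data += n;
		}
	}
	/* Extend the output so that a trailing hole is preserved. */
	return (ftruncate(outfd, end));
}
```

A caller such as cp(1) could presumably invoke this as
sparse_copy(infd, outfd, 2 * 1024 * 1024) and fall back to its existing
read/write loop when it returns -1.  The 2 MB figure is an arbitrary
placeholder, which is exactly the "how big should len be" question raised
above.  Because holes are never passed to copy_file_range here, a small
chunk only costs extra syscalls on data regions and no longer makes the
output less sparse.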