Date: Sat, 6 Jul 2019 00:13:09 +0300
From: Konstantin Belousov
To: Rick Macklem
Cc: Jilles Tjoelker, freebsd-current@FreeBSD.org, Alan Somers
Subject: Re: should a copy_file_range(2) syscall be interrupted via a signal
Message-ID: <20190705211309.GI47193@kib.kiev.ua>

On Fri, Jul 05, 2019 at 08:59:23PM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Fri, Jul 05, 2019 at 07:30:54PM +0200, Jilles Tjoelker wrote:
> >> On Fri, Jul 05, 2019 at 12:28:51AM +0000, Rick Macklem wrote:
> >> > I have been working on a Linux compatible copy_file_range(2) syscall
> >> > (the current code can be found at https://reviews.freebsd.org/D20584).
> >>
> >> > One outstanding issue is how it should deal with signals. Right now, I
> >> > have vn_start_write() without PCATCH, so that it won't be interrupted
> >> > by a signal, but I notice that vn_write() {ie. write syscall } does
> >> > have PCATCH on vn_start_write() and so does vn_rdwr() when it is
> >> > called without IO_NODELOCKED.
> >>
> >> A regular write() is only interruptible when writing to a terminal,
> >> pseudo-terminal master, pipe, socket, or, under certain conditions, a
> >> file on an NFS intr mount. Therefore, applications may not have the code
> >> to resume interrupted writes to regular files gracefully.
> Yes, agreed. Since this syscall only works on VREG vnodes, the only weird cases
> are NFS (and maybe fuse). I'll let asomers@ address the fuse situation.
>
> >>
> >> > I am thinking that copy_file_range(2) should do this also.
> >> > However, if it returns an error, it is impossible for the caller to
> >> > know how much of the data range got copied.
> >>
> >> A regular write() returns partial success if interrupted by a signal
> >> when it has already written something. Therefore, the application can
> >> resume the operation by adjusting pointers and counts.
> >>
> >> Something similar applies to "deterministic" errors like [EFBIG] where
> >> the first call will write as far as possible (if this is not nothing)
> >> successfully and the next attempt will return the error.
> >>
> >> > What do you think the copy_file_range(2) code should do?
> >>
> >> I'm not sure it should actually be done, but the need for adjusting
> >> pointers and counts could be avoided with a little extra kernel and libc
> >> code. The system call would receive an additional argument pointing to
> >> an off_t that indicates how many bytes previous calls have already
> >> written. A libc wrapper would initialize this to 0. With this, the
> >> system call can be restarted automatically after a signal.
> >>
> >> In any case, [EINTR] and the internal ERESTART must not be returned
> >> unless it is safe to repeat the call with the same (direct) arguments.
> Well, since the copy_file_range(2) syscall is allowed to return fewer bytes copied
> than requested and this doesn't mean EOF, it seems that doing that would
> achieve the result of allowing an application to call it again.
> (Basically, it must be used in a loop until the bytes of the range have been copied,
> since returning fewer bytes copied than requested is a normal outcome.)
>
> >BTW, if the syscall is made interruptible, it should be made cancellable ?
> Not sure what you mean by "cancellable"? If you mean "terminated by a signal
> where there has been no change to the output file", then that could only easily be
> done by returning EINTR before any data has been copied.
> If you mean something else, then I'd need to know what that is?

See pthread_setcancelstate(3) for a start, but POSIX 1003.1-2017 section
2.9.5, Thread Cancellation, is the definitive spec and includes a quite
readable overview.

>
> >I think that PCATCH commonly used for vn_start_write(9) is not the best
It is safe in the sense explained by Jilles, since its interruption > >only happens at the very beginning of the syscall, but it contradict to the > >tradition of write(2) to the local fs being not interruptible. > > > >I suggest to not make the syscall interruptible by default, and perhaps > >only allow it with a flag. Then you would need to explain that the > >syscall is only interruptible between VOPs, it is up to fs to decide if > >the VOP_READ/VOP_WRITE is interruptible (e.g. devfs and nfs). > This is how it is coded now. The one thing I have noticed is that a > copy_file_range() can take a long time (about 2min for 2Gbytes on the old hardware > I test on). This seems like a long delay for C when you do that to an application > copying a large file. ("cp" and "dd" also take 2min for 2Gbytes, so it isn't a bug > in copy_file_range(2). It just introduces a long delay in response to C.) That long delay is inconvenience but not something that we should spent too much time trying to fix. We cause the same delay if program does a write(2) of several GB, or when very large process like firefox dumps core.