Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 7 Jul 2019 21:09:24 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        Jilles Tjoelker <jilles@stack.nl>, "freebsd-current@FreeBSD.org" <freebsd-current@FreeBSD.org>, Alan Somers <asomers@freebsd.org>
Subject:   Re: should a copy_file_range(2) syscall be interrupted via a signal
Message-ID:  <YTXPR0101MB1824340EBC647749FF648A4CDDF70@YTXPR0101MB1824.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <20190705211309.GI47193@kib.kiev.ua>
References:  <YTXPR01MB0285E79DFAAE250FD7A7A181DDF50@YTXPR01MB0285.CANPRD01.PROD.OUTLOOK.COM> <20190705173054.GA30404@stack.nl> <20190705174848.GG47193@kib.kiev.ua> <YTXPR01MB028582701D7C289A57C0121EDDF50@YTXPR01MB0285.CANPRD01.PROD.OUTLOOK.COM>, <20190705211309.GI47193@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
Konstantin Belousov wrote:
>On Fri, Jul 05, 2019 at 08:59:23PM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Fri, Jul 05, 2019 at 07:30:54PM +0200, Jilles Tjoelker wrote:
>> >> On Fri, Jul 05, 2019 at 12:28:51AM +0000, Rick Macklem wrote:
>> >> > I have been working on a Linux compatible copy_file_range(2) syscal=
l
>> >> > (the current code can be found at https://reviews.freebsd.org/D2058=
4).
>> >>
>> >> > One outstanding issue is how it should deal with signals. Right now=
, I
>> >> > have vn_start_write() without PCATCH, so that it won't be interrupt=
ed
>> >> > by a signal, but I notice that vn_write() {ie. write syscall } does
>> >> > have PCATCH on vn_start_write() and so does vn_rdwr() when it is
>> >> > called without IO_NODELOCKED.
>> >>
>> >> A regular write() is only interruptible when writing to a terminal,
>> >> pseudo-terminal master, pipe, socket, or, under certain conditions, a
>> >> file on an NFS intr mount. Therefore, applications may not have the c=
ode
>> >> to resume interrupted writes to regular files gracefully.
>> Yes, agreed. Since this syscall only works on VREG vnodes, the only weir=
d cases
>> are NFS (and maybe fuse). I'll let asomers@ address the fuse situation.
>>
>> >>
>> >> > I am thinking that copy_file_range(2) should do this also.
>> >> > However, if it returns an error, it is impossible for the caller to
>> >> > know how much of the data range got copied.
>> >>
>> >> A regular write() returns partial success if interrupted by a signal
>> >> when it has already written something. Therefore, the application can
>> >> resume the operation by adjusting pointers and counts.
>> >>
>> >> Something similar applies to "deterministic" errors like [EFBIG] wher=
e
>> >> the first call will write as far as possible (if this is not nothing)
>> >> successfully and the next attempt will return the error.
>> >>
>> >> > What do you think the copy_file_range(2) code should do?
>> >>
>> >> I'm not sure it should actually be done, but the need for adjusting
>> >> pointers and counts could be avoided with a little extra kernel and l=
ibc
>> >> code. The system call would receive an additional argument pointing t=
o
>> >> an off_t that indicates how many bytes previous calls have already
>> >> written. A libc wrapper would initialize this to 0. With this, the
>> >> system call can be restarted automatically after a signal.
>> >>
>> >> In any case, [EINTR] and the internal ERESTART must not be returned
>> >> unless it is safe to repeat the call with the same (direct) arguments=
.
>> Well, since the copy_file_range(2) syscall is allowed to return fewer by=
tes copied
>> than requested and this doesn't mean EOF, it seems that doing that would
>> achieve the result of allowing an application to call it again.
>> (Basically, it must be used in a loop until the bytes of the range have =
been copied,
>>  since returning fewer bytes copied than requested is a normal outcome.)
>>
>> >BTW, if the syscall is made interruptible, it should be made cancellabl=
e ?
>> Not sure what you mean by "cancellable"? If you mean "terminated by a si=
gnal
>> where there has been no change to the output file, then that could only =
easily be
>> done by returning EINTR before any data has been copied.
>> If you mean something else, then I'd need to know what that is?
>See pthread_setcancelstate(3) for start, but the POSIX 1003.1-2017
>2.9.5 Thread Cancellation is the definitive spec, including the quite
>readable overview.
Ok, thanks. That explains why cancellation of NFSv4.2 Copy operations are d=
efined
the way they are.

>>
>> >I think that PCATCH commonly used for vn_start_write(9) is not the best
>> >decision.  It is safe in the sense explained by Jilles, since its inter=
ruption
>> >only happens at the very beginning of the syscall, but it contradict to=
 the
>> >tradition of write(2) to the local fs being not interruptible.
>> >
>> >I suggest to not make the syscall interruptible by default, and perhaps
>> >only allow it with a flag.  Then you would need to explain that the
>> >syscall is only interruptible between VOPs, it is up to fs to decide if
>> >the VOP_READ/VOP_WRITE is interruptible (e.g. devfs and nfs).
>> This is how it is coded now. The one thing I have noticed is that a
>> copy_file_range() can take a long time (about 2min for 2Gbytes on the ol=
d hardware
>> I test on). This seems like a long delay for <crtl>C when you do that to=
 an application
>> copying a large file. ("cp" and "dd" also take 2min for 2Gbytes, so it i=
sn't a bug
>> in copy_file_range(2). It just introduces a long delay in response to <c=
rtl>C.)
>That long delay is inconvenience but not something that we should spent
>too much time trying to fix. We cause the same delay if program does a
>write(2) of several GB, or when very large process like firefox dumps
>core.

Well, I am happy to leave the patch the way it is now, where the only case
EINTR/ERESTART is returned is if the VOP_xxx() call for the underlying file
system has returned it (such as an NFS mount with "intr" option).

Thanks, rick



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YTXPR0101MB1824340EBC647749FF648A4CDDF70>