Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 13 Jun 2019 22:32:39 +0000
From:      Rick Macklem <rmacklem@uoguelph.ca>
To:        Konstantin Belousov <kostikbel@gmail.com>
Cc:        "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>, Alan Somers <asomers@freebsd.org>, Brooks Davis <brooks@freebsd.org>
Subject:   Re: RFC: should the copy_file_range() syscall be atomic?
Message-ID:  <YQXPR01MB312818DB73CF85DDC4B9BCFDDDEF0@YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM>
In-Reply-To: <20190613220815.GB8697@kib.kiev.ua>
References:  <YQXPR01MB3128EA1CB774995DD5C774FCDDEF0@YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM>, <20190613220815.GB8697@kib.kiev.ua>

next in thread | previous in thread | raw e-mail | index | archive | help
Konstantin Belousov wrote:
>On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote:
>> When I first wrote the copy_file_range() syscall, the updating of the fi=
le was atomic,
>> because I locked both vnodes while doing it.
>> kib@ quickly pointed out that this could introduce a LOR/deadlock becaus=
e two
>> userland threads could concurrently do the syscall with the file argumen=
ts reversed.
>>
>> It turns out that locking two VREG vnodes concurrently to do this isn't =
easy and
>> would require the implementation of non-blocking versions of:
>>   vn_rangelock_rlock() and vn_rangelock_wlock()
>>   - I am not sure how difficult doing this is, but I'll admit I'd rather=
 not do this.
>I will help you with this.  It should be relatively easy, even if requires
>some code re-shuffling.
Ok, I'll wait a while to see if others think this should be done. Then if t=
hey do, I'll
take a first stab at it and then pass it along to you.

>>
>> Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA/
>> FIOSEEKHOLE..) calls to find holes in the byte range of the file being c=
opied from.
>If you lock ranges, you still can do ioctl.
Ok, so you are proposing to lock the byte ranges, but lock/unlock the vnode=
s
and do the vn_start_write()/vn_finished_write() for each block read/written=
 so the FIOSEEK* can be done in the loop?

That sounds fine with me, once the concurrent range locks can be acquired s=
afely.

> But in fact it might be better
>to extract wrapper for FIOSEEK* into kernel function, and use it in the
>ioctl handler and in the copy syscall.
It's just a VOP_IOCTL() call, so I don't see a need to wrap it?
(The only trick is that the vnode must be unlocked when you do the call.)

>>
>> Without the vnodes locked, it is possible for other threads to write to =
either of the
>> files concurrently with the copy_file_range(), resulting in an indetermi=
nate results.
>> (cp(1) has never guaranteed atomic copying of files, so is it needed in =
this syscall?)
>> In summary, doing the syscall non-atomically has the advantages of:
>> - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.t=
. locking
>>   wouldn't require changes to this syscall's code.
>> - VOP_IOCTL() could be used to find holes.
>> - No new rangelock primitives need to be added to the system.
>> - If there were some combination of file system/storage unit where copyi=
ng
>>   non-overlapping byte ranges concurrently could result in better perfor=
mance,
>>   then that could be done. (An atomic syscall would serialize them.)
>>
>> The advantage of an atomic syscall would be consistency with write(2) an=
d read(2)
>> behaviour.
>>
>> The following comments are copied from phabricator:
>> kib@ - So you relock range for each chunk ? This defeats the purpose of =
the range locking. Should copy_file_range() be atomic WRT other reads and w=
rites ?
>>
>> asomers@ - That has an unfortunate side effect: copy_file_range is no lo=
nger atomic if you drop the vnode locks and range locks in the middle. It w=
ould be possible for two copy_file_range operations to proceed concurrently=
, leaving the file in a state where each of the operations was partially su=
ccessful. A better solution would be to add rangelock_trywlock and rangeloc=
k_tryrlock. I don't think it would be very hard (testing them, however, cou=
ld be).
>>
>> I don't see anything in the Linux man page w.r.t. atomicity, so I am now=
 asking what
>> others think?
>> (I'll admit I'm biased towards non-atomic, since I have already coded it=
 and can use
>>  the VOP_IOCTL() calls to find the holes in the input file, but...)
>
>AFAIK, linux does not provide atomicity for file reads and writes at all,
>even for normal reads and writes.  I do not want to read Linux code to
>confirm this.
Heh, heh. I don't want to read Linux code to verify this either;-)
I end up reading their NFS code once in a while, but that's enough for me.

However, it does mean that no one will expect atomic behaviour if Linux doe=
sn't
provide it.

One amusing property of this syscall is that it fails if offset+len exceeds=
 EOF.
The example in the Linux man page first stat(2)s the input file to find out=
 its
size and then copies that much using copy_file_range(2) in a loop.
Of course, if the size changes between the stat(2) and the copy_file_range(=
2)
calls, all bets are off.
Also, the Linux man page says it can return having copied less than request=
ed,
although it doesn't explain when/why. It does state that exceeding EOF retu=
rns
EINVAL, so it isn't that it hits EOF on the input file.



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YQXPR01MB312818DB73CF85DDC4B9BCFDDDEF0>