Date: Thu, 13 Jun 2019 22:32:39 +0000 From: Rick Macklem <rmacklem@uoguelph.ca> To: Konstantin Belousov <kostikbel@gmail.com> Cc: "freebsd-fs@freebsd.org" <freebsd-fs@FreeBSD.org>, Alan Somers <asomers@freebsd.org>, Brooks Davis <brooks@freebsd.org> Subject: Re: RFC: should the copy_file_range() syscall be atomic? Message-ID: <YQXPR01MB312818DB73CF85DDC4B9BCFDDDEF0@YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM> In-Reply-To: <20190613220815.GB8697@kib.kiev.ua> References: <YQXPR01MB3128EA1CB774995DD5C774FCDDEF0@YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM>, <20190613220815.GB8697@kib.kiev.ua>
next in thread | previous in thread | raw e-mail | index | archive | help
Konstantin Belousov wrote: >On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote: >> When I first wrote the copy_file_range() syscall, the updating of the fi= le was atomic, >> because I locked both vnodes while doing it. >> kib@ quickly pointed out that this could introduce a LOR/deadlock becaus= e two >> userland threads could concurrently do the syscall with the file argumen= ts reversed. >> >> It turns out that locking two VREG vnodes concurrently to do this isn't = easy and >> would require the implementation of non-blocking versions of: >> vn_rangelock_rlock() and vn_rangelock_wlock() >> - I am not sure how difficult doing this is, but I'll admit I'd rather= not do this. >I will help you with this. It should be relatively easy, even if requires >some code re-shuffling. Ok, I'll wait a while to see if others think this should be done. Then if t= hey do, I'll take a first stab at it and then pass it along to you. >> >> Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA/ >> FIOSEEKHOLE..) calls to find holes in the byte range of the file being c= opied from. >If you lock ranges, you still can do ioctl. Ok, so you are proposing to lock the byte ranges, but lock/unlock the vnode= s and do the vn_start_write()/vn_finished_write() for each block read/written= so the FIOSEEK* can be done in the loop? That sounds fine with me, once the concurrent range locks can be acquired s= afely. > But in fact it might be better >to extract wrapper for FIOSEEK* into kernel function, and use it in the >ioctl handler and in the copy syscall. It's just a VOP_IOCTL() call, so I don't see a need to wrap it? (The only trick is that the vnode must be unlocked when you do the call.) >> >> Without the vnodes locked, it is possible for other threads to write to = either of the >> files concurrently with the copy_file_range(), resulting in an indetermi= nate results. >> (cp(1) has never guaranteed atomic copying of files, so is it needed in = this syscall?) >> In summary, doing the syscall non-atomically has the advantages of: >> - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.t= . locking >> wouldn't require changes to this syscall's code. >> - VOP_IOCTL() could be used to find holes. >> - No new rangelock primitives need to be added to the system. >> - If there were some combination of file system/storage unit where copyi= ng >> non-overlapping byte ranges concurrently could result in better perfor= mance, >> then that could be done. (An atomic syscall would serialize them.) >> >> The advantage of an atomic syscall would be consistency with write(2) an= d read(2) >> behaviour. >> >> The following comments are copied from phabricator: >> kib@ - So you relock range for each chunk ? This defeats the purpose of = the range locking. Should copy_file_range() be atomic WRT other reads and w= rites ? >> >> asomers@ - That has an unfortunate side effect: copy_file_range is no lo= nger atomic if you drop the vnode locks and range locks in the middle. It w= ould be possible for two copy_file_range operations to proceed concurrently= , leaving the file in a state where each of the operations was partially su= ccessful. A better solution would be to add rangelock_trywlock and rangeloc= k_tryrlock. I don't think it would be very hard (testing them, however, cou= ld be). >> >> I don't see anything in the Linux man page w.r.t. atomicity, so I am now= asking what >> others think? >> (I'll admit I'm biased towards non-atomic, since I have already coded it= and can use >> the VOP_IOCTL() calls to find the holes in the input file, but...) > >AFAIK, linux does not provide atomicity for file reads and writes at all, >even for normal reads and writes. I do not want to read Linux code to >confirm this. Heh, heh. I don't want to read Linux code to verify this either;-) I end up reading their NFS code once in a while, but that's enough for me. However, it does mean that no one will expect atomic behaviour if Linux doe= sn't provide it. One amusing property of this syscall is that it fails if offset+len exceeds= EOF. The example in the Linux man page first stat(2)s the input file to find out= its size and then copies that much using copy_file_range(2) in a loop. Of course, if the size changes between the stat(2) and the copy_file_range(= 2) calls, all bets are off. Also, the Linux man page says it can return having copied less than request= ed, although it doesn't explain when/why. It does state that exceeding EOF retu= rns EINVAL, so it isn't that it hits EOF on the input file.
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?YQXPR01MB312818DB73CF85DDC4B9BCFDDDEF0>