From owner-freebsd-fs@freebsd.org Thu Jun 13 22:25:49 2019 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D395415BEFC8 for ; Thu, 13 Jun 2019 22:25:48 +0000 (UTC) (envelope-from asomers@gmail.com) Received: from mail-lj1-f194.google.com (mail-lj1-f194.google.com [209.85.208.194]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) server-signature RSA-PSS (4096 bits) client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1O1" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4C9FF70EBE; Thu, 13 Jun 2019 22:25:48 +0000 (UTC) (envelope-from asomers@gmail.com) Received: by mail-lj1-f194.google.com with SMTP id h10so352463ljg.0; Thu, 13 Jun 2019 15:25:48 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc:content-transfer-encoding; bh=pEhTHyhkU7XfhJbWAdyCfrpPrFQbcAvz7qhBtAyfnPw=; b=U7lW1D3FCF1QZgg9bXRWhUUmG4anodjcUms7H99uB/GU0/QLyGGiz3cxVz0CA+68bF KC82ZtRft19vJkyslWHhBHCYKSl9davdsMK4vAKQbcyY1B4Idg8LkOouXnL/lr0aVPnb 0LuvE0qgj35TeAtRIKVCXA4YDHMQ/5mVh0EYLYbkzZJ88GufVC+MEVRhQ7G7oEFIjgVC ZXbnNZzuVkk84H9BT2S38QrkRA8ZUPnUYIgNVtOzk9mPS6V0Kw4e+zOby2Lao9OEnYTD flXQtEQHZIZm+dbVRQclvtYqVmOpA0qw637Fem8NYxRFxI9Yd4kSflLI9hinpcWsSkFw Ospg== X-Gm-Message-State: APjAAAURDz2l6RnzlXXx5QZ8lna6p2xzIyP3t6WtjTmRQIpy8D+CEBLM qwMOAxYx71hzdJOmd65XVasUfLNMSDwJk/MxteigXLzG X-Google-Smtp-Source: APXvYqwVERotSi0n2MApYLTqE3up7Ds5beC68O+ssMEikECsd+U/3k07Imbb/hAABw4jP4WcYNQ+T0vUGuk5vijSCno= X-Received: by 2002:a2e:6e0c:: with SMTP id j12mr25598309ljc.123.1560464741119; Thu, 13 Jun 2019 15:25:41 -0700 (PDT) MIME-Version: 1.0 References: <20190613220815.GB8697@kib.kiev.ua> In-Reply-To: <20190613220815.GB8697@kib.kiev.ua> From: Alan Somers Date: Thu, 13 Jun 2019 16:25:29 -0600 Message-ID: Subject: Re: RFC: should the copy_file_range() syscall be atomic? To: Konstantin Belousov Cc: Rick Macklem , "freebsd-fs@freebsd.org" , Brooks Davis Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 4C9FF70EBE X-Spamd-Bar: ------ Authentication-Results: mx1.freebsd.org X-Spamd-Result: default: False [-6.96 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-0.999,0]; NEURAL_HAM_SHORT(-0.96)[-0.965,0]; NEURAL_HAM_LONG(-1.00)[-1.000,0]; REPLY(-4.00)[] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Jun 2019 22:25:49 -0000 On Thu, Jun 13, 2019 at 4:08 PM Konstantin Belousov w= rote: > > On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote: > > When I first wrote the copy_file_range() syscall, the updating of the f= ile was atomic, > > because I locked both vnodes while doing it. > > kib@ quickly pointed out that this could introduce a LOR/deadlock becau= se two > > userland threads could concurrently do the syscall with the file argume= nts reversed. > > > > It turns out that locking two VREG vnodes concurrently to do this isn't= easy and > > would require the implementation of non-blocking versions of: > > vn_rangelock_rlock() and vn_rangelock_wlock() > > - I am not sure how difficult doing this is, but I'll admit I'd rathe= r not do this. > I will help you with this. It should be relatively easy, even if require= s > some code re-shuffling. > > > > > Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA= / > > FIOSEEKHOLE..) calls to find holes in the byte range of the file being = copied from. > If you lock ranges, you still can do ioctl. But in fact it might be bett= er > to extract wrapper for FIOSEEK* into kernel function, and use it in the > ioctl handler and in the copy syscall. > > > > > Without the vnodes locked, it is possible for other threads to write to= either of the > > files concurrently with the copy_file_range(), resulting in an indeterm= inate results. > > (cp(1) has never guaranteed atomic copying of files, so is it needed in= this syscall?) > > In summary, doing the syscall non-atomically has the advantages of: > > - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.= t. locking > > wouldn't require changes to this syscall's code. > > - VOP_IOCTL() could be used to find holes. > > - No new rangelock primitives need to be added to the system. > > - If there were some combination of file system/storage unit where copy= ing > > non-overlapping byte ranges concurrently could result in better perfo= rmance, > > then that could be done. (An atomic syscall would serialize them.) > > > > The advantage of an atomic syscall would be consistency with write(2) a= nd read(2) > > behaviour. > > > > The following comments are copied from phabricator: > > kib@ - So you relock range for each chunk ? This defeats the purpose of= the range locking. Should copy_file_range() be atomic WRT other reads and = writes ? > > > > asomers@ - That has an unfortunate side effect: copy_file_range is no l= onger atomic if you drop the vnode locks and range locks in the middle. It = would be possible for two copy_file_range operations to proceed concurrentl= y, leaving the file in a state where each of the operations was partially s= uccessful. A better solution would be to add rangelock_trywlock and rangelo= ck_tryrlock. I don't think it would be very hard (testing them, however, co= uld be). > > > > I don't see anything in the Linux man page w.r.t. atomicity, so I am no= w asking what > > others think? > > (I'll admit I'm biased towards non-atomic, since I have already coded i= t and can use > > the VOP_IOCTL() calls to find the holes in the input file, but...) > > AFAIK, linux does not provide atomicity for file reads and writes at all, > even for normal reads and writes. I do not want to read Linux code to > confirm this. Really? I find that hard to believe. But sure enough, here's the proof: (they claim it's fixed now) http://man7.org/linux/man-pages/man2/write.2.html . As for copy_file_range, if it's not implemented by the file system, Linux falls back to an implementation based on splice(2). I can't find anything that says whether splice has atomic writes, and in a quick scan of the code I can't find any sign of locking. However, glibc also contains a userland implementation of copy_file_range. It works 8KB at a time and is most definitely NOT atomic. -Alan