From owner-freebsd-fs@freebsd.org Thu Jun 13 22:32:42 2019 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E247615BF27B for ; Thu, 13 Jun 2019 22:32:41 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from CAN01-TO1-obe.outbound.protection.outlook.com (mail-eopbgr670052.outbound.protection.outlook.com [40.107.67.52]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "mail.protection.outlook.com", Issuer "GlobalSign Organization Validation CA - SHA256 - G3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 7C65A712F6; Thu, 13 Jun 2019 22:32:41 +0000 (UTC) (envelope-from rmacklem@uoguelph.ca) Received: from YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM (52.132.93.160) by YQXPR01MB3158.CANPRD01.PROD.OUTLOOK.COM (52.132.90.14) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.1987.12; Thu, 13 Jun 2019 22:32:39 +0000 Received: from YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM ([fe80::4882:9001:520a:7453]) by YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM ([fe80::4882:9001:520a:7453%5]) with mapi id 15.20.1987.012; Thu, 13 Jun 2019 22:32:39 +0000 From: Rick Macklem To: Konstantin Belousov CC: "freebsd-fs@freebsd.org" , Alan Somers , Brooks Davis Subject: Re: RFC: should the copy_file_range() syscall be atomic? Thread-Topic: RFC: should the copy_file_range() syscall be atomic? Thread-Index: AQHVIi1PBuOR1YGquUCvHoiFxEOa/aaaJKSAgAABx3U= Date: Thu, 13 Jun 2019 22:32:39 +0000 Message-ID: References: , <20190613220815.GB8697@kib.kiev.ua> In-Reply-To: <20190613220815.GB8697@kib.kiev.ua> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: x-ms-publictraffictype: Email x-ms-office365-filtering-correlation-id: d4fb0630-d7a3-467c-72db-08d6f04f0ad8 x-microsoft-antispam: BCL:0; PCL:0; RULEID:(2390118)(7020095)(4652040)(8989299)(4534185)(4627221)(201703031133081)(201702281549075)(8990200)(5600148)(711020)(4605104)(1401327)(2017052603328)(7193020); SRVR:YQXPR01MB3158; x-ms-traffictypediagnostic: YQXPR01MB3158: x-microsoft-antispam-prvs: x-ms-oob-tlc-oobclassifiers: OLM:10000; x-forefront-prvs: 0067A8BA2A x-forefront-antispam-report: SFV:NSPM; SFS:(10009020)(136003)(376002)(396003)(366004)(346002)(39860400002)(199004)(189003)(6246003)(186003)(6436002)(53936002)(478600001)(99286004)(33656002)(229853002)(102836004)(14454004)(55016002)(6506007)(54906003)(9686003)(316002)(786003)(14444005)(256004)(74316002)(446003)(11346002)(2906002)(76176011)(305945005)(74482002)(5660300002)(6916009)(81156014)(86362001)(68736007)(7696005)(1411001)(476003)(486006)(8936002)(71200400001)(71190400001)(25786009)(76116006)(66446008)(66476007)(64756008)(66946007)(66556008)(73956011)(52536014)(4326008)(46003)(81166006)(8676002); DIR:OUT; SFP:1101; SCL:1; SRVR:YQXPR01MB3158; H:YQXPR01MB3128.CANPRD01.PROD.OUTLOOK.COM; FPR:; SPF:None; LANG:en; PTR:InfoNoRecords; A:1; MX:1; received-spf: None (protection.outlook.com: uoguelph.ca does not designate permitted sender hosts) x-ms-exchange-senderadcheck: 1 x-microsoft-antispam-message-info: YsKjZJ2OWuMfSfpur+l2kH8vN8dACRzR3X8U3eWtrTCjAjnXc7PG0Xe94bkqBrBKq+jtzg+DxKMDfaJ2awVOAQf1GgTCyOS6fn3mMGmbs0EEpRS/YIfi2DJnMtt7BH3X7iBpl/JcSMqKKRK/uUVlKHv+csXXStDFngj/WMFvTU5Knx6g3sw65WkitesK/3TzQX7n7xK81fh5TY3xrEmaXwY6bWAzTkvJBgyNkYbJRb5ieoK+auR+IsJbBmOR8LW2gfFUirE0ZxAqAL5C172uagSvQTCv2PH6NXOIOokfIyGhLvFN0MK8NKrpmItJabVgQEG8bIjqK1ZzuiBsY+d/NthEwdB7XCFVvLSEGIWsywYvGLrN8kDehHmM1oOHoOepf9AcLl+tMZEeKq8e5/+SwlQNC1kVkAUeDg+48OrGZuc= Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-OriginatorOrg: uoguelph.ca X-MS-Exchange-CrossTenant-Network-Message-Id: d4fb0630-d7a3-467c-72db-08d6f04f0ad8 X-MS-Exchange-CrossTenant-originalarrivaltime: 13 Jun 2019 22:32:39.3224 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: be62a12b-2cad-49a1-a5fa-85f4f3156a7d X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-CrossTenant-userprincipalname: rmacklem@uoguelph.ca X-MS-Exchange-Transport-CrossTenantHeadersStamped: YQXPR01MB3158 X-Rspamd-Queue-Id: 7C65A712F6 X-Spamd-Bar: ------ Authentication-Results: mx1.freebsd.org X-Spamd-Result: default: False [-6.97 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-0.999,0]; NEURAL_HAM_LONG(-1.00)[-1.000,0]; REPLY(-4.00)[]; NEURAL_HAM_SHORT(-0.97)[-0.968,0] X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 13 Jun 2019 22:32:42 -0000 Konstantin Belousov wrote: >On Thu, Jun 13, 2019 at 09:44:01PM +0000, Rick Macklem wrote: >> When I first wrote the copy_file_range() syscall, the updating of the fi= le was atomic, >> because I locked both vnodes while doing it. >> kib@ quickly pointed out that this could introduce a LOR/deadlock becaus= e two >> userland threads could concurrently do the syscall with the file argumen= ts reversed. >> >> It turns out that locking two VREG vnodes concurrently to do this isn't = easy and >> would require the implementation of non-blocking versions of: >> vn_rangelock_rlock() and vn_rangelock_wlock() >> - I am not sure how difficult doing this is, but I'll admit I'd rather= not do this. >I will help you with this. It should be relatively easy, even if requires >some code re-shuffling. Ok, I'll wait a while to see if others think this should be done. Then if t= hey do, I'll take a first stab at it and then pass it along to you. >> >> Also, having the vnodes locked precludes use of VOP_IOCTL(..FIOSEEKDATA/ >> FIOSEEKHOLE..) calls to find holes in the byte range of the file being c= opied from. >If you lock ranges, you still can do ioctl. Ok, so you are proposing to lock the byte ranges, but lock/unlock the vnode= s and do the vn_start_write()/vn_finished_write() for each block read/written= so the FIOSEEK* can be done in the loop? That sounds fine with me, once the concurrent range locks can be acquired s= afely. > But in fact it might be better >to extract wrapper for FIOSEEK* into kernel function, and use it in the >ioctl handler and in the copy syscall. It's just a VOP_IOCTL() call, so I don't see a need to wrap it? (The only trick is that the vnode must be unlocked when you do the call.) >> >> Without the vnodes locked, it is possible for other threads to write to = either of the >> files concurrently with the copy_file_range(), resulting in an indetermi= nate results. >> (cp(1) has never guaranteed atomic copying of files, so is it needed in = this syscall?) >> In summary, doing the syscall non-atomically has the advantages of: >> - vn_rdwr() takes care of the vnode locking issues, so any changes w.r.t= . locking >> wouldn't require changes to this syscall's code. >> - VOP_IOCTL() could be used to find holes. >> - No new rangelock primitives need to be added to the system. >> - If there were some combination of file system/storage unit where copyi= ng >> non-overlapping byte ranges concurrently could result in better perfor= mance, >> then that could be done. (An atomic syscall would serialize them.) >> >> The advantage of an atomic syscall would be consistency with write(2) an= d read(2) >> behaviour. >> >> The following comments are copied from phabricator: >> kib@ - So you relock range for each chunk ? This defeats the purpose of = the range locking. Should copy_file_range() be atomic WRT other reads and w= rites ? >> >> asomers@ - That has an unfortunate side effect: copy_file_range is no lo= nger atomic if you drop the vnode locks and range locks in the middle. It w= ould be possible for two copy_file_range operations to proceed concurrently= , leaving the file in a state where each of the operations was partially su= ccessful. A better solution would be to add rangelock_trywlock and rangeloc= k_tryrlock. I don't think it would be very hard (testing them, however, cou= ld be). >> >> I don't see anything in the Linux man page w.r.t. atomicity, so I am now= asking what >> others think? >> (I'll admit I'm biased towards non-atomic, since I have already coded it= and can use >> the VOP_IOCTL() calls to find the holes in the input file, but...) > >AFAIK, linux does not provide atomicity for file reads and writes at all, >even for normal reads and writes. I do not want to read Linux code to >confirm this. Heh, heh. I don't want to read Linux code to verify this either;-) I end up reading their NFS code once in a while, but that's enough for me. However, it does mean that no one will expect atomic behaviour if Linux doe= sn't provide it. One amusing property of this syscall is that it fails if offset+len exceeds= EOF. The example in the Linux man page first stat(2)s the input file to find out= its size and then copies that much using copy_file_range(2) in a loop. Of course, if the size changes between the stat(2) and the copy_file_range(= 2) calls, all bets are off. Also, the Linux man page says it can return having copied less than request= ed, although it doesn't explain when/why. It does state that exceeding EOF retu= rns EINVAL, so it isn't that it hits EOF on the input file.