From owner-freebsd-hackers@freebsd.org Mon Sep 28 01:43:22 2020
From: Rick Macklem
To: Chris Stephan, Alan Somers
CC: FreeBSD Hackers
Subject: Re: RFC: copy_file_range(3)
Date: Mon, 28 Sep 2020 01:43:18 +0000
List-Id: Technical Discussions relating to FreeBSD

I have implemented the 1 second timeout and put it up on Phabricator as D26571.
If anyone wishes to review, please do so.

There are also D26569 and D26570, which are a couple of fixes I came up with
during testing.

These patches do not address the "should it be Linux compatible" question,
which is being discussed on another thread.

Thanks for the suggestion w.r.t. using a time limit, rick

________________________________________
From: owner-freebsd-hackers@freebsd.org on behalf of Rick Macklem
Sent: Saturday, September 26, 2020 7:22 PM
To: Chris Stephan; Alan Somers
Cc: FreeBSD Hackers
Subject: Re: RFC: copy_file_range(3)

Chris Stephan wrote:
> New to the list and late to the discussion. I am thinking increasing the len could
> cause possible degradation of performance when used on slower or legacy
> systems. On the other hand, just picking a len and cutting it off at a hard max
> seems crude even with a tunable. Admittedly my naive opinion in this matter
> ponders, could there be a sysctl tunable that just sets an estimate on timeframe
> instead of size. As you said, 100M is roughly a sec on modern hardware. IOPS are
> already tracked. The inverse of the avg IOPS for the filesystem in question could
> be used against this tunable to derive the estimated size limit of the next
> read/write. This would allow the max len within the syscall to both honor a
> timeframe before a signal would be handled and maximize efficiency across a
> large breadth of systems varying in performance. I'm sure it is more complicated
> than I suggest... just tossing my 2c in.

Yes. Using time will work for the generic copy case and I think that's a good idea.
Then we can leave the file system specific cases up to the implementors.
(For NFSv4.2, once you send the RPC to the server, the client has no control
over how long it takes to reply.
The current sysctl that sets a size is still about all the NFSv4.2 code can do.)

Thanks for the suggestion, rick

Chris

Sent from FreeBSD
________________________________
From: owner-freebsd-hackers@freebsd.org on behalf of Rick Macklem
Sent: Sunday, September 20, 2020 11:28:21 PM
To: Alan Somers
Cc: FreeBSD Hackers
Subject: Re: RFC: copy_file_range(3)

[I have only indented your most recent comments]

Alan Somers wrote:
On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem wrote:
Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read based
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions.  In particular, I'm thinking about cp(1), dd(1), and
>>>install(8).  Those three could benefit from sharing a userland wrapper that
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc?  If so, what should it be called, and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>>I don't think the two questions are equivalent.  I think that copy_file_range(2)
>>ought to work on character devices.  Separately, even if it does, I think a userland
>>wrapper would still be useful.  It would still be able to handle sparse files more
>>efficiently than the kernel-based vn_generic_copy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also looks
for blocks of all zero bytes for file systems that do not support SEEK_DATA/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these
in the output file.
--> The only cases that I am aware of where the holes are not retained are:
      - When the min holesize for the output file is larger than that of the
        input file.
      - When the hole straddles the byte range specified for the syscall.
        (Or when the hole straddles two copy_file_range(2) syscalls, if you prefer.)

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal like
Ctrl-C to be handled), the most efficient usage is to specify a "len" argument
equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
      guaranteed to copy everything, given the Linux semantics definition
      of it (an NFSv4.2 server can simply choose to copy less, for example).
--> This allows the kernel to use whatever block size works efficiently
      and does not require an allocation of a large userspace buffer for
      the data, nor that the data be copied to/from userspace.

The problems with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed until
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
  then the "one gulp" wouldn't end until storage is exhausted on the
  output file(s) device and Ctrl-C wouldn't stop it (since it is one big
  syscall).
  --> As such, I suggested that, if the syscall is extended to allow VCHR,
        that the "len" argument be clipped at "K Mbytes" for that case to
        avoid filling the storage device before being able to Ctrl-C out
        of it.

I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper doing is handling the VCHR case (if the syscall remains
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?

In summary, it's mostly whether VCHR gets handled by the syscall or a wrapper?

> 1) In order to quickly respond to a signal, a program must use a modest len with
> copy_file_range
Does this matter? Or put another way, is a 1-2sec delay in response to Ctrl-C
an issue for "cp"?
When kib@ reviewed the syscall, he did not see the delay in signal handling as
a significant problem, noting that it is no different than a large process core
dumping.
> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> truncate the output file to the middle of the hole.  Then, in the next invocation,
> truncate it again to a larger size.
> 3) The result is a file that is not as sparse as the original.
>
> For example, on UFS:
> $ truncate -s 1g sparsefile
> $ cp sparsefile sparsefile2
> $ du -sh sparsefile*
>  96K sparsefile
>  32M sparsefile2
If you care about maintaining sparseness, a "len" of 100Mbytes or more would
be a reasonable choice. Since "cp" has never maintained sparseness, I didn't
suggest such a size when I reviewed your patch for "cp".
--> I/O subsystem performance varies widely, but I think 100Mbytes will limit
      the delay in signal handling to about 1sec. Isn't that quick enough?

> My idea for a userland wrapper would solve this problem by using
> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
> everything else with a modest len.  Alternatively, we could eliminate the need for
> the wrapper by enabling copy_file_range for every file system, and making
> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
> large len without penalizing signal handling performance.
The problem with doing this is it largely defeats the purpose of copy_file_range().
1 - What about file systems that do not support SEEK_DATA/SEEK_HOLE?
     (All NFS mounts except NFSv4.2 ones against servers that support the
      NFSv4.2 Seek operation are in this category.)
2 - For NFSv4.2 with servers that support Seek, the copy of an entire file
     can be done via a few (or only one) RPC if you make "len" large and
     don't use Seek.
     If you combine using Seek with len == 2Mbytes, then you do a lot more RPCs
     with associated overheads and RPC RTT delays. You still avoid moving all
     the data across the wire, but you do lose a lot of the performance advantage.
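
To put rough numbers on point 2, here is a back-of-the-envelope sketch in Python. It assumes every call copies the full "len" it is asked for and that each call costs at least one RPC over NFSv4.2; both are simplifications for illustration, not measurements. The helper name calls_needed and the 1 Gbyte file size are made up for the example.

```python
MIB = 1024 * 1024


def calls_needed(file_size, length):
    """Fewest copy_file_range(2) calls needed to copy file_size bytes when
    each call is asked for at most `length` bytes and copies all of them."""
    return -(-file_size // length)  # ceiling division on integers


if __name__ == "__main__":
    size = 1024 * MIB  # hypothetical 1 Gbyte file
    for length in (2 * MIB, 100 * MIB, size):
        print(f"len = {length // MIB:5d} Mbytes -> {calls_needed(size, length)} calls")
```

Under those assumptions a 1 Gbyte file takes 512 calls with a 2 Mbyte "len", 11 calls with 100 Mbytes, and a single call when "len" covers the whole file. Any Seek operations used to find holes would come on top of that.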
I could have made copy_file_range(2) a lot simpler if the generic code didn't
try and maintain holes, but I wanted it to work well for file systems that did
not support SEEK_DATA/SEEK_HOLE.

I'd suggest you try patching "cp" to use a 100Mbyte "len" for copy_file_range()
and test that.
You should find the sparseness is mostly maintained and that you can Ctrl-C
out of a large file copy without undue delay.
Then try it over NFS mounts (both v4.2 and v3) for the same large sparse file.
You can also code up a patched "cp" using SEEK_DATA/SEEK_HOLE and see
how they compare.

rick

-Alan
_______________________________________________
freebsd-hackers@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscribe@freebsd.org"
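
A minimal sketch of the experiment suggested in the message (a copy loop that asks copy_file_range() for 100 Mbytes at a time), written in Python rather than as a patch to cp(1). It is an illustration under assumptions, not the code from any of the reviews mentioned: the function name copy_fd and the 1 Mbyte fallback buffer are hypothetical, and os.copy_file_range() is only present where the Python build exposes it. The read/write fallback mirrors the "write/read based fallback" from the original list of sharp edges and does not try to preserve holes.

```python
import os

LEN = 100 * 1024 * 1024  # the 100 Mbyte "len" suggested in the message


def copy_fd(fd_in, fd_out, length=LEN):
    """Copy from fd_in to fd_out until EOF; return the number of bytes copied.

    Each pass asks for at most `length` bytes, so Python gets a chance to
    run signal handlers (e.g. raise KeyboardInterrupt) between calls.
    """
    use_cfr = hasattr(os, "copy_file_range")
    total = 0
    while True:
        if use_cfr:
            try:
                # May copy fewer bytes than requested; the loop just asks again.
                n = os.copy_file_range(fd_in, fd_out, length)
            except OSError:
                # Not usable here (for example an unsupported file type):
                # switch to plain read/write from the current offsets.
                use_cfr = False
                continue
        else:
            buf = os.read(fd_in, min(length, 1024 * 1024))
            n = len(buf)
            done = 0
            while done < n:  # write(2) may be partial
                done += os.write(fd_out, buf[done:])
        if n == 0:  # 0 bytes means end of input
            return total
        total += n
```

Because both descriptors are used with their own file offsets, a failure of copy_file_range() partway through simply continues with read/write from where the last successful call stopped.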