From: Rick Macklem
To: Alan Somers
CC: FreeBSD Hackers
Subject: Re: RFC: copy_file_range(3)
Date: Mon, 21 Sep 2020 04:28:21 +0000
List-Id: Technical Discussions relating to FreeBSD

[I have only indented your most recent comments]
Alan Somers wrote:
On Sun, Sep 20, 2020 at 5:14 PM Rick Macklem wrote:
Alan Somers wrote:
>On Sun, Sep 20, 2020 at 9:58 AM Rick Macklem wrote:
>>Alan Somers wrote:
>>>copy_file_range(2) is nifty, but it has a few sharp edges:
>>>1) Certain file systems don't support it, necessitating a write/read based
>>>fallback
>>>2) It doesn't handle sparse files as well as SEEK_HOLE/SEEK_DATA
>>>3) It's slightly tricky to both efficiently deal with holes and also
>>>promptly respond to signals
>>>
>>>These problems aren't terribly hard, but it seems to me like most
>>>applications that use copy_file_range would share the exact same
>>>solutions. In particular, I'm thinking about cp(1), dd(1), and
>>>install(8). Those three could benefit from sharing a userland wrapper that
>>>handles the above problems.
>>>
>>>Should we add such a wrapper to libc? If so, what should it be called, and
>>>should it be public or just private to /usr/src ?
>>There has been a discussion on src-committers which I suggested should
>>be taken to a public mailing list.
>>
>>The basic question is...
>>Whether or not the copy_file_range(2) syscall should be compatible with
>>the Linux one.
>>When I did the syscall, I tried to make it Linux-compatible, arguing that
>>Linux is now a de-facto standard.
>>The Linux syscall only works on regular files, which is why Alan's patch for
>>cp required a "fallback to the old way" for VCHR files like /dev/null.
>>
>>He is considering a wrapper in libc to provide FreeBSD specific semantics,
>>which I have no problem with, so long as the naming and man page make
>>it clear that it is not compatible with the Linux syscall.
>>(Personally, I'd prefer a wrapper in libc to making the actual syscall non-Linux
>> compatible, but that is just mho.)
>>
>>Hopefully this helps clarify what Alan is asking, rick
>>
>>I don't think the two questions are equivalent. I think that copy_file_range(2)
>>ought to work on character devices. Separately, even if it does, I think a userland
>>wrapper would still be useful. It would still be able to handle sparse files more
>>efficiently than the kernel-based vn_generic_copy_file_range.
I saw this also stated in your #2 above, but wonder why you think a wrapper
would handle holes more efficiently.
vn_generic_copy_file_range() does look for holes via SEEK_DATA/SEEK_HOLE
just like a wrapper would and retains them as far as possible. It also looks
for blocks of all zero bytes for file systems that do not support SEEK_DATA/
SEEK_HOLE (like NFS versions prior to 4.2) and creates holes for these in
the output file.
--> The only cases that I am aware of where the holes are not retained are:
      - When the min holesize for the output file is larger than that of the
        input file.
      - When the hole straddles the byte range specified for the syscall.
        (Or when the hole straddles two copy_file_range(2) syscalls, if you
        prefer.)

If you are copying the entire file and do not care how long the syscall
takes (which also implies how long it will take for a termination signal
like Ctrl-C to be handled), the most efficient usage is to specify
a "len" argument equal to UINT64_MAX.
--> This will usually copy the whole file in one gulp, although it is not
      guaranteed to copy everything, given the Linux semantics definition
      of it (an NFSv4.2 server can simply choose to copy less, for example).
      --> This allows the kernel to use whatever block size works efficiently
            and does not require an allocation of a large userspace buffer for
            the data, nor that the data be copied to/from userspace.

The problems with doing the whole file in one gulp are:
- A large file can take quite a while and any signal won't be processed until
  the gulp is done.
  --> If you wrote a program that allocated a 100Gbyte buffer and then
        copied a file using read(2)/write(2) with a size of 100Gbytes in a loop,
        you'd end up with the same result.
- As kib@ noted, if the input file never reports EOF (as /dev/zero does),
  then the "one gulp" wouldn't end until storage is exhausted on the
  output file(s) device and Ctrl-C wouldn't stop it (since it is one big
  syscall).
  --> As such, I suggested that, if the syscall is extended to allow VCHR,
        the "len" argument be clipped at "K Mbytes" for that case, to
        avoid filling the storage device before being able to Ctrl-C out
        of it.
I suppose the answer for #3 is...
- smaller "len" allows for quicker response to signals
but
- smaller "len" results in less efficient use of the syscall.

Your patch for "cp" seemed fine, but used a small "len" and, as such,
made the use of copy_file_range(2) less efficient.

All I see the wrapper doing is handling the VCHR case (if the syscall remains
as it is now and returns EINVAL to be compatible with Linux) and making
some rather arbitrary choice w.r.t. how big "len" should be.
--> Choosing an appropriate "len" might better be left to the specific use
      case, I think?

In summary, it's mostly whether VCHR gets handled by the syscall or a
wrapper?

> 1) In order to quickly respond to a signal, a program must use a modest len with
> copy_file_range
Does this matter? Or put another way, is a 1-2sec delay in response to Ctrl-C
an issue for "cp"?
When kib@ reviewed the syscall, he did not see the delay in signal handling
as a significant problem, noting that it is no different than a large process core
dumping.

> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
> truncate the output file to the middle of the hole. Then, in the next invocation,
> truncate it again to a larger size.
> 3) The result is a file that is not as sparse as the original.
>
> For example, on UFS:
> $ truncate -s 1g sparsefile
> $ cp sparsefile sparsefile2
> $ du -sh sparsefile*
> 96K sparsefile
> 32M sparsefile2
If you care about maintaining sparseness, a "len" of 100Mbytes or more would
be a reasonable choice. Since "cp" has never maintained sparseness, I didn't
suggest such a size when I reviewed your patch for "cp".
--> I/O subsystem performance varies widely, but I think 100Mbytes will limit
      the delay in signal handling to about 1sec. Isn't that quick enough?

> My idea for a userland wrapper would solve this problem by using
> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
> everything else with a modest len. Alternatively, we could eliminate the need for
> the wrapper by enabling copy_file_range for every file system, and making
> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
> large len without penalizing signal handling performance.
The problem with doing this is it largely defeats the purpose of copy_file_range().
1 - What about file systems that do not support SEEK_DATA/SEEK_HOLE?
    (All NFS mounts except NFSv4.2 ones against servers that support the
    NFSv4.2 Seek operation are in this category.)
2 - For NFSv4.2 with servers that support Seek, the copy of an entire file
    can be done via a few (or only one) RPC if you make "len" large and
    don't use Seek.
    If you combine using Seek with len == 2Mbytes, then you do a lot more RPCs
    with associated overheads and RPC RTT delays. You still avoid moving all
    the data across the wire, but you do lose a lot of the performance advantage.

I could have made copy_file_range(2) a lot simpler if the generic code didn't
try and maintain holes, but I wanted it to work well for file systems that did
not support SEEK_DATA/SEEK_HOLE.

I'd suggest you try patching "cp" to use a 100Mbyte "len" for copy_file_range()
and test that.
You should find the sparseness is mostly maintained and that you can Ctrl-C
out of a large file copy without undue delay.
Then try it over NFS mounts (both v4.2 and v3) for the same large sparse file.

You can also code up a patched "cp" using SEEK_DATA/SEEK_HOLE and see
how they compare.

rick


-Alan