From: Rick Macklem
To: Alan Somers
CC: FreeBSD Hackers, Konstantin Belousov
Subject: Re: RFC: copy_file_range(3)
Date: Fri, 25 Sep 2020 16:26:43 +0000

[the indentation seems to be a bit messed up, so I'll skip to near the end...]
On Wed, Sep 23, 2020 at 9:08 AM Rick Macklem wrote:
Rick Macklem wrote:
>Alan Somers wrote:
>[lots of stuff snipped]
>>1) In order to quickly respond to a signal, a program must use a modest len with
>>copy_file_range
>For the programs you have mentioned, I think the only signal handling would
>be termination (Ctrl-C or SIGTERM if you prefer).
>I'm not sure what is a reasonable response time for this.
>I'd like to hear comments from others.
>- 1sec, less than 1sec, a few seconds, ...
>
>> 2) If a hole is larger than len, that will cause vn_generic_copy_file_range to
>> truncate the output file to the middle of the hole.  Then, in the next invocation,
>> truncate it again to a larger size.
>> 3) The result is a file that is not as sparse as the original.
>Yes. So, the trick is to use the largest "len" you can live with, given how long you
>are willing to wait for signal processing.
>
>> For example, on UFS:
>> $ truncate -s 1g sparsefile
>Not a very interesting sparse file. I wrote a little program to create one.
>> $ cp sparsefile sparsefile2
>> $ du -sh sparsefile*
>> 96K sparsefile
>> 32M sparsefile2
Btw, this happens because, at least for UFS (not sure about other file
systems), if you grow a file's size via VOP_SETATTR() of size, it allocates a
block at the new EOF, even though no data has been written there.
--> This results in one block being allocated at the end of the range used
    for a copy_file_range() call, if that file offset is within a hole.
    --> The larger the "len" argument, the less frequently it will occur.

>>
>> My idea for a userland wrapper would solve this problem by using
>> SEEK_HOLE/SEEK_DATA to copy holes in their entirety, and use copy_file_range for
>> everything else with a modest len.
>> Alternatively, we could eliminate the need for
>> the wrapper by enabling copy_file_range for every file system, and making
>> vn_generic_copy_file_range interruptible, so copy_file_range can be called with
>> large len without penalizing signal handling performance.
>
>Well, I ran some quick benchmarks using the attached programs, plus "cp" both
>before and with your copy_file_range() patch.
>copya - Does what I think your plan is above, with a limit of 2Mbytes for "len".
>copyb - Just uses copy_file_range() with 128Mbytes for "len".
>
>I first created the sparse file with createsparse.c. It is admittedly a worst case,
>creating alternating holes and data blocks of the minimum size supported by
>the file system. (I ran it on a UFS file system created with defaults, so the minimum
>hole size is 32Kbytes.)
>The file is 1Gbyte in size with an Allocation size of 524576 ("ls -ls").
>
>I then ran copya, copyb, old-cp and new-cp. For NFS, I redid the mount before
>each copy to avoid data caching in the client.
>Here's what I got:
>         Elapsed time  #RPCs                  Allocation size ("ls -ls" on server)
>NFSv4.2
>copya    39.7sec       16384copy+32768seek    524576
>copyb    10.2sec       104copy                524576
When I ran the tests I had vfs.nfs.maxcopyrange set to 128Mbytes on the
server. However it was still the default of 10Mbytes on the client,
so this test run used 10Mbytes per Copy.
(I wondered why it did 104 Copies?)
With both set to 128Mbytes I got:
copyb    10.0sec       8copy                  524576
>old-cp   21.9sec       16384read+16384write   1048864
>new-cp   10.5sec       1024copy               524576
>
>NFSv4.1
>copya    21.8sec       16384read+16384write   1048864
>copyb    21.0sec       16384read+16384write   1048864
>old-cp   21.8sec       16384read+16384write   1048864
>new-cp   21.4sec       16384read+16384write   1048864
>
>Local on the UFS file system
>copya    9.2sec        n/a                    524576
This turns out to be just variability in the test. I get 7.9sec->9.2sec
for runs of all three of copya, copyb and new-cp for UFS.
I think it is caching related, since I wasn't unmounting/remounting the
UFS file system between test runs.
>copyb    8.0sec        n/a                    524576
>old-cp   15.9sec       n/a                    1048864
>new-cp   7.9sec        n/a                    524576
>
>So, for a NFSv4.2 mount, using SEEK_DATA/SEEK_HOLE is definitely
>a performance hit, due to all the RPC rtts.
>Your patched "cp" does fine, although a larger "len" reduces the
>RPC count against the server.
>All variants using copy_file_range() retain the holes.
>
>For NFSv4.1, it (not surprisingly) doesn't matter, since only NFSv4.2
>supports SEEK_DATA/SEEK_HOLE and VOP_COPY_FILE_RANGE().
>
>For UFS, everything using copy_file_range() works pretty well and
>retains the holes.

>Although "copya" is guaranteed to retain the holes, it does run noticeably
>slower than the others. Not sure why? Do the extra SEEK_DATA/SEEK_HOLE
>syscalls cost that much?
Ignore this. It was just variability in the test runs.

>The limitation of not using SEEK_DATA/SEEK_HOLE is that you will not
>retain holes that straddle the byte range copied by two subsequent
>copy_file_range(2) calls.
This statement is misleading.
These holes are partially retained, but there
will be a block allocated (at least for UFS) at the boundary, due to the property of
growing a file via VOP_SETATTR(size) as noted above.

>--> This can be minimized by using a large "len", but that large "len"
>    results in slower response to signal handling.
I'm going to play with "len" today and come up with some numbers
w.r.t. signal handling response time vs the copy_file_range() "len" argument.

>I've attached the little programs, so you can play with them.
>(Maybe try different sparse schemes/sizes? It might be fun to
> make the holes/blocks some random multiple of hole size up
> to a limit?)
>
>rick
>ps: In case he isn't reading hackers these days, I've added kib@
>    as a cc. He might know why UFS is 15% slower when SEEK_HOLE/
>    SEEK_DATA is used.
Alan Somers wrote:
> So it sounds like your main point is that for file systems with special support,
> copy_file_range(2) is more efficient for many sparse files than
> SEEK_HOLE/SEEK_DATA.
Well, for NFSv4.2 this is true. Who knows w.r.t. others in the future.

> Sure, I buy that.  And secondarily, you don't see any reason not to increase the
> len argument in commands like cp up to somewhere around 128 MB, delaying
> signal handling for about 1 second on a typical desktop (maybe set it lower on
> embedded arches).
When I did some testing on my hardware (laptops with slow spinning disks),
I got up to about 2sec delay for 128Mbytes and up to about 1sec delay for
64Mbytes.
I got a post that suggested that 1sec should be the target and
haven't heard differently from anyone else.

Currently, there is a sysctl for NFS that clips the size of a copy_file_range(),
so that RPC response is reasonable (1sec or less).
Maybe that sysctl should be replaced with a generic one for copy_file_range()
with a default of 64->128Mbytes. (I might make NFS use 1/2 of the sysctl
value, since the RPC response time shouldn't exceed 1sec.)
Does this sound reasonable?

> And you think it's fine to allow copy_file_range on devfs, as long as the len
> argument is clipped at some finite value.  If we make all of those changes, are
> there any other reasons why the write/read fallback path would be needed?
I'm on the fence w.r.t. this one. I understand why you would prefer a call that
worked for special files, but I also like the idea that it is "Linux compatible".

I'd like to hear feedback from others on this.
Maybe I'll try asking this question separately on freebsd-current@ and
see if I can get others to respond.

rick

-Alan
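
For anyone who wants to experiment with the "len" trade-off discussed above,
here is a minimal sketch of a copy loop that calls copy_file_range(2) with a
capped "len" and only looks for a termination request between calls. It is
not the attached copyb.c (which is not reproduced here). The names
copy_chunked(), copy_path(), request_stop() and stop_requested are made up for
illustration, and the _GNU_SOURCE define is only an assumption so that the
same file also builds against glibc.

```c
#define _GNU_SOURCE	/* exposes copy_file_range() on glibc; harmless on FreeBSD */
#include <sys/types.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/* Set by the (hypothetical) handler below to ask the copy loop to stop. */
static volatile sig_atomic_t stop_requested = 0;

static void
request_stop(int sig)
{
	(void)sig;
	stop_requested = 1;
}

/*
 * Copy from infd to outfd, starting at their current file offsets, in
 * pieces of at most "chunk" bytes.  The stop flag is checked only between
 * calls, so "chunk" bounds how long a pending signal can go unanswered.
 * Returns the number of bytes copied, or -1 with errno set (EINTR when
 * the copy was stopped early).
 */
static off_t
copy_chunked(int infd, int outfd, size_t chunk)
{
	off_t total = 0;
	ssize_t n;

	assert(chunk > 0);
	while (!stop_requested) {
		n = copy_file_range(infd, NULL, outfd, NULL, chunk, 0);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* loop re-checks the stop flag */
			return (-1);
		}
		if (n == 0)
			return (total);		/* EOF on the input file */
		total += n;
	}
	errno = EINTR;
	return (-1);
}

/* Open src and dst by path and copy with the given chunk size. */
static int
copy_path(const char *src, const char *dst, size_t chunk)
{
	struct sigaction sa = { .sa_handler = request_stop };
	off_t copied;
	int infd, outfd, saved;

	/* Treat SIGINT/SIGTERM as "stop after the current chunk". */
	sigemptyset(&sa.sa_mask);
	sigaction(SIGINT, &sa, NULL);
	sigaction(SIGTERM, &sa, NULL);

	if ((infd = open(src, O_RDONLY)) < 0)
		return (-1);
	if ((outfd = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0) {
		saved = errno;
		close(infd);
		errno = saved;
		return (-1);
	}
	copied = copy_chunked(infd, outfd, chunk);
	saved = errno;
	close(infd);
	close(outfd);
	errno = saved;
	return (copied < 0 ? -1 : 0);
}
```

Calling copy_path("sparsefile", "sparsefile2", 64 * 1024 * 1024) would use the
64Mbyte value mentioned above. A larger chunk means fewer calls (and, on UFS,
fewer blocks allocated at chunk boundaries that fall inside holes), at the
cost of a longer wait before a signal is acted on.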