List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
From: Rick Macklem
Date: Thu, 6 Nov 2025 18:01:36 -0800
Subject: Re: Implementing VOP_READPLUS() in FreeBSD 15?
To: Aurélien Couderc
Cc: freebsd-hackers@freebsd.org

On Thu, Nov 6, 2025 at 11:40 AM Aurélien Couderc wrote:
>
> This is a followup to a discussion with the nfs-ganesha developers.
>
> Could FreeBSD implement a VOP_READPLUS() in FreeBSD 15, please?
>
> Citing Lionel Cons/CERN:
> > But the point is to optimise the read(). First, you have less traffic over the wire (which is a
> > thing if your reads are in the gigabyte range for large VMs), and it tells the VM host that it
> > can just map all those MMU pages representing the hole to the "default zero page", which
> > in turn saves lots of space in the L3 and L2 caches ----> THIS DOES WONDERS to VM
> > performance.
> >
> > Example:
> > The performance benefit here comes from the fact that instead of mapping a 1TB hole
> > (1099511627776 bytes) to 524288 individual 2M pages (x86 2M hugepage size), and then
> > potentially reading from them, you just have ONE 2M page in the cache, and all reads come
> > from that.
> >
> > READ_PLUS is THE game changer for that kind of application, especially in our case (HPC
> > simulations).

Why doesn't the application use lseek(SEEK_DATA/SEEK_HOLE) and only read(2)
the data segments?
This is implemented now in FreeBSD and in several other POSIX-like OSs
and avoids problems like filling the buffer cache with blocks of all zeros or
returning a lot of blocks with all zeros to the application via read(2).

Right now, I am not aware of any read_plus(2) syscall (please correct me if
I am wrong on this), so applications that read(2) sparse files without
bothering to do lseek(SEEK_DATA/SEEK_HOLE) will get a lot of 0s to process.

To do VOP_READPLUS() is a lot of work.
Once the VOP_READPLUS() is defined, there need to be implementations
in the various local fs (ZFS, UFS, ..). That requires work by people who
know these areas. I am only minimally conversant with either ZFS or UFS
and would not want to attempt to do a good VOP_READPLUS() implementation
for either of them. (Without fs specific implementations, there isn't much
point in doing it, imho.)

If VOP_READPLUS() is done, but there is no readplus(2) syscall, then the
applications still get globs of 0s in the read(2) reply (assuming the
application doesn't bother to use lseek(SEEK_DATA/SEEK_HOLE) to skip
over the holes in a sparse file).
--> Even if FreeBSD were to "go out on a limb" and implement a readplus(2)
     syscall, who would use it? (Not anyone implementing a POSIX compliant
     application nor anyone implementing a Linux application.)
--> Until Linux adds some syscall like readplus(2) (someday, maybe), I still
     question how useful VOP_READPLUS() is, even if it has fs specific
     implementations.
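
For reference, a minimal sketch of the lseek(SEEK_DATA/SEEK_HOLE) loop
mentioned above. The helper name read_data_only() and the 64K buffer are
made up for illustration and are not taken from any existing application.
The _GNU_SOURCE define is only there so that glibc exposes SEEK_DATA and
SEEK_HOLE; FreeBSD should not need it. On a file system without hole
support, the whole file is simply reported as one data segment.

```c
#define _GNU_SOURCE	/* exposes SEEK_DATA/SEEK_HOLE on glibc */
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read only the data segments of fd, skipping holes.
 * Returns the number of data bytes read, or -1 on error.
 */
off_t
read_data_only(int fd)
{
	char buf[65536];
	off_t data, hole, total = 0;

	for (hole = 0;;) {
		/* Find the start of the next data segment at or after "hole". */
		data = lseek(fd, hole, SEEK_DATA);
		if (data == -1)
			/* ENXIO means there is no more data before EOF. */
			return (errno == ENXIO ? total : -1);

		/* Find where that segment ends (EOF counts as a hole). */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole == -1)
			return (-1);

		while (data < hole) {
			size_t want = sizeof(buf);
			ssize_t n;

			if ((off_t)want > hole - data)
				want = (size_t)(hole - data);
			n = pread(fd, buf, want, data);
			if (n == -1)
				return (-1);
			if (n == 0)	/* file shrank underneath us */
				return (total);
			/* ... process n bytes of real data in buf here ... */
			data += n;
			total += n;
		}
	}
}
```

Called on a sparse file, this never reads from the holes, so no blocks of
zeros end up in the buffer cache or in the application's buffer.
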
At least that's how I see it, rick

>
> I just played with that:
>
> 1. Intel XEON with 512GB
> 2. loading 16 files with 64GB sparse files which are only holes
> 3. create kernel core dump
> Result: Almost all pages in the file cache are zero bytes.
>
> VOP_READPLUS() would optimize this case, and map all ranges belonging
> to sparse file holes into the same read-only MMU page representing a
> physical address range containing zero bytes. Because it's the same
> physical memory it would consume very little L2/L3 cache space, and
> save space in the filesystem cache too.
>
> Aurélien
> --
> Aurélien Couderc
> Big Data/Data mining expert, chess enthusiast
>