Date: Thu, 6 Nov 2025 18:01:36 -0800
From: Rick Macklem <rick.macklem@gmail.com>
To: Aurélien Couderc <aurelien.couderc2002@gmail.com>
Cc: freebsd-hackers@freebsd.org
Subject: Re: Implementing VOP_READPLUS() in FreeBSD 15?
Message-ID: <CAM5tNy60PdVx0E_rB=x2c=wG33sM8F0FbTCXuAkGbaqk%2Bj%2BpiA@mail.gmail.com>
In-Reply-To: <CA%2B1jF5rCb8Kx=9pPXtC=dwoCz88waBJeSkADeCwtZOONrKi2Ug@mail.gmail.com>
References: <CA%2B1jF5rCb8Kx=9pPXtC=dwoCz88waBJeSkADeCwtZOONrKi2Ug@mail.gmail.com>
On Thu, Nov 6, 2025 at 11:40 AM Aurélien Couderc
<aurelien.couderc2002@gmail.com> wrote:
>
> This is a followup to a discussion with the nfs-ganesha developers.
>
> Could FreeBSD implement a VOP_READPLUS() in FreeBSD 15, please?
>
> Citing Lionel Cons/CERN:
> > But the point is to optimise the read(). First, you have less traffic
> > over the wire (which is a thing if your reads are in the gigabyte range
> > for large VMs), and it tells the VM host that it can just map all those
> > MMU pages representing the hole to the "default zero page", which in
> > turn saves lots of space in the L3 and L2 caches ----> THIS DOES
> > WONDERS to VM performance.
> >
> > Example:
> > The performance benefit here comes from the fact that instead of
> > mapping a 1TB hole (1099511627776 bytes) to individual 524288 2M pages
> > (x86 2M hugepage size), and then potentially reading from them, you
> > just have ONE 2M page in the cache, and all reads come from that.
> >
> > READ_PLUS is THE game changer for that kind of application, especially
> > in our case (HPC simulations).

Why doesn't the application use lseek(SEEK_DATA/SEEK_HOLE) and only read(2)
the data segments?
This is implemented now in FreeBSD and in several other POSIX-like OSs and
avoids problems like filling the buffer cache with blocks of all zeros or
returning a lot of blocks with all zeros to the application via read(2).
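
For illustration, the lseek(SEEK_DATA/SEEK_HOLE) approach could look roughly
like the sketch below. This is not code from this thread: the helper name
read_data_only() and the 64K buffer size are made up for the example, and
error handling is kept to a minimum.

```c
#define _GNU_SOURCE	/* exposes SEEK_DATA/SEEK_HOLE on glibc; not needed on FreeBSD */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (the name is an assumption for this example):
 * visit only the data segments of fd and skip the holes.
 * Returns the number of bytes read from data segments, or -1 on error.
 */
static off_t
read_data_only(int fd)
{
	char buf[65536];
	off_t data, hole, total = 0;

	for (off_t pos = 0;; pos = hole) {
		/* Find the start of the next data segment at or after pos. */
		data = lseek(fd, pos, SEEK_DATA);
		if (data == -1)
			/* ENXIO: no more data before end of file. */
			return (errno == ENXIO ? total : -1);

		/* Find where that segment ends (EOF counts as a hole). */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole == -1)
			return (-1);

		/* Read just [data, hole); pread(2) leaves the file offset alone. */
		while (data < hole) {
			size_t want = sizeof(buf);
			if ((off_t)want > hole - data)
				want = (size_t)(hole - data);
			ssize_t n = pread(fd, buf, want, data);
			if (n == -1)
				return (-1);
			if (n == 0)
				return (total);	/* file shrank underneath us */
			/* An application would process buf[0..n) here. */
			data += n;
			total += n;
		}
	}
}
```

A caller would open(2) the sparse file and pass the descriptor in. On a file
system that does not track holes, the whole file is typically reported as one
data segment, so the loop degrades to an ordinary sequential read.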

Right now, I am not aware of any read_plus(2) syscall (please correct me if
I am wrong on this), so applications that read(2) sparse files without
bothering to do lseek(SEEK_DATA/SEEK_HOLE) will get a lot of 0s to process.

To do VOP_READPLUS() is a lot of work. Once the VOP_READPLUS() is defined,
there needs to be implementations in the various local fs (ZFS, UFS, ...).
That requires work by people who know these areas. I am only minimally
conversant with either ZFS or UFS and would not want to attempt to do a good
VOP_READPLUS() implementation for either of them. (Without fs specific
implementations, there isn't much point in doing it, imho.)

If VOP_READPLUS() is done, but there is no readplus(2) syscall, then the
applications still get globs of 0s in the read(2) reply (assuming the
application doesn't bother to use lseek(SEEK_DATA/SEEK_HOLE) to skip over
the holes in a sparse file).
--> Even if FreeBSD were to "go out on a limb" and implement a
    readplus(2) syscall, who would use it? (Not anyone implementing
    a POSIX compliant application nor anyone implementing a Linux
    application.)
--> Until Linux adds some syscall like readplus(2), someday maybe,
    I still question how useful VOP_READPLUS() is, even if it has
    fs specific implementations.
At least that's how I see it, rick
>
> I just played with that:
>
> 1. Intel XEON with 512GB
> 2. loading 16 files with 64GB sparse files which are only holes
> 3. create kernel core dump
> Result: Almost all pages in the file cache are zero bytes.
>
> VOP_READPLUS() would optimize this case, and map all ranges belonging
> to sparse file holes into the same read-only MMU page representing a
> physical address range containing zero bytes. Because it's the same
> physical memory it would consume very little L2/L3 cache space, and
> save space in the filesystem cache too.
>
> Aurélien
> --
> Aurélien Couderc <aurelien.couderc2002@gmail.com>
> Big Data/Data mining expert, chess enthusiast
>
