Date: Wed, 20 Aug 2025 09:17:45 +0900
From: Wanpeng Qian <wanpengqian@gmail.com>
To: Corvin Köhne <corvink@freebsd.org>
Cc: Peter Grehan <grehan@freebsd.org>, virtualization@freebsd.org, Oleksandr Kryvulia <shuriku@shurik.kiev.ua>
Subject: Re: bhyve passthru problem
Message-ID: <CANBJ%2BxSMEX%2B8eaPRiy3sGP0Ro2y_WcVdDTDay3dotp-vb_n4jQ@mail.gmail.com>
In-Reply-To: <5473b6f9d3e542b45d9c7ef3e28c57b2f937ab79.camel@FreeBSD.org>
References: <a63589a8-2cb2-4952-83b1-7a97e2f8cd44@shurik.kiev.ua> <38c9656c26fc3cee7ba733168c0fa2cdd01209d9.camel@FreeBSD.org> <c8c87fc3-2665-44c3-a8cf-6dcbd6525c38@freebsd.org> <5473b6f9d3e542b45d9c7ef3e28c57b2f937ab79.camel@FreeBSD.org>
Hi all,
I am working with a PCIe x16 -> x4x4x4x4 quad-bifurcation carrier
hosting 4 SSDs.
One NVMe controller exposes a second BAR of 256 bytes (0x100).
nvme0@pci0:129:0:0: class=0x010802 rev=0x01 hdr=0x00 vendor=0x144d
device=0xa802 subvendor=0x144d subdevice=0xa801
    vendor     = 'Samsung Electronics Co Ltd'
    device     = 'NVMe SSD Controller SM951/PM951'
    class      = mass storage
    subclass   = NVM
    bar   [10] = type Memory, range 64, base 0xfbe00000, size 16384, enabled
    bar   [18] = type Memory, range 32, base 0xfbe04000, size 256, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 8 messages, 64 bit
    cap 10[70] = PCI-Express 2 endpoint max data 128(128) FLR NS
                 max read 512
                 link x4(x4) speed 8.0(8.0) ClockPM disabled
    cap 11[b0] = MSI-X supports 9 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = L1 PM Substates 1
On FreeBSD, bhyve refuses to start, reporting that the BAR base or
size is not page aligned.
After some digging, I believe it is safe to pass through the entire 4
KiB page when, and only when, no other device's BAR (different BDF)
falls within the same host physical page. This mirrors what QEMU/VFIO
does for sub-page BARs.
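To make the page arithmetic concrete, here is a small standalone sketch (not code from the patch; the helper names and the 4 KiB page constant are assumptions made for this illustration). It checks whether two BARs touch a common host page, using the base/size values from the pciconf output above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4 KiB host page */

/* Hypothetical helper: page frame number of the first page a BAR touches. */
static uint64_t
first_pfn(uint64_t base)
{
	return (base / SKETCH_PAGE_SIZE);
}

/* Hypothetical helper: page frame number of the last page a BAR touches. */
static uint64_t
last_pfn(uint64_t base, uint64_t size)
{
	assert(size > 0);	/* an empty BAR touches no page */
	return ((base + size - 1) / SKETCH_PAGE_SIZE);
}

/* True if the two BARs touch at least one common host page. */
static bool
bars_share_page(uint64_t base_a, uint64_t size_a,
    uint64_t base_b, uint64_t size_b)
{
	return (first_pfn(base_a) <= last_pfn(base_b, size_b) &&
	    first_pfn(base_b) <= last_pfn(base_a, size_a));
}
```

For the device above, bars_share_page(0xfbe00000, 16384, 0xfbe04000, 256) is false: BAR 0x10 ends at 0xfbe03fff, so the 256-byte BAR sits alone in the page starting at 0xfbe04000. A second 256-byte BAR at, say, 0xfbe04100 would share that page.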
I have implemented the same "exclusive page" check in bhyve and posted
a patch:
Review: https://reviews.freebsd.org/D52013
Background (QEMU/VFIO discussion):
https://lists.nongnu.org/archive/html/qemu-devel/2021-09/msg02908.html
Patch summary:
- Build a table of occupied MMIO pages using PCIOCGETCONF / PCIOCGETBAR.
- For memory BARs with size < PAGE_SIZE:
  - Require the BAR base to be page-aligned (unchanged).
  - If the 4 KiB page is exclusive to that BDF, allow passthrough and set
    the host mapped_size = PAGE_SIZE, while keeping the guest-visible BAR
    size unchanged.
  - Otherwise, keep rejecting.
- No change for I/O BARs, or for memory BARs with size >= PAGE_SIZE
  (still must be a multiple of PAGE_SIZE).
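The decision logic summarized above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code in D52013: the struct and function names are hypothetical, the table is a plain in-memory array rather than the result of a PCIOCGETCONF / PCIOCGETBAR scan, and a return value of 0 stands in for "reject".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL
#define SKETCH_PAGE_MASK (SKETCH_PAGE_SIZE - 1)

struct bdf { int bus, slot, func; };

/* One host memory BAR, as a scan of the host PCI devices might record it. */
struct host_bar { struct bdf owner; uint64_t base; uint64_t size; };

static bool
same_bdf(struct bdf a, struct bdf b)
{
	return (a.bus == b.bus && a.slot == b.slot && a.func == b.func);
}

/* True if no BAR belonging to a different BDF touches the given page. */
static bool
page_is_exclusive(const struct host_bar *tbl, size_t n, struct bdf owner,
    uint64_t page_base)
{
	for (size_t i = 0; i < n; i++) {
		if (same_bdf(tbl[i].owner, owner) || tbl[i].size == 0)
			continue;
		uint64_t first = tbl[i].base & ~SKETCH_PAGE_MASK;
		uint64_t last = (tbl[i].base + tbl[i].size - 1) &
		    ~SKETCH_PAGE_MASK;
		if (page_base >= first && page_base <= last)
			return (false);
	}
	return (true);
}

/* Host mapping size for a memory BAR, or 0 if it must be rejected. */
static uint64_t
passthru_mapped_size(const struct host_bar *tbl, size_t n, struct bdf owner,
    uint64_t base, uint64_t size)
{
	assert(size > 0);
	if ((base & SKETCH_PAGE_MASK) != 0)
		return (0);		/* base must be page-aligned */
	if (size >= SKETCH_PAGE_SIZE)	/* unchanged rule for large BARs */
		return ((size & SKETCH_PAGE_MASK) == 0 ? size : 0);
	/* Sub-page BAR: map the whole page only if no other BDF uses it. */
	return (page_is_exclusive(tbl, n, owner, base) ? SKETCH_PAGE_SIZE : 0);
}
```

With only the NVMe controller's own two BARs in the table, the 256-byte BAR at 0xfbe04000 yields a mapped size of 4096. Adding a made-up second device with a BAR at 0xfbe04800 makes the same call return 0.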
Testing:
- Host: FreeBSD 14.3R with IOMMU enabled.
- Devices: NVMe controller on a quad-bifurcation card; also tested
  alongside an NVIDIA GTX 1080 Ti as another ppt device.
- Guest: Windows 10 installs and boots from the passed-through NVMe device.
- No regressions observed with devices that have >= 4 KiB BARs.
Feedback welcome on:
- The placement and lifetime of the occupied-page cache.
- Whether to gate this behind a runtime knob (e.g., -W sub4k=auto|off|force).
- Any additional scenarios you would like me to test.
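If a knob along those lines were added, parsing its value could look roughly like the sketch below. Nothing here exists in bhyve today: the enum, the function name, and the meanings noted in the comments are assumptions for discussion only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical modes for the proposed sub4k knob (semantics assumed). */
enum sub4k_mode {
	SUB4K_AUTO,	/* assumed: allow only when the page is exclusive */
	SUB4K_OFF,	/* assumed: keep rejecting all sub-page BARs */
	SUB4K_FORCE,	/* assumed: allow without the exclusivity check */
	SUB4K_INVALID	/* unrecognized option string */
};

/* Parse a string of the form "sub4k=<auto|off|force>". */
static enum sub4k_mode
parse_sub4k(const char *opt)
{
	static const char prefix[] = "sub4k=";
	const char *val;

	assert(opt != NULL);	/* caller passes the option argument */
	if (strncmp(opt, prefix, sizeof(prefix) - 1) != 0)
		return (SUB4K_INVALID);
	val = opt + sizeof(prefix) - 1;
	if (strcmp(val, "auto") == 0)
		return (SUB4K_AUTO);
	if (strcmp(val, "off") == 0)
		return (SUB4K_OFF);
	if (strcmp(val, "force") == 0)
		return (SUB4K_FORCE);
	return (SUB4K_INVALID);
}
```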
Best regards,
Qian
On Mon, Jun 17, 2024 at 3:21 PM Corvin Köhne <corvink@freebsd.org> wrote:
>
> On Fri, 2024-06-14 at 17:50 +1000, Peter Grehan wrote:
> > > I don't know why bhyve validates the BAR size. The commit adding this
> > > check is old [1] and doesn't explain it. What bhyve could do is
> > > round up the BAR size to a full page size when allocating memory
> > > for the BAR.
> > >
> > > [1] https://github.com/freebsd/freebsd-src/commit/7a902ec0eccc752c9c38533ed123121eaaea1225
> >
> >   At the time, BIOSs would often place device BARs of less than a page
> > size in the same physical page. Since EPT only gives page granularity,
> > this would result in all those devices being available to the guest even
> > if they hadn't been passed through.
> >
> > later,
> >
> > Peter.
> >
> >
>
> Thanks for the explanation!
>
> What can we do about it? Does FreeBSD remap BARs if they aren't page
> aligned? If not, can we verify that the page is only used by a specific
> device?
>
>
> --
> Kind regards,
> Corvin
