Date:      Sat, 1 Mar 2025 14:23:43 -0700
From:      Warner Losh <imp@bsdimp.com>
To:        Ravi Pokala <rpokala@freebsd.org>
Cc:        "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject:   Re: PCI topology-based hints
Message-ID:  <CANCZdfp2NWGiitbGvgYZnHWZziOuyM8RKAmym=AABAiiraxPTw@mail.gmail.com>
In-Reply-To: <60B4BAAA-E333-4219-99BE-D6C1B198E0BD@freebsd.org>
References:  <60B4BAAA-E333-4219-99BE-D6C1B198E0BD@freebsd.org>

On Fri, Feb 28, 2025 at 12:43 AM Ravi Pokala <rpokala@freebsd.org> wrote:

> Hi folks,
>
> Setting up device attachment hints based on PCI address is easy; it's
> right there in the manual (pci.4):
>
> | DEVICE WIRING
> |      You can wire the device unit at a given location with device.hints.
> |      Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
> |      hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver name to
> |      probe and attach at unit unit for any PCI device found to match the
> |      specification, where:
> | ...
> |    Examples
> |      Given the following lines in /boot/device.hints:
> |      hint.nvme.3.at="pci6:0:0" hint.igb.8.at="pci14:0:0" If there is a device
> |      that supports igb(4) at PCI bus 14 slot 0 function 0, then it will be
> |      assigned igb8 for probe and attach.  Likewise, if there is an nvme(4)
>
> That's all well and good in a world without pluggable and hot-swappable
> devices, but things get trickier when devices can appear and disappear.
>
> We have systems which have multiple U.2 bays, which take NVMe PCIe
> devices. Across multiple reboots, the <D, B, S, F> address assigned to the
> device in each of those bays was consistent. Great! We set up wiring hints
> for those devices, and confirmed that the wiring worked when devices were
> swapped ...
>
> ... until we added a NIC into the hot-swap OCP slot and rebooted.
>
> While things continued to work before the reboot, upon reboot, many
> addresses changed. It looks like the slot into which the NIC was installed
> is on the same segment of the bus as the U.2 bays. When that segment was
> enumerated, the addresses got shuffled to include the NIC.
>
> So, we can't necessarily rely on the PCI <D, B, S, F> address. But the
> PCIe topology is consistent, even when devices are added and removed --
> it's the physical wiring between the root complex, bridges, devices, and
> expansion slots.
>
> The `lspci' utility -- ubiquitous on Linux, and available via the
> "sysutils/pciutils" port on FreeBSD -- can show the topology. For example,
> consider three NVMe devices, reported by `pciconf', and by `lspci's tree
> view (device details redacted):
>
> | % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
> | nvme2 pci0:65:0:0: ...
> | nvme0 pci0:133:0:0: ...
> | nvme1 pci0:137:0:0: ...
> | %
> | % lspci -vt | grep -C2 -E '^..-|NVMe'
> | -+-[0000:00]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-18.6  ...
> |  |           \-18.7  ...
> |  +-[0000:40]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> |  |           +-01.0  ...
> |  |           +-01.1-[41]----00.0  ${VENDOR} NVMe
> |  |           +-01.3-[42-43]--
> |  |           +-01.4-[44-45]--
> | --
> |  |           |            \-00.1  ...
> |  |           \-07.2  ...
> |  +-[0000:80]-+-00.0  Root Complex
> |  |           +-00.2  ...
> |  |           +-00.3  ...
> | --
> |  |           +-03.0  ...
> |  |           +-03.1-[83-84]--
> |  |           +-03.2-[85-86]----00.0  ${VENDOR} NVMe
> |  |           +-03.3-[87-88]--
> |  |           +-03.4-[89-8a]----00.0  ${VENDOR} NVMe
> |  |           +-04.0  ...
> |  |           +-05.0  ...
> | --
> |  |           |            \-00.1  ...
> |  |           \-07.2  ...
> |  \-[0000:c0]-+-00.0  Root Complex
> |              +-00.2  ...
> |              +-00.3  ...
>
> The first set of xdigits, "[0000:n0]" are a "domain" and "bus", which are
> only shown for the Root Complex devices. The second set of xdigits, "xy.z",
> are either an endpoint's "slot" and "function", or else a bridge device's
> (address?) and (slot?). If there is a bridge, there is a set of xdigits in
> brackets next to each (slot?), which becomes the "bus" of the attached
> endpoint, and then "xy.z", which is the endpoint's "slot" and "function".
>
> Thus, we can see from the tree that the NVMe devices are "0000:41:00.0",
> "0000:85:00.0", and "0000:89:00.0". (Which, if you convert to decimal, is
> the same as reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0",
> "pci0:137:0:0".) It is also apparent that the latter two devices are
> connected to the same bridge, which in turn is connected to a different
> root complex than the first device.
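
The hex-to-decimal correspondence described above can be checked mechanically.
A minimal sketch in Python, assuming the lspci form "dddd:bb:ss.f" (hex) and
the pciconf selector form "pciD:B:S:F" (decimal) shown in the listings; the
function name lspci_to_pciconf is made up for illustration:

```python
def lspci_to_pciconf(addr):
    """Convert an lspci address such as '0000:41:00.0' (all fields hex)
    into a pciconf-style selector such as 'pci0:65:0:0' (all fields decimal)."""
    domain, bus, slot_func = addr.split(":")
    slot, func = slot_func.split(".")
    fields = [int(x, 16) for x in (domain, bus, slot, func)]
    return "pci%d:%d:%d:%d" % tuple(fields)


for a in ("0000:41:00.0", "0000:85:00.0", "0000:89:00.0"):
    print(a, "->", lspci_to_pciconf(a))
```

Only the bus field differs between the three devices here, and the bus field
is exactly the part that moves when the segment is re-enumerated.
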
>
> The problem is, depending on what devices are connected to a given root
> complex, the "bus" component which is associated with a bridge slot can
> change. In the example above, with the current population of devices in the
> "0000:80" portion of the tree, the "bus" components associated with bridge
> "03" are "83", "85", "87", and "89". But add another device to "0000:80"
> and reboot, and the addresses associated with bridge "03" become "84",
> "86", "88", and "8a".
>
> The question is this: How do I indicate that I would like a certain device
> unit to be wired to a specific bridge device address and slot -- which
> cannot change -- rather than to a specific <D, B, S, F>, where the "B"
> component can change?
>
> Any thoughts?
>

Yes. You can use what's already there, though it is undocumented or at the
very least underdocumented. You can wire devices to the UEFI path, which is
guaranteed to be unique and avoids all these problems.

hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"

That path starts at PCIe root complex 2, follows device 1 function 1 on that
bus, and ends at device 0 function 0 on the bus behind that bridge.
`devctl getpath UEFI nvme0` will do all the heavy lifting for you. TaDa! No
bus numbers.
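
For anyone scripting this, here is a rough sketch in Python of turning a path
string into a hint line. It is illustrative only: the helper names are made up,
the exact text printed by `devctl getpath` is assumed rather than verified (an
optional "UEFI:" prefix is tolerated either way), and only PcieRoot/PciRoot and
Pci nodes are handled.

```python
import re

# Assumed node shapes: a root node followed by Pci(device,function) hops.
_ROOT = re.compile(r"^Pc(?:i|ie)Root\((0[xX][0-9a-fA-F]+|\d+)\)$")
_HOP = re.compile(r"^Pci\((0[xX][0-9a-fA-F]+|\d+),(0[xX][0-9a-fA-F]+|\d+)\)$")


def _num(text):
    """Parse a decimal or 0x-prefixed hex number."""
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)


def parse_uefi_pci_path(path):
    """Split 'PcieRoot(r)/Pci(d,f)/...' into (root, [(device, function), ...])."""
    path = path.strip()
    if path.startswith("UEFI:"):
        path = path[len("UEFI:"):]
    nodes = path.split("/")
    root = _ROOT.match(nodes[0])
    if root is None:
        raise ValueError("expected a root node first, got %r" % nodes[0])
    hops = []
    for node in nodes[1:]:
        hop = _HOP.match(node)
        if hop is None:
            raise ValueError("unsupported path node %r" % node)
        hops.append((_num(hop.group(1)), _num(hop.group(2))))
    return _num(root.group(1)), hops


def wiring_hint(driver, unit, path):
    """Format a device.hints line wiring driver+unit to a UEFI path."""
    path = path.strip()
    parse_uefi_pci_path(path)  # validation only; the path text is kept as given
    if not path.startswith("UEFI:"):
        path = "UEFI:" + path
    return 'hint.%s.%d.at="%s"' % (driver, unit, path)


print(wiring_hint("nvme", 77, "PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"))
```

The parse step exists only to catch a mangled path before it lands in
/boot/device.hints; the hint itself carries the path string unchanged, and no
bus number appears anywhere in it.
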

I added this several years ago to solve exactly this problem, or what
happens when you lose a riser card, etc.

Warner

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CANCZdfp2NWGiitbGvgYZnHWZziOuyM8RKAmym=AABAiiraxPTw>