Date: Sat, 01 Mar 2025 14:19:44 -0800
From: Ravi Pokala <rpokala@freebsd.org>
To: Warner Losh <imp@bsdimp.com>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: PCI topology-based hints
Message-ID: <94A47C59-46D7-40E3-B680-8364558BF623@panasas.com>
In-Reply-To: <CANCZdfp2NWGiitbGvgYZnHWZziOuyM8RKAmym=AABAiiraxPTw@mail.gmail.com>
References: <60B4BAAA-E333-4219-99BE-D6C1B198E0BD@freebsd.org> <CANCZdfp2NWGiitbGvgYZnHWZziOuyM8RKAmym=AABAiiraxPTw@mail.gmail.com>

> Yes. You can use what's already there, but maybe not documented or is at the very least underdocumented. You can wire devices to the UEFI path, which is guaranteed to be unique and avoid all these problems.
>
> hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"
>
> Which is on pcie root complex 2, then follow device 1 function 1 on that bus to device 0 function 0 on the second zero. `devctl getpath UEFI nvme0` will do all the heavy lifting for you. TaDa! No bus numbers.
>
> I added this several years ago to solve exactly this problem, or what happens when you lose a riser card, etc.
>
> Warner

Sweet! Thanks Warner, that’s exactly what I’m looking for. :-)

You’re right that it’s under-documented. I think it should be relatively easy to find a list of buses which support wiring; I think this search should find them:

| grep -Erl 'DEVMETHOD.*hint' /usr/src/sys

And then make sure that the bus’ manpage describes the hinting mechanism, and add cross-refs between the bus’ manpage and device.hints.5

If that sounds right, I’ll see if I can find some time to do that in the near future.

Thanks again!

-Ravi (rpokala@)


From: Warner Losh <imp@bsdimp.com>
Date: Saturday, March 1, 2025 at 13:23
To: Ravi Pokala <rpokala@freebsd.org>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: PCI topology-based hints

On Fri, Feb 28, 2025 at 12:43 AM Ravi Pokala <rpokala@freebsd.org> wrote:

Hi folks,

Setting up device attachment hints based on PCI address is easy; it's right there in the manual (pci.4):

| DEVICE WIRING
|      You can wire the device unit at a given location with device.hints.
|      Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
|      hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver name to
|      probe and attach at unit unit for any PCI device found to match the
|      specification, where:
| ...
|    Examples
|      Given the following lines in /boot/device.hints:
|      hint.nvme.3.at="pci6:0:0" hint.igb.8.at="pci14:0:0" If there is a device
|      that supports igb(4) at PCI bus 14 slot 0 function 0, then it will be
|      assigned igb8 for probe and attach.  Likewise, if there is an nvme(4)

That's all well and good in a world without pluggable and hot-swappable devices, but things get trickier when devices can appear and disappear.

We have systems which have multiple U.2 bays, which take NVMe PCIe devices. Across multiple reboots, the <D, B, S, F> address assigned to the device in each of those bays was consistent. Great! We set up wiring hints for those devices, and confirmed that the wiring worked when devices were swapped ...

... until we added a NIC into the hot-swap OCP slot and rebooted.

While things continued to work before the reboot, upon reboot, many addresses changed. It looks like the slot into which the NIC was installed is on the same segment of the bus as the U.2 bays. When that segment was enumerated, the addresses got shuffled to include the NIC.

So, we can't necessarily rely on the PCI <D, B, S, F> address.
But the PCIe topology is consistent, even when devices are added and removed -- it's the physical wiring between the root complex, bridges, devices, and expansion slots.

The `lspci' utility -- ubiquitous on Linux, and available via the "sysutils/pciutils" port on FreeBSD -- can show the topology. For example, consider three NVMe devices, reported by `pciconf', and by `lspci's tree view (device details redacted):

| % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
| nvme2 pci0:65:0:0: ...
| nvme0 pci0:133:0:0: ...
| nvme1 pci0:137:0:0: ...
| %
| % lspci -vt | grep -C2 -E '^..-|NVMe'
| -+-[0000:00]-+-00.0  Root Complex
|  |           +-00.2  ...
|  |           +-00.3  ...
| --
|  |           +-18.6  ...
|  |           \-18.7  ...
|  +-[0000:40]-+-00.0  Root Complex
|  |           +-00.2  ...
|  |           +-00.3  ...
|  |           +-01.0  ...
|  |           +-01.1-[41]----00.0  ${VENDOR} NVMe
|  |           +-01.3-[42-43]--
|  |           +-01.4-[44-45]--
| --
|  |           |            \-00.1  ...
|  |           \-07.2  ...
|  +-[0000:80]-+-00.0  Root Complex
|  |           +-00.2  ...
|  |           +-00.3  ...
| --
|  |           +-03.0  ...
|  |           +-03.1-[83-84]--
|  |           +-03.2-[85-86]----00.0  ${VENDOR} NVMe
|  |           +-03.3-[87-88]--
|  |           +-03.4-[89-8a]----00.0  ${VENDOR} NVMe
|  |           +-04.0  ...
|  |           +-05.0  ...
| --
|  |           |            \-00.1  ...
|  |           \-07.2  ...
|  \-[0000:c0]-+-00.0  Root Complex
|              +-00.2  ...
|              +-00.3  ...

The first set of xdigits, "[0000:n0]", are a "domain" and "bus", which are only shown for the Root Complex devices. The second set of xdigits, "xy.z", are either an endpoint's "slot" and "function", or else a bridge device's (address?) and (slot?). If there is a bridge, there is a set of xdigits in brackets next to each (slot?), which becomes the "bus" of the attached endpoint, and then "xy.z", which is the endpoint's "slot" and "function".

Thus, we can see from the tree that the NVMe devices are "0000:41:00.0", "0000:85:00.0", and "0000:89:00.0". (Which, if you convert to decimal, is the same as reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0", "pci0:137:0:0".)
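
The hex-to-decimal conversion noted above can be sketched in POSIX shell. This is only an illustration, not part of pciutils or FreeBSD: the function name `lspci_to_pciconf` is hypothetical, and it assumes the input is always a full lspci-style "DDDD:BB:SS.F" address.

```shell
#!/bin/sh
# Hypothetical helper (not a real utility): convert an lspci-style
# hexadecimal address "DDDD:BB:SS.F" into the decimal "pciD:B:S:F"
# form that pciconf prints.
lspci_to_pciconf() {
    addr=$1
    domain=${addr%%:*}    # text before the first ':'   -> "0000"
    rest=${addr#*:}       # text after the first ':'    -> "41:00.0"
    bus=${rest%%:*}       # bus number in hex           -> "41"
    rest=${rest#*:}       # slot and function           -> "00.0"
    slot=${rest%%.*}      # slot in hex                 -> "00"
    func=${rest#*.}       # function in hex             -> "0"
    # printf's %d accepts C-style hex constants, so prefix each field with 0x.
    printf 'pci%d:%d:%d:%d\n' "0x$domain" "0x$bus" "0x$slot" "0x$func"
}

# The three NVMe addresses read off the lspci tree above.
lspci_to_pciconf 0000:41:00.0
lspci_to_pciconf 0000:85:00.0
lspci_to_pciconf 0000:89:00.0
```

Run against the three addresses from the tree, this prints the same selectors that `pciconf' reported.
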
It is also apparent that the latter two devices are connected to the same bridge, which in turn is connected to a different root complex than the first device.

The problem is, depending on what devices are connected to a given root complex, the "bus" component which is associated with a bridge slot can change. In the example above, with the current population of devices in the "0000:80" portion of the tree, the "bus" components associated with bridge "03" are "83", "85", "87", and "89". But add another device to "0000:80" and reboot, and the addresses associated with bridge "03" become "84", "86", "88", and "8a".

The question is this: How do I indicate that I would like a certain device unit to be wired to a specific bridge device address and slot -- which cannot change -- rather than to a specific <D, B, S, F>, where the "B" component can change?

Any thoughts?

Yes. You can use what's already there, but maybe not documented or is at the very least underdocumented. You can wire devices to the UEFI path, which is guaranteed to be unique and avoid all these problems.

hint.nvme.77.at="UEFI:PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)"

Which is on pcie root complex 2, then follow device 1 function 1 on that bus to device 0 function 0 on the second zero. `devctl getpath UEFI nvme0` will do all the heavy lifting for you. TaDa! No bus numbers.

I added this several years ago to solve exactly this problem, or what happens when you lose a riser card, etc.

Warner
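
As a rough sketch of turning a UEFI device path into a device.hints entry of the form shown in the message: the helper below, `uefi_hint` (a hypothetical name, not an existing tool), formats a hint line from a driver name, a unit number, and a path. On a live system the path would come from `devctl getpath UEFI <device>`. The sketch assumes that command prints the bare path, and it strips a leading "UEFI:" in case it does not; check the actual output on your system.

```shell
#!/bin/sh
# Hypothetical helper (not a real utility): print a device.hints line
# that wires <driver><unit> to a UEFI device path.
uefi_hint() {
    driver=$1
    unit=$2
    path=${3#UEFI:}    # tolerate a leading "UEFI:" if one is present
    printf 'hint.%s.%s.at="UEFI:%s"\n' "$driver" "$unit" "$path"
}

# Path copied from the example in the message above.
uefi_hint nvme 77 'PcieRoot(2)/Pci(0x1,0x1)/Pci(0x0,0x0)'

# On a FreeBSD system, under the output-format assumption stated above:
#   uefi_hint nvme 77 "$(devctl getpath UEFI nvme0)" >> /boot/device.hints
```

Keeping the formatting in one function means the unit number is chosen explicitly by the administrator, while the topology part is taken verbatim from the system rather than typed by hand.
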
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?94A47C59-46D7-40E3-B680-8364558BF623>