Date: Thu, 27 Feb 2025 23:42:57 -0800
From: Ravi Pokala <rpokala@freebsd.org>
To: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: PCI topology-based hints
Message-ID: <60B4BAAA-E333-4219-99BE-D6C1B198E0BD@freebsd.org>

Hi folks,

Setting up device attachment hints based on PCI address is easy; it's right there in the manual (pci.4):

| DEVICE WIRING
| You can wire the device unit at a given location with device.hints.
| Entries of the form hints.<name>.<unit>.at="pci<B>:<S>:<F>" or
| hints.<name>.<unit>.at="pci<D>:<B>:<S>:<F>" will force the driver name to
| probe and attach at unit unit for any PCI device found to match the
| specification, where:
| ...
| Examples
| Given the following lines in /boot/device.hints:
| hint.nvme.3.at="pci6:0:0" hint.igb.8.at="pci14:0:0" If there is a device
| that supports igb(4) at PCI bus 14 slot 0 function 0, then it will be
| assigned igb8 for probe and attach. Likewise, if there is an nvme(4)

That's all well and good in a world without pluggable and hot-swappable devices, but things get trickier when devices can appear and disappear.

We have systems which have multiple U.2 bays, which take NVMe PCIe devices. Across multiple reboots, the <D, B, S, F> address assigned to the device in each of those bays was consistent. Great! We set up wiring hints for those devices, and confirmed that the wiring worked when devices were swapped ...

... until we added a NIC into the hot-swap OCP slot and rebooted. While things continued to work before the reboot, upon reboot, many addresses changed. It looks like the slot into which the NIC was installed is on the same segment of the bus as the U.2 bays. When that segment was enumerated, the addresses got shuffled to include the NIC.

So, we can't necessarily rely on the PCI <D, B, S, F> address. But the PCIe topology is consistent, even when devices are added and removed -- it's the physical wiring between the root complex, bridges, devices, and expansion slots. The `lspci' utility -- ubiquitous on Linux, and available via the "sysutils/pciutils" port on FreeBSD -- can show the topology.

For example, consider three NVMe devices, reported by `pciconf', and by `lspci's tree view (device details redacted):

| % pciconf -l | tr '@' ' ' | sort -V -k2 | grep nvme
| nvme2 pci0:65:0:0: ...
| nvme0 pci0:133:0:0: ...
| nvme1 pci0:137:0:0: ...
| %
| % lspci -vt | grep -C2 -E '^..-|NVMe'
| -+-[0000:00]-+-00.0 Root Complex
|  |           +-00.2 ...
|  |           +-00.3 ...
| --
|  |           +-18.6 ...
|  |           \-18.7 ...
|  +-[0000:40]-+-00.0 Root Complex
|  |           +-00.2 ...
|  |           +-00.3 ...
|  |           +-01.0 ...
|  |           +-01.1-[41]----00.0 ${VENDOR} NVMe
|  |           +-01.3-[42-43]--
|  |           +-01.4-[44-45]--
| --
|  |           |            \-00.1 ...
|  |           \-07.2 ...
|  +-[0000:80]-+-00.0 Root Complex
|  |           +-00.2 ...
|  |           +-00.3 ...
| --
|  |           +-03.0 ...
|  |           +-03.1-[83-84]--
|  |           +-03.2-[85-86]----00.0 ${VENDOR} NVMe
|  |           +-03.3-[87-88]--
|  |           +-03.4-[89-8a]----00.0 ${VENDOR} NVMe
|  |           +-04.0 ...
|  |           +-05.0 ...
| --
|  |           |            \-00.1 ...
|  |           \-07.2 ...
|  \-[0000:c0]-+-00.0 Root Complex
|              +-00.2 ...
|              +-00.3 ...

The first set of xdigits, "[0000:n0]" are a "domain" and "bus", which are only shown for the Root Complex devices. The second set of xdigits, "xy.z", are either an endpoint's "slot" and "function", or else a bridge device's (address?) and (slot?). If there is a bridge, there is a set of xdigits in brackets next to each (slot?), which becomes the "bus" of the attached endpoint, and then "xy.z", which is the endpoint's "slot" and "function".

Thus, we can see from the tree that the NVMe devices are "0000:41:00.0", "0000:85:00.0", and "0000:89:00.0". (Which, if you convert to decimal, is the same as reported by `pciconf': "pci0:65:0:0", "pci0:133:0:0", "pci0:137:0:0".) It is also apparent that the latter two devices are connected to the same bridge, which in turn is connected to a different root complex than the first device.

The problem is, depending on what devices are connected to a given root complex, the "bus" component which is associated with a bridge slot can change.
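
As an aside, the hex-to-decimal conversion between the two notations can be sketched in a few lines of Python. The function name `lspci_to_pciconf` is made up for illustration; it assumes the full "DDDD:BB:SS.F" form that `lspci' prints when the domain is included.

```python
def lspci_to_pciconf(addr: str) -> str:
    """Convert an lspci address 'DDDD:BB:SS.F' (hex fields) into a
    pciconf-style selector 'pciD:B:S:F' (decimal fields)."""
    domain, bus, slot_func = addr.split(":")
    slot, func = slot_func.split(".")
    return "pci{}:{}:{}:{}".format(
        int(domain, 16), int(bus, 16), int(slot, 16), int(func, 16)
    )


# The three NVMe devices from the tree above.
for addr in ("0000:41:00.0", "0000:85:00.0", "0000:89:00.0"):
    print(addr, "->", lspci_to_pciconf(addr))
```

For the three addresses above this prints "pci0:65:0:0", "pci0:133:0:0", and "pci0:137:0:0", matching the `pciconf' listing.
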
In the example above, with the current population of devices in the "0000:80" portion of the tree, the "bus" components associated with bridge "03" are "83", "85", "87", and "89". But add another device to "0000:80" and reboot, and the addresses associated with bridge "03" become "84", "86", "88", and "8a".

The question is this: How do I indicate that I would like a certain device unit to be wired to a specific bridge device address and slot -- which cannot change -- rather than to a specific <D, B, S, F>, where the "B" component can change?

Any thoughts?

Thanks,

Ravi (rpokala@)
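
To make the request concrete, here is a rough Python sketch of what "wiring by topology" would mean: describe a device by its root bus plus the chain of (slot, function) hops, then look up whatever bus number each bridge currently has. This is only an illustration, not an existing FreeBSD mechanism. The names `resolve' and `hint_line' and the BEFORE/AFTER tables are hypothetical, and the tables are filled in by hand from the example numbers above rather than read from a live system.

```python
# Hypothetical snapshots of "bridge -> secondary bus", copied by hand from
# the example: key = (domain, primary bus, slot, function) of the bridge,
# value = bus number assigned to the devices behind it.
BEFORE = {(0, 0x80, 0x03, 2): 0x85, (0, 0x80, 0x03, 4): 0x89}
AFTER = {(0, 0x80, 0x03, 2): 0x86, (0, 0x80, 0x03, 4): 0x8A}


def resolve(bridges, domain, root_bus, path):
    """Walk a topology path starting at root_bus.

    path is a list of (slot, function) hops; every hop except the last
    must be a bridge present in `bridges`. Returns a pciconf-style
    selector for the final hop using the bus numbers in this snapshot.
    """
    bus = root_bus
    for slot, func in path[:-1]:
        bus = bridges[(domain, bus, slot, func)]
    slot, func = path[-1]
    return "pci{}:{}:{}:{}".format(domain, bus, slot, func)


def hint_line(driver, unit, selector):
    """Format a device.hints wiring entry in the style of the pci(4) example."""
    return 'hint.{}.{}.at="{}"'.format(driver, unit, selector)


# Same physical bay (root bus 0x80, bridge 03.2, endpoint 00.0) in both cases.
path = [(0x03, 2), (0x00, 0)]
print(hint_line("nvme", 0, resolve(BEFORE, 0, 0x80, path)))
print(hint_line("nvme", 0, resolve(AFTER, 0, 0x80, path)))
```

The path stays the same while the resolved selector moves from "pci0:133:0:0" to "pci0:134:0:0", which is exactly why a hint keyed on <D, B, S, F> stops matching after the reshuffle.
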
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?60B4BAAA-E333-4219-99BE-D6C1B198E0BD>