Date:      Wed, 14 Apr 2021 21:57:13 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Warner Losh <imp@bsdimp.com>
Cc:        "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>
Subject:   Re: New device wiring option
Message-ID:  <YHc7CVko/n7C94IG@kib.kiev.ua>
In-Reply-To: <CANCZdfqVbqX1hGVPAwjm%2BaCfhA5t7Z=UajydNzV1gnUmdHVWOw@mail.gmail.com>
References:  <CANCZdfqVbqX1hGVPAwjm%2BaCfhA5t7Z=UajydNzV1gnUmdHVWOw@mail.gmail.com>

On Wed, Apr 14, 2021 at 12:35:48PM -0600, Warner Losh wrote:
> Today, one can wire a PCI device like so:
> 
> hint.nvme.3.at="pci0:7:0:0"
> 
> to wire an instance to a unit number. This works well when you have a
> relatively static configuration.
> 
> However, if you have a number of carrier cards that have a bunch of
> storage, then you have a situation where you are wiring things like so:
> 
> hint.nvme.0.at="pci0:29:0:0"          # card 0 in carrier 1
> hint.nvme.4.at="pci0:30:0:0"          # card 1 in carrier 1
> hint.nvme.2.at="pci0:31:0:0"          # card 2 in carrier 1
> hint.nvme.3.at="pci0:32:0:0"          # card 3 in carrier 1
> hint.nvme.1.at="pci0:185:0:0"        # card 0 in carrier 2
> hint.nvme.5.at="pci0:186:0:0"        # card 1 in carrier 2
> hint.nvme.6.at="pci0:187:0:0"        # card 2 in carrier 2
> hint.nvme.7.at="pci0:188:0:0"        # card 3 in carrier 2
> 
> where the bus numbers are stable from boot to boot... unless one of the
> carrier cards isn't present, in which case the numbers change a bit, which
> moves the nvme unit numbers around. So if carrier 1 goes away, the PCI bus
> numbers on the second one may be 183, 184, 185, 186 so nvme1 becomes nvme8,
> nvme5 becomes nvme9, nvme6 becomes nvme1 and nvme7 becomes nvme5. In our
> application, this renumbering is undesirable. One might argue the
> application shouldn't care about the numbering, but we have one that does
> in a way that makes that knowledge and dependency tricky to remove.
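
To make the renumbering concrete, here is a rough Python sketch of the unit
assignment described above. It is not the newbus code. It assumes that a
device on a hinted bus gets the wired unit and that any other device gets the
lowest unit that no hint names, which is what the nvme8/nvme9 outcome in the
example implies. The name assign_units is made up for illustration.

```python
# Hints from the example above: PCI bus number -> wired nvme unit.
HINTS = {29: 0, 30: 4, 31: 2, 32: 3, 185: 1, 186: 5, 187: 6, 188: 7}


def assign_units(buses, hints=HINTS):
    """Return {bus: unit} for NVMe devices found at the given bus numbers."""
    reserved = set(hints.values())
    units = {}
    next_free = 0
    for bus in sorted(buses):
        if bus in hints:
            # A hint names this bus, so the device gets the wired unit.
            units[bus] = hints[bus]
            continue
        # Assumption: an unwired device gets the lowest unit no hint names.
        while next_free in reserved:
            next_free += 1
        units[bus] = next_free
        next_free += 1
    return units


# Both carriers present: every device lands on its wired unit.
both = assign_units([29, 30, 31, 32, 185, 186, 187, 188])

# Carrier 1 absent: carrier 2 now enumerates at buses 183..186.
only_carrier2 = assign_units([183, 184, 185, 186])
```

With carrier 1 absent, the cards that used to be nvme1, nvme5, nvme6 and
nvme7 come up as units 8, 9, 1 and 5, matching the description above.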
> 
> Fortunately, UEFI has solved this problem with their device paths. UEFI
> device paths are completely independent of PCI bus numbering, and other
> items that are the arbitrary choice of the OS and/or the firmware booting
> the system.
> 
> On a UEFI system, you might see paths more like the following for the above
> devices:
> 
> PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x0,0x0)
> PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x1,0x0)
> PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x2,0x0)
> PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x3,0x0)
> PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x0,0x0)/Pci(0x0,0x0)
> PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)
> PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)
> PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x3,0x0)/Pci(0x0,0x0)
> 
> and if the first carrier card goes away, the path to the second one is
> still the same. So one way out of this issue is to change the numbering to
> be something more like:
> 
> hint.nvme.0.at="uefi:PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x0,0x0)"
> hint.nvme.4.at="uefi:PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x1,0x0)"
> hint.nvme.2.at="uefi:PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x2,0x0)"
> hint.nvme.3.at="uefi:PciRoot(0x1)/Pci(0x1,0x1)/Pci(0x3,0x0)"
> hint.nvme.1.at="uefi:PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x0,0x0)/Pci(0x0,0x0)"
> hint.nvme.5.at="uefi:PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x1,0x0)/Pci(0x0,0x0)"
> hint.nvme.6.at="uefi:PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x2,0x0)/Pci(0x0,0x0)"
> hint.nvme.7.at="uefi:PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x3,0x0)/Pci(0x0,0x0)"
> 
> which would solve the problem nicely (of course with a special case for
> paths starting with "pci" for those use cases where that might still make
> sense).
> 
> I've started work on implementing this for PCI, and am looking for feedback
> before I get too far down that path. I plan on making these case
> insensitive because different UEFI tools render paths differently.
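
One way the case-insensitive comparison could work is to normalize each path
into a tuple of nodes before comparing. The Python sketch below is only an
illustration, not the proposed kernel KPI: the helper names (parse_path,
same_device) are hypothetical, and it assumes that node names and hex
spelling are the only things that differ between tools.

```python
import re

# One node of a textual device path, e.g. "Pci(0x1,0x3)".
_NODE = re.compile(r"^([A-Za-z][A-Za-z0-9]*)\((.*)\)$")


def _norm(arg):
    """Lower-case an argument and turn it into an int when it is a number."""
    arg = arg.strip().lower()
    try:
        return int(arg, 0)  # accepts both "0x1" and "1"
    except ValueError:
        return arg  # GUIDs, EUI-64 values, "gpt", ...


def parse_path(text):
    """Parse "PciRoot(0x1)/Pci(0x1,0x1)" into a tuple of (name, args) nodes."""
    nodes = []
    for part in text.strip().split("/"):
        m = _NODE.match(part.strip())
        if m is None:
            raise ValueError(f"not a device path node: {part!r}")
        name = m.group(1).lower()
        args = tuple(_norm(a) for a in m.group(2).split(","))
        nodes.append((name, args))
    return tuple(nodes)


def same_device(hint_value, device_path):
    """Compare a "uefi:..." hint value with a device's path, ignoring case."""
    scheme, _, path = hint_value.partition(":")
    return scheme.lower() == "uefi" and parse_path(path) == parse_path(device_path)
```

With this normalization, "uefi:PciRoot(0x2)/Pci(0x1,0x3)" and a device
reporting "pciroot(0X2)/PCI(1,3)" compare equal, while a "pci0:7:0:0" style
hint is left for the existing code path.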
> 
> One could take this further, of course. The full UEFI path to a
> partition on one of these devices is:
> PciRoot(0x2)/Pci(0x1,0x3)/Pci(0x0,0x0)/Pci(0x3,0x0)/Pci(0x0,0x0)/NVMe(0x1,A2-19-48-44-8B-44-1B-00)/HD(9,GPT,0F8518D9-2DE5-11E8-B5F1-3CFDFE9D5250,0x430,0x19000)
> so constructs like the following might make sense:
> 
> hint.nda.7.at="uefi:NVMe(0x1,A2-19-48-44-8B-44-1B-00)"
> hint.ada.44.at="uefi:Sata(0x0,0xFFFF,0x0)"
> 
> for wiring up CAM devices. However, while these extra uses would be nice,
> supporting them is beyond the scope of the initial work (though hopefully
> the initial work would make enabling these later easier). I plan on
> implementing a generic locator KPI for this, but will focus on only the
> uefi and newbus locators initially. Later acpi, ofw, fdt and other location
> mechanisms can be added. The uefi path stuff, btw, does not require that
> the system boot using UEFI.
> 
> So I'm writing today to solicit feedback on this approach. John Baldwin has
> already offered some advice to structure this as a generic locator and to
> have some newbus integration, but to also think about the larger picture.
> I'm still working on the details about how to make the locators generic
> enough to be widely useful to other locators, but also specific enough to
> deal with the variations between these different systems.

DMAR (Intel x86 IOMMU) has a similar issue: some configuration details
for DMA and MSI remapping require specifying PCIe bridges and PCIe
devices in a way that is invariant against bus renumbering and hot-plug.
The DMAR tables use paths from root ports through bridges down to the
target. This is encoded in the binary structures of the ACPI DMAR table;
see the VT-d specification.
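
As an illustration of why such a path survives renumbering, here is a small
Python sketch that resolves a chain of (device, function) hops against a
simulated bus topology. It assumes, as I read the VT-d device scope layout, a
start bus number followed by (device, function) pairs, where every hop but
the last is a bridge whose secondary bus number selects the next bus. The
root bus 128, the intermediate bus numbers and the name resolve are made up
for the example.

```python
def resolve(start_bus, hops, secondary_bus):
    """Walk (device, function) hops from start_bus; return (bus, dev, func).

    secondary_bus maps a bridge's (bus, dev, func) to the bus behind it.
    """
    bus = start_bus
    for dev, func in hops[:-1]:
        # Every hop except the last must be a bridge; follow it downstream.
        bus = secondary_bus[(bus, dev, func)]
    dev, func = hops[-1]
    return (bus, dev, func)


# Hops for card 3 in carrier 2, taken from the Pci(...) nodes quoted above.
hops = [(1, 3), (0, 0), (3, 0), (0, 0)]

# Hypothetical enumeration with both carriers present.
with_carrier1 = {(128, 1, 3): 183, (183, 0, 0): 184, (184, 3, 0): 188}

# Hypothetical enumeration with carrier 1 removed: every bus shifts by two.
without_carrier1 = {(128, 1, 3): 181, (181, 0, 0): 182, (182, 3, 0): 186}

before = resolve(128, hops, with_carrier1)     # device sits on bus 188
after = resolve(128, hops, without_carrier1)   # same device, now on bus 186
```

The hops never change; only the bus numbers read from the bridges along the
way do, so the same path finds the same device in both enumerations.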

Moreover, some devices that need DMAR-related configuration are not PCIe
devices, but still generate DMA and MSI interrupts.  For them, the DMAR
table uses ACPI ANDD (ACPI Namespace Device Declaration).  In practice it
is used for devices behind the LPC bridge.

So having such a way to locate devices would also be useful for DMAR
tweaking and bhyve pass-through configs.



