Date: Fri, 10 Oct 2014 11:14:50 -0400
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Cc: Adrian Chadd <adrian@freebsd.org>
Subject: Re: [rfc] enumerating device / bus domain information
Message-ID: <4435143.bthBSP8NlX@ralph.baldwin.cx>
In-Reply-To: <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com>
References: <CAJ-VmokF7Ey0fxaQ7EMBJpCbgFnyOteiL2497Z4AFovc%2BQRkTA@mail.gmail.com> <CAJ-VmonbGW1JbEiKXJ0sQCFr0%2BCRphVrSuBhFnh1gq6-X1CFdQ@mail.gmail.com> <838B58B2-22D6-4AA4-97D5-62E87101F234@bsdimp.com>
On Thursday, October 09, 2014 09:53:52 PM Warner Losh wrote:
> On Oct 8, 2014, at 5:12 PM, Adrian Chadd <adrian@FreeBSD.org> wrote:
> > On 8 October 2014 12:07, Warner Losh <imp@bsdimp.com> wrote:
> >> On Oct 7, 2014, at 7:37 PM, Adrian Chadd <adrian@FreeBSD.org> wrote:
> >>> Hi,
> >>>
> >>> Right now we're not enumerating any NUMA domain information about
> >>> devices.
> >>>
> >>> The more recent intel NUMA stuff has some extra affinity information
> >>> for devices that (eventually) will allow us to bind kernel/user
> >>> threads and/or memory allocation to devices to keep access local.
> >>> There's a penalty for DMAing in/out of remote memory, so we'll want
> >>> to figure out what counts as "local" for memory allocation and
> >>> perhaps constrain the CPU set that worker threads for a device run
> >>> on.
> >>>
> >>> This patch adds a few things:
> >>>
> >>> * it adds a bus_if.m method for fetching the VM domain ID of a given
> >>>   device; or ENOENT if it's not in a VM domain;
> >>
> >> Maybe a default VM domain. All devices are in VM domains :) By default
> >> today, we have only one VM domain, and that's the model that most of
> >> the code expects...
> >
> > Right, and that doesn't change until you compile in with num domains > 1.
>
> The first part of the statement doesn't change when the number of domains
> is more than one. All devices are in a VM domain.
>
> > Then, CPUs and memory have VM domains, but devices may or may not have
> > a VM domain. There's no "default" VM domain defined if num domains > 1.
>
> Please explain how a device cannot have a VM domain? For the terminology
> I'm familiar with, to even get cycles to the device, you have to have a
> memory address (or an I/O port). That memory address has to necessarily
> map to some domain, even if that domain is equally sucky to get to from
> all CPUs (as is the case with I/O ports). While there may not be a
> "default" domain, by virtue of its physical location it has to have one.
>
> > The devices themselves don't know about VM domains right now, so
> > there's nothing constraining things like IRQ routing, CPU set, memory
> > allocation, etc. The Isilon team is working on extending the cpuset
> > and allocators to "know" about NUMA and I'm sure this stuff will fall
> > out of whatever they're working on.
>
> Why would the device need to know the domain? Why aren't the IRQs, for
> example, steered to the appropriate CPU? Why doesn't the bus handle
> allocating memory for it in the appropriate place? How does this "domain"
> tie into memory allocation and thread creation?

Because that's not what you always want (though it often is). However,
another reason is that system administrators want to know which domain
devices are close to. You can sort of figure it out from devinfo on a
modern x86 machine if you squint right, but it isn't super obvious. I have
a followup patch that adds a new per-device '%domain' sysctl node so that
it is easier to see which domain a device is close to. In real-world
experience this can be useful as it lets a sysadmin/developer know which
CPUs to schedule processes on. (Note that it doesn't always mean you put
them close to the device. Sometimes you have processes that are more
important than others, so you tie those close to the NIC and shove the
other ones over to the "wrong" domain because you don't care if they have
higher latency.)
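
(Purely as an illustrative sketch, not code from the actual patch: the
driver name, the softc layout, and the bus_get_domain() accessor below are
made-up stand-ins for the bus_if.m method being discussed, just to show how
a driver might consume it in attach.)

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/kernel.h>
#include <sys/module.h>

struct mydriver_softc {
        device_t        sc_dev;
        int             sc_domain;      /* NUMA domain, or -1 if unknown */
};

static int
mydriver_attach(device_t dev)
{
        struct mydriver_softc *sc = device_get_softc(dev);
        int domain;

        sc->sc_dev = dev;

        /*
         * Ask the parent bus which VM domain this device is close to.
         * bus_get_domain() stands in for the proposed bus_if.m method;
         * ENOENT means the bus has no domain information for the device.
         */
        if (bus_get_domain(dev, &domain) == 0)
                sc->sc_domain = domain;
        else
                sc->sc_domain = -1;

        /*
         * With a known domain the driver (or the bus) could allocate its
         * descriptor rings from that domain's memory and pin worker
         * threads or interrupt handlers to CPUs in the same domain; with
         * -1 it would fall back to the existing domain-agnostic paths.
         */
        return (0);
}

Assuming the followup sysctl patch lands, the same information would
presumably be visible from userland as a dev.<driver>.<unit>.%domain node,
which is enough for a sysadmin to pick a CPU set without reverse-engineering
devinfo output.
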
> > So when I go to add sysctl and other tree knowledge for device -> vm
> > domain mapping I'm going to make them return -1 for "no domain."
>
> Seems like there's too many things lumped together here. First off, how
> can there be no domain. That just hurts my brain. It has to be in some
> domain, or it can't be seen. Maybe this domain is one that sucks for
> everybody to access, maybe it is one that's fast for some CPU or package
> of CPUs to access, but it has to have a domain.

They are not always tied to a single NUMA domain. On some dual-socket
Nehalem/Westmere class machines with per-CPU memory controllers (so 2 NUMA
domains) you will have a single I/O hub that is directly connected to both
CPUs. Thus, all memory in the system is equidistant for I/O (but not for
CPU access).

The other problem is that you simply may not know. Not all BIOSes correctly
communicate this information for devices. For example, certain 1U Romley
servers I have worked with properly enumerate CPU <--> memory relationships
in the SRAT table, but they fail to include the necessary _PXM method in
the top-level PCI bus devices (that correspond to the I/O hub). In that
case, returning a domain of 0 may very well be wrong. (In fact, for these
particular machines it mostly _is_ wrong as the expansion slots are all
tied to NUMA domain 1, not 0.)

> > (Things will get pretty hilarious later on if we have devices that are
> > "local" to two or more VM domains ..)
>
> Well, devices aren't local to domains, per se. Devices can communicate
> with other components in a system at a given cost. One NUMA model is
> "near" vs "far" where a single near domain exists and all the "far"
> resources are quite costly. Other NUMA models may have a wider range of
> costs so that some resources are cheap, others are a little less cheap,
> while others are downright expensive depending on how far across the
> fabric of interconnects the messages need to travel. While one can model
> this as a full 1-1 partitioning, that doesn't match all of the extant
> implementations, even today. It is easy, but an imperfect match to the
> underlying realities in many cases (though a very good match to x86,
> which is mostly what we care about).

Even x86 already has a notion of multiple layers of cost. You can get that
today if you buy a 4-socket Intel system. It seems you might also get that
if you get a dual-socket Haswell system with more than 8 cores per package
(due to the funky split-brain thing on higher core count Haswells). I
believe AMD also ships CPUs that contain 2 NUMA domains within a single
physical package as well.

Note that the I/O thing has become far more urgent in the past few years on
x86. With Nehalem/Westmere, having I/O be remote or local didn't seem to
matter very much (you could only measure very small differences in latency
or throughput between the two scenarios in my experience). On Romley (Sandy
Bridge) and later it can be a very substantial difference in terms of both
latency and throughput.

-- 
John Baldwin
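
P.S. A rough sketch of the ACPI side of the lookup, again not the real
code: example_bus_get_domain() is an invented name, and a real
implementation would map the _PXM proximity id through the SRAT-derived
table rather than assuming it equals the VM domain id. It only illustrates
why ENOENT is the honest answer when firmware like the Romley BIOS above
never supplies _PXM.

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/errno.h>

#include <contrib/dev/acpica/include/acpi.h>
#include <dev/acpica/acpivar.h>

static int
example_bus_get_domain(device_t bus, device_t child, int *domain)
{
        ACPI_HANDLE handle, parent;
        int pxm;

        /* Walk from the child's ACPI node toward the namespace root. */
        for (handle = acpi_get_handle(child); handle != NULL;
            handle = ACPI_SUCCESS(AcpiGetParent(handle, &parent)) ?
            parent : NULL) {
                /* _PXM names the SRAT proximity domain for this node. */
                if (ACPI_SUCCESS(acpi_GetInteger(handle, "_PXM", &pxm))) {
                        *domain = pxm;  /* assumes proximity id == VM domain */
                        return (0);
                }
        }

        /*
         * No _PXM anywhere on the path to the root (the broken-BIOS case
         * above): report "don't know" rather than guessing domain 0.
         */
        return (ENOENT);
}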