Date: Sat, 26 Jul 2014 13:11:28 -0700
From: Adrian Chadd <adrian@freebsd.org>
To: Jeff Roberson <jeff@freebsd.org>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>, Andrew Bates <andrewbates09@gmail.com>
Subject: Re: Working on NUMA support
Message-ID: <CAJ-Vmom-wWZLCuuAEKDO1vuaGaSQM-=4e3xoh3OeVibc6m9Z8A@mail.gmail.com>
In-Reply-To: <00E55D89-BDD1-41AD-BBF6-6752B90E8324@ccsys.com>
References: <CAPi5LmkRO4QLbR2JQV8FuT=jw2jjcCRbP8jT0kj1g8Ks+7jv8A@mail.gmail.com> <CAJ-VmonJPT-NUSi=Wnu7a0oNwe8V=LQMZ-fZGriC7H44edRVLg@mail.gmail.com> <CAPi5Lm=8Z3fh_vxKY26qC3oEv1Ap+RvFGRAOhRosF5UEnDTVpw@mail.gmail.com> <00E55D89-BDD1-41AD-BBF6-6752B90E8324@ccsys.com>
Hi all!

Has there been any further progress on this? I've been working on making the receive side scaling (RSS) support usable by mere mortals, and I've reached a point where I'm going to need this awareness in the 10GE/40GE drivers for the hardware I have access to.

Right now I'm more interested in the kernel driver/allocator side of things, so:

* when bringing up a NIC, figure out which CPUs are "most local" to run on;
* for each NIC queue, figure out the "most local" bus resources for NIC resources like descriptors and packet memory (eg mbufs);
* for each NIC queue, figure out the "most local" resources for local driver structures that the NIC doesn't touch (eg per-queue state);
* for each RSS bucket, figure out the "most local" resources for things like packet memory (mbufs), tcp/udp/inp control structures, etc.

I had a chat with jhb yesterday and he reminded me that y'all at Isilon have been looking into this. He described a few interesting cases from the kernel side:

* On architectures with external IO controllers, the path cost from an IO device to multiple CPUs may be (almost) equivalent, so there's not a huge penalty for allocating things on the wrong CPU. It would still be nice to get CPU-local affinity where possible so we can fully parallelise DRAM access, but we can play with this and see.
* On architectures with CPU-integrated IO controllers, there's a large penalty for doing inter-CPU IO,
* .. but not such a huge penalty for doing inter-CPU memory access.

Given that, we may find that we should always put the IO resources local to the CPU the device is attached to, even if we decide to run some or all of the IO for the device on another CPU. Ie, any RAM that the IO device is doing data or descriptor DMA into should be local to that device. John said that in his experience the penalty for a non-local CPU touching memory was much less than the penalty for device DMA crossing QPI.
So the tricky bit is figuring all of that out and expressing it in a way that lets us do memory allocation and CPU binding in a more topology-aware fashion. The other half of this is allowing it to be easily overridden by a curious developer or system administrator who wants to experiment with different policies.

Note that I'm very specifically only addressing the low-level kernel IO / memory allocation requirements here. There are other things to worry about up in userland; I think you're trying to address that in your KPI descriptions.

Thoughts?

-a