From: John Baldwin
To: Konstantin Belousov
Cc: Adrian Chadd, freebsd-arch@freebsd.org
Subject: Re: [rfc] enumerating device / bus domain information
Date: Fri, 10 Oct 2014 16:01:31 -0400
Message-ID: <4090343.RYS6GcFkXt@ralph.baldwin.cx>
In-Reply-To: <20141010180700.GS2153@kib.kiev.ua>
References: <4435143.bthBSP8NlX@ralph.baldwin.cx> <20141010180700.GS2153@kib.kiev.ua>

On Friday, October 10, 2014 09:07:00 PM Konstantin Belousov wrote:
> On Fri, Oct 10, 2014 at 11:14:50AM -0400, John Baldwin wrote:
> > Even x86 already has a notion of multiple layers of cost. You can
> > get that today if you buy a 4-socket Intel system. It seems you
> > might also get it with a dual-socket Haswell system with more than
> > 8 cores per package (due to the funky split-brain arrangement on
> > higher-core-count Haswells). I believe AMD also ships CPUs that
> > contain 2 NUMA domains within a single physical package.
> >
> > Note that the I/O issue has become far more urgent in the past few
> > years on x86. With Nehalem/Westmere, whether I/O was remote or
> > local didn't seem to matter very much (you could only measure very
> > small differences in latency or throughput between the two
> > scenarios in my experience). On Romley (Sandy Bridge) and later it
> > can make a very substantial difference in both latency and
> > throughput.
>
> This nicely augments my note about the unsuitability of an interface
> that returns a single VM domain for a given device. I think it is
> more correct to return a bitset of the 'close enough' VM domains,
> where proximity is either explicitly specified by the caller (e.g.
> belongs to, closer than two domains, etc.) or the best bitset is
> always returned. That would solve both the split proximity domain
> issue and the multi-uplink south bridge issue.
>
> It might also make sense to add an additional object layer for the
> HW proximity domain, which contains some set of VM domains, and have
> the function return such a HW proximity domain.

I know Jeff has some sort of structure he wants to use for describing
NUMA policies. Perhaps that is something that can be reused.
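For concreteness, a bitset-of-domains query along the lines Konstantin
describes might look roughly like the sketch below. Every name here is
invented for illustration (this is not an existing new-bus method);
only the bitset machinery from <sys/bitset.h> is real:

/*
 * Sketch only: return the set of "close enough" VM domains for a
 * device rather than a single domain.  All names below other than
 * the bitset macros are hypothetical.
 */
#include <sys/param.h>
#include <sys/_bitset.h>
#include <sys/bitset.h>
#include <sys/bus.h>

#define	VM_DOMAIN_SETSIZE	64
BITSET_DEFINE(vm_domainset, VM_DOMAIN_SETSIZE);

/* Caller-specified proximity constraint. */
enum bus_domain_prox {
	BUS_DOMAIN_LOCAL,	/* only domains the device attaches to */
	BUS_DOMAIN_NEAR,	/* local domains plus one hop away */
	BUS_DOMAIN_ANY		/* best available set */
};

/*
 * Fill *mask with the VM domains considered close enough to dev
 * under prox; return ENODEV when the platform provides no locality
 * information.
 */
int	bus_get_domain_set(device_t dev, enum bus_domain_prox prox,
	    struct vm_domainset *mask);

A bitset rather than a single integer lets the same call cover both
the split proximity domain case and a device with uplinks into more
than one domain.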
However, we probably need to be further down the road to see what we
actually need for our final interface here. In particular, I suspect
we will have an orthogonal set of APIs to deal with CPU locality
(i.e., "give me a cpuset of all CPUs in domain X or close to domain
X", etc.). Inasmuch as these requests are not bus-specific, I'd rather
have drivers use those APIs than have everything go through new-bus.
(That way, for example, a multiqueue NIC driver could bind its queues
to CPUs in the NUMA domain the device is in, rather than always using
CPUs 0..N, which is what all the Intel drivers currently do.
Variations of this could also allow for more intelligent requests like
"give me all CPUs close to N that are suitable for interrupts", which
might include only one SMT thread per core. A sketch of such a binding
follows below.)

Also, this is orthogonal to overloading the term "VM domain" to mean
something that is a subset of a given NUMA domain. Regardless, I think
it probably makes sense to use a different term to describe more
finely-grained partitions of NUMA domains.

-- 
John Baldwin
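[The queue-binding use case above might look roughly like this in a
driver. cpu_get_domain_set() and bus_get_domain() are hypothetical
names invented for this sketch; bus_bind_intr(9) is the existing
interface for binding an allocated interrupt resource to a CPU.]

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/cpuset.h>

/* Hypothetical: fill *mask with the CPUs belonging to VM domain. */
int	cpu_get_domain_set(int domain, cpuset_t *mask);
/* Hypothetical: return the VM domain the device is closest to. */
int	bus_get_domain(device_t dev, int *domain);

/*
 * Bind a NIC's per-queue interrupts to CPUs in the device's own NUMA
 * domain instead of always starting at CPU 0.
 */
static void
nic_bind_queues(device_t dev, struct resource **irq_res, int nqueues)
{
	cpuset_t mask;
	int cpu, domain, q;

	if (bus_get_domain(dev, &domain) != 0 ||
	    cpu_get_domain_set(domain, &mask) != 0 ||
	    CPU_EMPTY(&mask))
		return;		/* Fall back to default placement. */

	cpu = CPU_FFS(&mask) - 1;
	for (q = 0; q < nqueues; q++) {
		(void)bus_bind_intr(dev, irq_res[q], cpu);
		/* Advance to the next CPU in the domain, wrapping. */
		do {
			cpu = (cpu + 1) % CPU_SETSIZE;
		} while (!CPU_ISSET(cpu, &mask));
	}
}

A "CPUs suitable for interrupts" variant would simply have the
locality query return a mask already filtered down to, say, one SMT
thread per core, with no change to the driver-side loop.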