From: John Baldwin
To: Konstantin Belousov
Cc: Adrian Chadd, freebsd-arch@freebsd.org
Subject: Re: [rfc] enumerating device / bus domain information
Date: Fri, 10 Oct 2014 16:01:31 -0400
Message-ID: <4090343.RYS6GcFkXt@ralph.baldwin.cx>
In-Reply-To: <20141010180700.GS2153@kib.kiev.ua>
References: <4435143.bthBSP8NlX@ralph.baldwin.cx> <20141010180700.GS2153@kib.kiev.ua>

On Friday, October 10, 2014 09:07:00 PM Konstantin Belousov wrote:
> On Fri, Oct 10, 2014 at 11:14:50AM -0400, John Baldwin wrote:
> > Even x86 already has a notion of multiple layers of cost. You can
> > get that today if you buy a 4-socket Intel system. It seems you
> > might also get it with a dual-socket Haswell system with more than
> > 8 cores per package (due to the funky split-brain arrangement on
> > higher-core-count Haswells). I believe AMD also ships CPUs that
> > contain 2 NUMA domains within a single physical package.
> >
> > Note that the I/O issue has become far more urgent in the past few
> > years on x86. With Nehalem/Westmere, whether I/O was remote or
> > local didn't seem to matter very much (you could only measure very
> > small differences in latency or throughput between the two
> > scenarios in my experience). On Romley (Sandy Bridge) and later it
> > can make a very substantial difference in both latency and
> > throughput.
>
> This nicely augments my note about the unsuitability of an interface
> that returns a single VM domain for a given device. I think it is
> more correct to return a bitset of the 'close enough' VM domains,
> where proximity is either explicitly specified by the caller (e.g.
> belongs to, closer than two domains, etc.) or the best bitset is
> always returned. That would solve both the split proximity domain
> issue and the multi-uplink south bridge issue.
>
> It might also make sense to add an additional object layer for the
> HW proximity domain, which contains some set of VM domains, and have
> the function return such a HW proximity domain.

I know Jeff has some sort of structure he wants to use for describing
NUMA policies. Perhaps that is something that can be reused.
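For concreteness, a bitset-of-domains query along the lines Konstantin
describes might look roughly like the sketch below. Every name here is
invented for illustration (this is not an existing new-bus method);
only the bitset machinery from <sys/bitset.h> is real:

/*
 * Sketch only: return the set of "close enough" VM domains for a
 * device rather than a single domain.  All names below other than
 * the bitset macros are hypothetical.
 */
#include <sys/param.h>
#include <sys/_bitset.h>
#include <sys/bitset.h>
#include <sys/bus.h>

#define	VM_DOMAIN_SETSIZE	64
BITSET_DEFINE(vm_domainset, VM_DOMAIN_SETSIZE);

/* Caller-specified proximity constraint. */
enum bus_domain_prox {
	BUS_DOMAIN_LOCAL,	/* only domains the device attaches to */
	BUS_DOMAIN_NEAR,	/* local domains plus one hop away */
	BUS_DOMAIN_ANY		/* best available set */
};

/*
 * Fill *mask with the VM domains considered close enough to dev
 * under prox; return ENODEV when the platform provides no locality
 * information.
 */
int	bus_get_domain_set(device_t dev, enum bus_domain_prox prox,
	    struct vm_domainset *mask);

A bitset rather than a single integer lets the same call cover both
the split proximity domain case and a device with uplinks into more
than one domain.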
However, we probably need to be further down the road to see what we
actually need for our final interface here. In particular, I suspect
we will have an orthogonal set of APIs to deal with CPU locality
(i.e., "give me a cpuset of all CPUs in domain X or close to domain
X", etc.). Inasmuch as these requests are not bus-specific, I'd rather
have drivers use those APIs than have everything go through new-bus.
(That way, for example, a multiqueue NIC driver could bind its queues
to CPUs in the NUMA domain the device is in, rather than always using
CPUs 0..N, which is what all the Intel drivers currently do.
Variations of this could also allow for more intelligent requests like
"give me all CPUs close to N that are suitable for interrupts", which
might include only one SMT thread per core. A sketch of such a binding
follows below.)

Also, this is orthogonal to overloading the term "VM domain" to mean
something that is a subset of a given NUMA domain. Regardless, I think
it probably makes sense to use a different term to describe more
finely-grained partitions of NUMA domains.

-- 
John Baldwin
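[The queue-binding use case above might look roughly like this in a
driver. cpu_get_domain_set() and bus_get_domain() are hypothetical
names invented for this sketch; bus_bind_intr(9) is the existing
interface for binding an allocated interrupt resource to a CPU.]

#include <sys/param.h>
#include <sys/bus.h>
#include <sys/cpuset.h>

/* Hypothetical: fill *mask with the CPUs belonging to VM domain. */
int	cpu_get_domain_set(int domain, cpuset_t *mask);
/* Hypothetical: return the VM domain the device is closest to. */
int	bus_get_domain(device_t dev, int *domain);

/*
 * Bind a NIC's per-queue interrupts to CPUs in the device's own NUMA
 * domain instead of always starting at CPU 0.
 */
static void
nic_bind_queues(device_t dev, struct resource **irq_res, int nqueues)
{
	cpuset_t mask;
	int cpu, domain, q;

	if (bus_get_domain(dev, &domain) != 0 ||
	    cpu_get_domain_set(domain, &mask) != 0 ||
	    CPU_EMPTY(&mask))
		return;		/* Fall back to default placement. */

	cpu = CPU_FFS(&mask) - 1;
	for (q = 0; q < nqueues; q++) {
		(void)bus_bind_intr(dev, irq_res[q], cpu);
		/* Advance to the next CPU in the domain, wrapping. */
		do {
			cpu = (cpu + 1) % CPU_SETSIZE;
		} while (!CPU_ISSET(cpu, &mask));
	}
}

A "CPUs suitable for interrupts" variant would simply have the
locality query return a mask already filtered down to, say, one SMT
thread per core, with no change to the driver-side loop.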