From owner-freebsd-hackers@FreeBSD.ORG Thu Dec 25 00:28:25 2014
Subject: Re: status of projects/numa?
From: Gavin Mu
Date: Thu, 25 Dec 2014 08:28:19 +0800
To: Adrian Chadd
Cc: "freebsd-hackers@freebsd.org"
List-Id: Technical Discussions relating to FreeBSD

Hi, Adrian,

Thanks for such detailed information. I think now I can and will:

1. start from projects/numa and have a summary first.
2. post the summary and thoughts out for discussion.
3. merge the latest HEAD code.
4. implement the idea and post the patch for review.

Regards,
Gavin Mu

> On Dec 24, 2014, at 11:09, Adrian Chadd wrote:
>
> Ok, so to summarise the various bits/pieces I hear from around the
> grapevine and from speaking with NUMA people like jhb:
>
> The projects/numa branch has a lot of good stuff in it. Jeff/Isilon's
> idea seems to be to create an allocator policy framework that can be
> specified globally, per-process and/or per-thread. That's pretty good
> stuff right there.
>
> What's missing, however, before it can be useful to drivers is the
> rest of the UMA work, and I think there's some VM pagetable work that
> needs doing. I haven't dug into the latter because I was trusting that
> those working on projects/numa would get to it as they've said, but ..
> well, it's still not done.
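
[To check my understanding of the policy framework idea above, here is
a rough userland sketch of how a global / per-process / per-thread
policy lookup might resolve, most specific setting first. Every name
and value below is invented for illustration; none of it is taken from
projects/numa.]

    /* Illustrative only: resolve an allocation policy by specificity. */
    #include <stdio.h>

    enum numa_policy { POL_NONE, POL_FIRST_TOUCH, POL_ROUND_ROBIN, POL_FIXED_DOMAIN };

    struct numa_policy_spec {
        enum numa_policy pol;
        int domain;             /* only meaningful for POL_FIXED_DOMAIN */
    };

    /* Hypothetical policy state at the three scopes. */
    static struct numa_policy_spec global_pol = { POL_ROUND_ROBIN, -1 };
    static struct numa_policy_spec proc_pol   = { POL_NONE, -1 };
    static struct numa_policy_spec thread_pol = { POL_FIXED_DOMAIN, 1 };

    /* The most specific policy that is actually set wins. */
    static const struct numa_policy_spec *
    resolve_policy(void)
    {
        if (thread_pol.pol != POL_NONE)
            return (&thread_pol);
        if (proc_pol.pol != POL_NONE)
            return (&proc_pol);
        return (&global_pol);
    }

    int
    main(void)
    {
        const struct numa_policy_spec *p = resolve_policy();

        printf("policy %d, domain %d\n", p->pol, p->domain);
        return (0);
    }

[The only point being that a per-thread setting overrides the
per-process one, which overrides the global default.]
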
> From a UMA API perspective, it isn't trying very hard to return memory
> back to the domain that it was allocated from. I'm worried that on a
> real NUMA box with no policy configured or used by userland threads
> (i.e. everything is being scheduled everywhere), we'll end up with
> threads allocating from a CPU-local pool, being migrated to another
> CPU, and then freeing there. So over time the cache will end up
> spiked with non-local pages. I talked to jhb about it but I don't
> think we have a consensus about it at all. I think we should go the
> extra mile of ensuring that when we return pages to UMA they go back
> into the right NUMA domain - I don't want to have to debug systems
> that start really fast and then get slower over time as stuff ends up
> on the wrong CPU. But that's just me.
>
> From the physmem perspective, I'll likely implement an SRAT-aware
> allocator option that allocates local-first, then tries progressively
> less-local domains, and finally just round-robins. We have the cost
> matrix that says how expensive memory is from a given NUMA domain, so
> that shouldn't be too difficult to pre-compute.
>
> From a driver perspective, I've added the basic "which domain am I in"
> call. John's been playing with a couple of drivers (igb/ixgbe I
> think) in his local repository, teaching them what a local interrupt
> is and which local CPU set to start threads on. Ideally all the
> drivers would use the same API for querying their local cpuset and
> assigning worker threads / interrupts appropriately. I should poke
> him again to find out the status of that work and at least get it
> into -HEAD so it can be evaluated and used by people.
>
> What I'm hoping to do with that in the short term is to make it
> generic enough that a consistent set of hints can be configured for a
> given driver to set up its worker thread and cpuset map. I.e., instead
> of each driver having its own "how many queues" sysctl and probe
> logic, it'll have some API to say "give me my number of threads and
> local cpuset(s)", so if someone wants to override it, it's done by
> something like code in the bus layer rather than by individual driver
> hacks. We can also add options for things like "pin threads" or not.
>
> From the driver /allocation/ perspective, there are a few things to
> think about:
>
> * how we allocate busdma memory for things like descriptors;
> * how we allocate DMA memory for things like mbufs, bufs, etc - what
> we're dma'ing into and out of.
>
> None of that is currently specified, although there are a couple of
> tidbits in the projects/numa branch that haven't been fleshed out.
>
> In my vague drawing-on-paper sense of this, I think we should also
> extend busdma a little to be aware of a numaset when allocating
> memory for descriptors. Same with calling malloc and contigmalloc.
> Ideally we'd keep descriptor accesses in memory local to the device.
>
> Now for things like bufs/mbufs/etc - the DMA bits - that's where it
> gets a little tricky. Sometimes we're handed memory to send from/to.
> The NIC RX path allocates memory itself. Storage controllers get
> handed memory to read storage data into. So the higher-layer question
> of "how do we tell something where to allocate memory from" gets a
> little complicated, because (a) we may have to allocate memory for a
> non-local device, knowing we're going to hand it to some device on
> another core, and (b) we then have to return it to the right
> NUMA-aware pool (UMA, vm_page stuff, etc.).
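
[On the local-first physmem allocator above: as I understand it, the
per-domain try-order can be precomputed once from the SRAT/SLIT-style
distance matrix. A toy userland illustration with made-up distances,
not real kernel code:]

    /*
     * Precompute, for each domain, the order in which to try domains:
     * itself first (distance 10 in SLIT convention), then the rest by
     * increasing distance.  An allocator would walk pref[mydomain][]
     * and fall back to round-robin once everything has been tried.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define NDOMAINS 4

    static const int dist[NDOMAINS][NDOMAINS] = {
        { 10, 16, 21, 21 },
        { 16, 10, 21, 21 },
        { 21, 21, 10, 16 },
        { 21, 21, 16, 10 },
    };

    static int pref[NDOMAINS][NDOMAINS];    /* pref[d] = try-order for d */
    static int sort_from;                   /* domain being sorted for */

    static int
    bydist(const void *a, const void *b)
    {
        int da = *(const int *)a, db = *(const int *)b;

        return (dist[sort_from][da] - dist[sort_from][db]);
    }

    static void
    precompute_order(void)
    {
        int d, i;

        for (d = 0; d < NDOMAINS; d++) {
            for (i = 0; i < NDOMAINS; i++)
                pref[d][i] = i;
            sort_from = d;
            qsort(pref[d], NDOMAINS, sizeof(int), bydist);
        }
    }

    int
    main(void)
    {
        int d, i;

        precompute_order();
        for (d = 0; d < NDOMAINS; d++) {
            printf("domain %d try order:", d);
            for (i = 0; i < NDOMAINS; i++)
                printf(" %d", pref[d][i]);
            printf("\n");
        }
        return (0);
    }
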
>
> The other tricksy bit is when drivers want to allocate memory for
> local ephemeral work (e.g. M_TEMP) versus DMA/descriptor/busdma stuff.
> Here's about where I say "don't worry about this - do all the above
> bits first, then worry about worrying about this":
>
> Say we have a NIC driver on an 8-core CPU, and it's in a 4-socket box
> (so 4 sockets * 8 cores per socket). The receive threads can be told
> to run in a specific cpuset local to the CPU the NIC is plugged into.
> But the transmit threads, taskqueue bits, etc may run on any CPU -
> remember, we don't queue traffic and wake up a thread now; we call
> if_transmit() from whichever CPU is running the transmit code. So if
> that CPU is local, memory comes from the local NUMA domain and stuff
> is fine. But if it's transmitting from some /remote/ thread:
>
> * the mbuf that's been allocated to send data is likely allocated from
> the wrong NUMA domain;
> * the local memory used by the driver to do work should be coming from
> memory local to that CPU, but it may call malloc/etc to get temporary
> working memory - and that hopefully is also coming from a local memory
> domain;
> * if it has to allocate descriptor memory or something, then it may
> need to allocate memory from the remote memory domain;
> * .. then touch the NIC hardware, which is remote.
>
> This is where I say "screw it, don't get bogged down in the details."
> It gets hairier when we think about what's local versus remote for
> vm_page entries, because hey, who knows what the right thing to do
> there is. But I think the right thing to do here is to not worry about
> it and get the above stuff done. NUMA-aware network applications are
> likely going to have their threads cpuset to the same domain as the
> NIC, and if that isn't good enough, we'll have to come up with some
> way to mark a socket as local to a NUMA domain so things like mbufs,
> TCP timers, etc get scheduled appropriately.
>
> (This is where I say "And this is where the whole RSS framework and
> NUMA need to know about each other", but I'm not really ready for that
> yet.)
>
> So, that's my braindump of what I remember from discussions with
> others.
>
> I think the stages to go through here are:
>
> * get a firmer idea of what's missing from projects/numa for UMA,
> physmem and the VM allocator / pagedaemon / etc stuff;
> * get jhb's numaset code into -HEAD;
> * get jhb's driver NUMA awareness bits out, add more generic hooks for
> expressing this stuff via hints, and dump that into -HEAD;
> * evaluate what we're going to do about mbufs, malloc/contigmalloc for
> busdma, bounce buffers, descriptors, etc, and get that into -HEAD;
> * make sure the intel-pcm tools (and then hwpmc!) work for the uncore
> CPU interconnects so we can measure how well we are/aren't doing
> things;
> * come back to the drawing board once the above is done and we've got
> some experience with it.
>
> I haven't even started talking about what's needed for /userland/
> memory pages. I think that's part of what needs to be fleshed out /
> discussed with the VM bits.
>
>
>
> -adrian
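
P.S. To check that I follow the point about returning memory to the
domain it came from (the UMA worry above, and the mbuf freed from a
remote thread), here is a toy model of the bookkeeping: each buffer
records its origin domain at allocation time and is always returned to
that domain's free list, no matter which CPU or thread frees it. This
is purely illustrative, not the real UMA code; all the names are made
up.

    /* Toy "free back to the origin domain" bookkeeping. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NDOMAINS 4

    struct buf {
        int origin_domain;          /* recorded at allocation time */
        struct buf *next;
        char data[256];
    };

    static struct buf *freelist[NDOMAINS];  /* per-domain free lists */

    static struct buf *
    buf_alloc(int domain)
    {
        struct buf *b = freelist[domain];

        if (b != NULL) {
            freelist[domain] = b->next;
            return (b);
        }
        /* Pretend this came from physical memory in 'domain'. */
        b = malloc(sizeof(*b));
        if (b != NULL)
            b->origin_domain = domain;
        return (b);
    }

    /* The freeing thread's domain is irrelevant: use the buffer's origin. */
    static void
    buf_free(struct buf *b)
    {
        b->next = freelist[b->origin_domain];
        freelist[b->origin_domain] = b;
    }

    int
    main(void)
    {
        struct buf *b = buf_alloc(2);   /* allocated "on" domain 2 */

        buf_free(b);                    /* freed from some other CPU */
        printf("buffer returned to domain %d free list\n",
            freelist[2] == b ? 2 : -1);
        return (0);
    }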