From owner-freebsd-hackers@FreeBSD.ORG Thu Dec 25 00:28:25 2014
Subject: Re: status of projects/numa?
From: Gavin Mu
Date: Thu, 25 Dec 2014 08:28:19 +0800
To: Adrian Chadd
Cc: "freebsd-hackers@freebsd.org"
List-Id: Technical Discussions relating to FreeBSD

Hi, Adrian,

Thanks for such detailed information. I think now I can and will:

1. start from projects/numa and have a summary first.
2. post the summary and thoughts out for discussion.
3. merge the latest HEAD code.
4. implement the idea and post the patch for review.

Regards,
Gavin Mu

> On Dec 24, 2014, at 11:09, Adrian Chadd wrote:
>
> Ok, so to summarise the various bits/pieces I hear from around the
> grapevine and from speaking with NUMA people like jhb:
>
> The projects/numa branch has a lot of good stuff in it. Jeff/Isilon's
> idea seems to be to create an allocator policy framework that can be
> specified globally, per-process and/or per-thread. That's pretty good
> stuff right there.
>
> What's missing, however, before it can be useful to drivers is the
> rest of the UMA work, and I think there's some VM pagetable work that
> needs doing. I haven't dug into the latter because I was trusting that
> those working on projects/numa would get to it as they've said, but ..
> well, it's still not done.
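
[To check my understanding of the policy framework idea above, here is
a rough userland sketch of how a global / per-process / per-thread
policy lookup might resolve, most specific setting first. Every name
and value below is invented for illustration; none of it is taken from
projects/numa.]

    /* Illustrative only: resolve an allocation policy by specificity. */
    #include <stdio.h>

    enum numa_policy { POL_NONE, POL_FIRST_TOUCH, POL_ROUND_ROBIN, POL_FIXED_DOMAIN };

    struct numa_policy_spec {
        enum numa_policy pol;
        int domain;             /* only meaningful for POL_FIXED_DOMAIN */
    };

    /* Hypothetical policy state at the three scopes. */
    static struct numa_policy_spec global_pol = { POL_ROUND_ROBIN, -1 };
    static struct numa_policy_spec proc_pol   = { POL_NONE, -1 };
    static struct numa_policy_spec thread_pol = { POL_FIXED_DOMAIN, 1 };

    /* The most specific policy that is actually set wins. */
    static const struct numa_policy_spec *
    resolve_policy(void)
    {
        if (thread_pol.pol != POL_NONE)
            return (&thread_pol);
        if (proc_pol.pol != POL_NONE)
            return (&proc_pol);
        return (&global_pol);
    }

    int
    main(void)
    {
        const struct numa_policy_spec *p = resolve_policy();

        printf("policy %d, domain %d\n", p->pol, p->domain);
        return (0);
    }

[The only point being that a per-thread setting overrides the
per-process one, which overrides the global default.]
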
> From a UMA API perspective, it isn't trying very hard to return memory
> back to the domain that it was allocated from. I'm worried that on a
> real NUMA box with no policy configured or used by userland threads
> (i.e. everything is being scheduled everywhere), we'll end up with
> threads allocating from a CPU-local pool, being migrated to another
> CPU, and then freeing there. So over time the cache will end up
> spiked with non-local pages. I talked to jhb about it but I don't
> think we have a consensus about it at all. I think we should go the
> extra mile of ensuring that when we return pages to UMA they go back
> into the right NUMA domain - I don't want to have to debug systems
> that start really fast and then get slower over time as stuff ends up
> on the wrong CPU. But that's just me.
>
> From the physmem perspective, I'll likely implement an SRAT-aware
> allocator option that allocates local-first, then tries progressively
> less-local domains, and finally just round-robins. We have the cost
> matrix that says how expensive memory is from a given NUMA domain, so
> that shouldn't be too difficult to pre-compute.
>
> From a driver perspective, I've added the basic "which domain am I in"
> call. John's been playing with a couple of drivers (igb/ixgbe I
> think) in his local repository, teaching them what a local interrupt
> is and which local CPU set to start threads on. Ideally all the
> drivers would use the same API for querying their local cpuset and
> assigning worker threads / interrupts appropriately. I should poke
> him again to find out the status of that work and at least get it
> into -HEAD so it can be evaluated and used by people.
>
> What I'm hoping to do with that in the short term is to make it
> generic enough that a consistent set of hints can be configured for a
> given driver to set up its worker thread and cpuset map. I.e., instead
> of each driver having its own "how many queues" sysctl and probe
> logic, it'll have some API to say "give me my number of threads and
> local cpuset(s)", so if someone wants to override it, it's done by
> something like code in the bus layer rather than by individual driver
> hacks. We can also add options for things like "pin threads" or not.
>
> From the driver /allocation/ perspective, there are a few things to
> think about:
>
> * how we allocate busdma memory for things like descriptors;
> * how we allocate DMA memory for things like mbufs, bufs, etc - what
> we're dma'ing into and out of.
>
> None of that is currently specified, although there are a couple of
> tidbits in the projects/numa branch that haven't been fleshed out.
>
> In my vague drawing-on-paper sense of this, I think we should also
> extend busdma a little to be aware of a numaset when allocating
> memory for descriptors. Same with calling malloc and contigmalloc.
> Ideally we'd keep descriptor accesses in memory local to the device.
>
> Now for things like bufs/mbufs/etc - the DMA bits - that's where it
> gets a little tricky. Sometimes we're handed memory to send from/to.
> The NIC RX path allocates memory itself. Storage controllers get
> handed memory to read storage data into. So the higher-layer question
> of "how do we tell something where to allocate memory from" gets a
> little complicated, because (a) we may have to allocate memory for a
> non-local device, knowing we're going to hand it to some device on
> another core, and (b) we then have to return it to the right
> NUMA-aware pool (UMA, vm_page stuff, etc.).
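
[On the local-first physmem allocator above: as I understand it, the
per-domain try-order can be precomputed once from the SRAT/SLIT-style
distance matrix. A toy userland illustration with made-up distances,
not real kernel code:]

    /*
     * Precompute, for each domain, the order in which to try domains:
     * itself first (distance 10 in SLIT convention), then the rest by
     * increasing distance.  An allocator would walk pref[mydomain][]
     * and fall back to round-robin once everything has been tried.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define NDOMAINS 4

    static const int dist[NDOMAINS][NDOMAINS] = {
        { 10, 16, 21, 21 },
        { 16, 10, 21, 21 },
        { 21, 21, 10, 16 },
        { 21, 21, 16, 10 },
    };

    static int pref[NDOMAINS][NDOMAINS];    /* pref[d] = try-order for d */
    static int sort_from;                   /* domain being sorted for */

    static int
    bydist(const void *a, const void *b)
    {
        int da = *(const int *)a, db = *(const int *)b;

        return (dist[sort_from][da] - dist[sort_from][db]);
    }

    static void
    precompute_order(void)
    {
        int d, i;

        for (d = 0; d < NDOMAINS; d++) {
            for (i = 0; i < NDOMAINS; i++)
                pref[d][i] = i;
            sort_from = d;
            qsort(pref[d], NDOMAINS, sizeof(int), bydist);
        }
    }

    int
    main(void)
    {
        int d, i;

        precompute_order();
        for (d = 0; d < NDOMAINS; d++) {
            printf("domain %d try order:", d);
            for (i = 0; i < NDOMAINS; i++)
                printf(" %d", pref[d][i]);
            printf("\n");
        }
        return (0);
    }
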
>
> The other tricksy bit is when drivers want to allocate memory for
> local ephemeral work (e.g. M_TEMP) versus DMA/descriptor/busdma stuff.
> Here's about where I say "don't worry about this - do all the above
> bits first, then worry about worrying about this":
>
> Say we have a NIC driver on an 8-core CPU, and it's in a 4-socket box
> (so 4 sockets * 8 cores per socket). The receive threads can be told
> to run in a specific cpuset local to the CPU the NIC is plugged into.
> But the transmit threads, taskqueue bits, etc may run on any CPU -
> remember, we don't queue traffic and wake up a thread now; we call
> if_transmit() from whichever CPU is running the transmit code. So if
> that CPU is local, memory comes from the local NUMA domain and stuff
> is fine. But if it's transmitting from some /remote/ thread:
>
> * the mbuf that's been allocated to send data is likely allocated from
> the wrong NUMA domain;
> * the local memory used by the driver to do work should be coming from
> memory local to that CPU, but it may call malloc/etc to get temporary
> working memory - and that hopefully is also coming from a local memory
> domain;
> * if it has to allocate descriptor memory or something, then it may
> need to allocate memory from the remote memory domain;
> * .. then touch the NIC hardware, which is remote.
>
> This is where I say "screw it, don't get bogged down in the details."
> It gets hairier when we think about what's local versus remote for
> vm_page entries, because hey, who knows what the right thing to do
> there is. But I think the right thing to do here is to not worry about
> it and get the above stuff done. NUMA-aware network applications are
> likely going to have their threads cpuset to the same domain as the
> NIC, and if that isn't good enough, we'll have to come up with some
> way to mark a socket as local to a NUMA domain so things like mbufs,
> TCP timers, etc get scheduled appropriately.
>
> (This is where I say "And this is where the whole RSS framework and
> NUMA need to know about each other", but I'm not really ready for that
> yet.)
>
> So, that's my braindump of what I remember from discussions with
> others.
>
> I think the stages to go through here are:
>
> * get a firmer idea of what's missing from projects/numa for UMA,
> physmem and the VM allocator / pagedaemon / etc stuff;
> * get jhb's numaset code into -HEAD;
> * get jhb's driver NUMA awareness bits out, add more generic hooks for
> expressing this stuff via hints, and dump that into -HEAD;
> * evaluate what we're going to do about mbufs, malloc/contigmalloc for
> busdma, bounce buffers, descriptors, etc, and get that into -HEAD;
> * make sure the intel-pcm tools (and then hwpmc!) work for the uncore
> CPU interconnects so we can measure how well we are/aren't doing
> things;
> * come back to the drawing board once the above is done and we've got
> some experience with it.
>
> I haven't even started talking about what's needed for /userland/
> memory pages. I think that's part of what needs to be fleshed out /
> discussed with the VM bits.
>
>
>
> -adrian
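
P.S. To check that I follow the point about returning memory to the
domain it came from (the UMA worry above, and the mbuf freed from a
remote thread), here is a toy model of the bookkeeping: each buffer
records its origin domain at allocation time and is always returned to
that domain's free list, no matter which CPU or thread frees it. This
is purely illustrative, not the real UMA code; all the names are made
up.

    /* Toy "free back to the origin domain" bookkeeping. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NDOMAINS 4

    struct buf {
        int origin_domain;          /* recorded at allocation time */
        struct buf *next;
        char data[256];
    };

    static struct buf *freelist[NDOMAINS];  /* per-domain free lists */

    static struct buf *
    buf_alloc(int domain)
    {
        struct buf *b = freelist[domain];

        if (b != NULL) {
            freelist[domain] = b->next;
            return (b);
        }
        /* Pretend this came from physical memory in 'domain'. */
        b = malloc(sizeof(*b));
        if (b != NULL)
            b->origin_domain = domain;
        return (b);
    }

    /* The freeing thread's domain is irrelevant: use the buffer's origin. */
    static void
    buf_free(struct buf *b)
    {
        b->next = freelist[b->origin_domain];
        freelist[b->origin_domain] = b;
    }

    int
    main(void)
    {
        struct buf *b = buf_alloc(2);   /* allocated "on" domain 2 */

        buf_free(b);                    /* freed from some other CPU */
        printf("buffer returned to domain %d free list\n",
            freelist[2] == b ? 2 : -1);
        return (0);
    }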