From: "K. Macy"
To: Adrian Chadd
Cc: John Baldwin, Alan Cox, John-Mark Gurney, Konstantin Belousov, "freebsd-arch@freebsd.org"
Date: Fri, 20 Feb 2015 00:17:09 -0800
Subject: Re: getting NUMA into the tree (userland most interesting for me)
List-Id: Discussion related to FreeBSD architecture

>>> Yes, I think we have a fair bit to do in the kernel before we are in a
>>> position to export anything truly useful to userland unfortunately. The last
>>> time I talked with Jeff about projects/numa (after the first draft of the wiki
>>> page) I came away with the impression that there might be some things we can
>>> pull out of that branch, but that it isn't suitable for merging upstream
>>> directly. Jeff noted that he and Alan had gone through several iterations of
>>> this already (I believe at least 3 completely different policy designs) all of
>>> which had their own issues.
>>>
>>> Outside of the VM I think that we can keep the APIs somewhat stable by having
>>> this opaque policy cookie to pass around that we can redefine the guts of
>>> later. However, various parts of the VM all have to handle whatever the
>>> policy defines, and while the vm_phys bits and contigmalloc() might be kind of
>>> obvious to implement, higher level VM layers like kmem() and malloc() are more
>>> complicated. One thing that is in projects/numa is changes for UMA that we
>>> can hopefully reuse much of, but I don't recall how much (if any) of
>>> kmem/malloc is in there. Also, while vm_phys is one of the first things to
>>> do, I know that Alan and Jeff have pending patches to remove the cache queue
>>> (since it is far less useful than it seems) which simplify vm_phys making it
>>> easier to implement NUMA policies there, so I'm hoping we can get that in
>>> sooner before having to start tearing up the VM too much.
>>> This is why the stuff I currently have is targeted at non-VM bits like
>>> interrupts as getting that correct is lower-hanging fruit that might
>>> provide some gains regardless. Even once vm_phys is done I think the
>>> first thing to tackle next is contigmalloc to facilitate static bus_dma
>>> allocations (descriptor rings and such) being local to a device.
>>>
>>
>> Contigmalloc improvements and cache queue removal are in the
>> phabricator queue now. They are also prerequisites for per-cpu free
>> page caches which are a huge scalability improvement for some
>> workloads such as Netflix's.
>>
>> There is still a fair amount of scalability work (including Jeffr's
>> per-domain pagedaemon work) that really needs to happen before we can
>> seriously think about a general user-level NUMA interface.
>
> Is there anything wrong with maybe bringing over the basic low level
> allocator changes from projects/numa so the basics are there?

I think they're probably predicated on the work that is being
shepherded in now. Even if not, it would require someone to shepherd
it in and the corresponding spare cycles from alc to review / revise /
repeat - which seem to be in short supply.

-K