From owner-svn-src-all@freebsd.org Sat Jul 11 16:29:29 2015
From: Alan Cox <alc@rice.edu>
Date: Sat, 11 Jul 2015 11:29:19 -0500
To: Adrian Chadd, src-committers@freebsd.org, svn-src-all@freebsd.org,
    svn-src-head@freebsd.org
Subject: Re: svn commit: r285387 - in head: lib/libc/sys share/man/man4
    sys/conf sys/kern sys/sys sys/vm usr.bin usr.bin/numactl
Message-ID: <55A1445F.50901@rice.edu>
In-Reply-To: <201507111521.t6BFLcrv039934@repo.freebsd.org>

On 07/11/2015 10:21, Adrian Chadd wrote:
> Author: adrian
> Date: Sat Jul 11 15:21:37 2015
> New Revision: 285387
> URL: https://svnweb.freebsd.org/changeset/base/285387
>
> Log:
>   Add an initial NUMA affinity/policy configuration for threads and
>   processes.
>
>   This is based on work done by jeff@ and jhb@, as well as the numa.diff
>   patch that has been circulating when someone asks for first-touch NUMA
>   on -10 or -11.
>
>   * Introduce a simple set of VM policy and iterator types.
>   * tie the policy types into the vm_phys path for now, mirroring how
>     the initial first-touch allocation work was enabled.
>   * add syscalls to control changing thread and process defaults.
>   * add a global NUMA VM domain policy.
>   * implement a simple cascade policy order - if a thread policy exists,
>     use it; if a process policy exists, use it; use the default policy.
>   * processes inherit policies from their parent processes, threads
>     inherit policies from their parent threads.
>   * add a simple tool (numactl) to query and modify default
>     thread/process policies.
>   * add documentation for the new syscalls, for numa and for numactl.
>   * re-enable first touch NUMA again by default, as now policies can be
>     set in a variety of methods.
>
>   This is only relevant for very specific workloads.
>
>   This doesn't pretend to be a final NUMA solution.
>
>   The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
>   'sysctl vm.default_policy=rr'.
>
>   This is only relevant if MAXMEMDOM is set to something other than 1.
>   I.e., if you're using GENERIC or a modified kernel with non-NUMA, then
>   this is a glorified no-op for you.
>
>   Thank you to Norse Corp for giving me access to rather large
>   (for FreeBSD!) NUMA machines in order to develop and verify this.
>
>   Thank you to Dell for providing me with dual socket sandybridge
>   and westmere v3 hardware to do NUMA development with.
>
>   Thank you to Scott Long at Netflix for providing me with access
>   to the two-socket, four-domain haswell v3 hardware.
>
>   Thank you to Peter Holm for running the stress testing suite
>   against the NUMA branch during various stages of development!
>
>   Tested:
>
>   * MIPS (regression testing; non-NUMA)
>   * i386 (regression testing; non-NUMA GENERIC)
>   * amd64 (regression testing; non-NUMA GENERIC)
>   * westmere, 2 socket (thankyou norse!)
>   * sandy bridge, 2 socket (thankyou dell!)
>   * ivy bridge, 2 socket (thankyou norse!)
>   * westmere-EX, 4 socket / 1TB RAM (thankyou norse!)
>   * haswell, 2 socket (thankyou norse!)
>   * haswell v3, 2 socket (thankyou dell)
>   * haswell v3, 2x18 core (thankyou scott long / netflix!)
>
>   * Peter Holm ran a stress test suite on this work and found one
>     issue, but has not been able to verify it (it doesn't look NUMA
>     related, and he only saw it once over many testing runs.)
>
>   * I've tested bhyve instances running in fixed NUMA domains and
>     cpusets; all seems to work correctly.
>
>   Verified:
>
>   * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
>     NUMA policies for processes under test.
>
>   Review:
>
>   This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
>   as well as privately and via emails to freebsd-arch@.  The git history
>   with specific attributes is available at https://github.com/erikarn/freebsd/
>   in the NUMA branch
>   (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
>
>   This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
>   wblock) but not achieved a clear consensus.  My hope is that with
>   further exposure and testing more functionality can be implemented and
>   evaluated.
>
>   Notes:
>
>   * The VM doesn't handle unbalanced domains very well, and if you have
>     an overly unbalanced memory setup whilst under high memory pressure,
>     VM page allocation may fail, leading to a kernel panic.  This was a
>     problem in the past, but it's much more easily triggered now with
>     these tools.

For the record, no, it doesn't panic.  Both the first-touch scheme in 9.x
and the round-robin scheme in 10.x fall back to allocating from a
different domain until some page is found.

Alan
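
To make the behaviour discussed above concrete, what follows is a minimal
userspace sketch of two things: the thread, then process, then
global-default cascade that the commit message describes, and the
cross-domain fallback that the reply points out.  Every name in it
(struct numa_policy, policy_lookup(), alloc_page_domain(), and so on) is
hypothetical; this is not the code in sys/vm or the new syscall
interface, only a small compilable model of the control flow.

/*
 * Toy model only: hypothetical names, not the FreeBSD sys/vm code.
 */
#include <stdio.h>

#define NDOMAINS 4

enum polkind { POL_NONE, POL_FIRST_TOUCH, POL_FIXED_DOMAIN };

struct numa_policy {
    enum polkind kind;
    int fixed_domain;            /* only meaningful for POL_FIXED_DOMAIN */
};

struct proc_model {
    struct numa_policy proc_policy;
};

struct thread_model {
    struct proc_model *proc;
    struct numa_policy thr_policy;
    int local_domain;            /* domain of the CPU the thread runs on */
};

/* Global default, similar in spirit to the vm.default_policy sysctl. */
static struct numa_policy default_policy = { POL_FIRST_TOUCH, 0 };

/* Free page counts per domain, so the fallback is visible. */
static int free_pages[NDOMAINS] = { 3, 0, 5, 7 };

/* Cascade from the commit message: thread, then process, then default. */
static const struct numa_policy *
policy_lookup(const struct thread_model *td)
{
    if (td->thr_policy.kind != POL_NONE)
        return (&td->thr_policy);
    if (td->proc->proc_policy.kind != POL_NONE)
        return (&td->proc->proc_policy);
    return (&default_policy);
}

/*
 * The point of the reply: start at the preferred domain (the local one
 * for first-touch), but keep walking the other domains until some page
 * is found.  Only a machine-wide shortage makes this fail.
 */
static int
alloc_page_domain(const struct thread_model *td)
{
    const struct numa_policy *pol = policy_lookup(td);
    int first, dom, i;

    first = (pol->kind == POL_FIXED_DOMAIN) ? pol->fixed_domain :
        td->local_domain;
    for (i = 0; i < NDOMAINS; i++) {
        dom = (first + i) % NDOMAINS;
        if (free_pages[dom] > 0) {
            free_pages[dom]--;
            return (dom);
        }
    }
    return (-1);                 /* every domain is exhausted */
}

int
main(void)
{
    struct proc_model p = { .proc_policy = { POL_NONE, 0 } };
    struct thread_model t = { .proc = &p,
        .thr_policy = { POL_NONE, 0 }, .local_domain = 1 };

    /* Local domain 1 is empty, so the allocator falls back to domain 2. */
    printf("allocated from domain %d\n", alloc_page_domain(&t));

    /* A thread policy, once set, wins over process and global defaults. */
    t.thr_policy = (struct numa_policy){ POL_FIXED_DOMAIN, 3 };
    printf("allocated from domain %d\n", alloc_page_domain(&t));
    return (0);
}

Built with any C compiler, this prints domain 2 for the first allocation
(the local domain 1 has no free pages, so the walk continues) and domain
3 for the second (the thread policy takes precedence over the process and
global defaults).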