Date: Sat, 11 Jul 2015 11:29:19 -0500
From: Alan Cox <alc@rice.edu>
To: Adrian Chadd <adrian@FreeBSD.org>, src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r285387 - in head: lib/libc/sys share/man/man4 sys/conf sys/kern sys/sys sys/vm usr.bin usr.bin/numactl
Message-ID: <55A1445F.50901@rice.edu>
In-Reply-To: <201507111521.t6BFLcrv039934@repo.freebsd.org>
References: <201507111521.t6BFLcrv039934@repo.freebsd.org>
On 07/11/2015 10:21, Adrian Chadd wrote:
> Author: adrian
> Date: Sat Jul 11 15:21:37 2015
> New Revision: 285387
> URL: https://svnweb.freebsd.org/changeset/base/285387
>
> Log:
>   Add an initial NUMA affinity/policy configuration for threads and processes.
>
>   This is based on work done by jeff@ and jhb@, as well as the numa.diff
>   patch that has been circulating when someone asks for first-touch NUMA
>   on -10 or -11.
>
>   * Introduce a simple set of VM policy and iterator types.
>   * tie the policy types into the vm_phys path for now, mirroring how
>     the initial first-touch allocation work was enabled.
>   * add syscalls to control changing thread and process defaults.
>   * add a global NUMA VM domain policy.
>   * implement a simple cascade policy order - if a thread policy exists, use it;
>     if a process policy exists, use it; use the default policy.
>   * processes inherit policies from their parent processes, threads inherit
>     policies from their parent threads.
>   * add a simple tool (numactl) to query and modify default thread/process
>     policies.
>   * add documentation for the new syscalls, for numa and for numactl.
>   * re-enable first touch NUMA again by default, as now policies can be
>     set in a variety of methods.
>
>   This is only relevant for very specific workloads.
>
>   This doesn't pretend to be a final NUMA solution.
>
>   The previous defaults in -HEAD (with MAXMEMDOM set) can be achieved by
>   'sysctl vm.default_policy=rr'.
>
>   This is only relevant if MAXMEMDOM is set to something other than 1.
>   I.e., if you're using GENERIC or a modified kernel with non-NUMA, then
>   this is a glorified no-op for you.
>
>   Thank you to Norse Corp for giving me access to rather large
>   (for FreeBSD!) NUMA machines in order to develop and verify this.
>
>   Thank you to Dell for providing me with dual socket sandybridge
>   and westmere v3 hardware to do NUMA development with.
>
>   Thank you to Scott Long at Netflix for providing me with access
>   to the two-socket, four-domain haswell v3 hardware.
>
>   Thank you to Peter Holm for running the stress testing suite
>   against the NUMA branch during various stages of development!
>
>   Tested:
>
>   * MIPS (regression testing; non-NUMA)
>   * i386 (regression testing; non-NUMA GENERIC)
>   * amd64 (regression testing; non-NUMA GENERIC)
>   * westmere, 2 socket (thank you norse!)
>   * sandy bridge, 2 socket (thank you dell!)
>   * ivy bridge, 2 socket (thank you norse!)
>   * westmere-EX, 4 socket / 1TB RAM (thank you norse!)
>   * haswell, 2 socket (thank you norse!)
>   * haswell v3, 2 socket (thank you dell)
>   * haswell v3, 2x18 core (thank you scott long / netflix!)
>
>   * Peter Holm ran a stress test suite on this work and found one
>     issue, but has not been able to verify it (it doesn't look NUMA
>     related, and he only saw it once over many testing runs.)
>
>   * I've tested bhyve instances running in fixed NUMA domains and cpusets;
>     all seems to work correctly.
>
>   Verified:
>
>   * intel-pcm - pcm-numa.x and pcm-memory.x, whilst selecting different
>     NUMA policies for processes under test.
>
>   Review:
>
>   This was reviewed through phabricator (https://reviews.freebsd.org/D2559)
>   as well as privately and via emails to freebsd-arch@.  The git history
>   with specific attributes is available at https://github.com/erikarn/freebsd/
>   in the NUMA branch (https://github.com/erikarn/freebsd/compare/local/adrian_numa_policy).
>
>   This has been reviewed by a number of people (stas, rpaulo, kib, ngie,
>   wblock) but has not achieved a clear consensus.  My hope is that with further
>   exposure and testing more functionality can be implemented and evaluated.
>
>   Notes:
>
>   * The VM doesn't handle unbalanced domains very well, and if you have an overly
>     unbalanced memory setup whilst under high memory pressure, VM page allocation
>     may fail, leading to a kernel panic.  This was a problem in the past, but it's
>     much more easily triggered now with these tools.
>

For the record, no, it doesn't panic.  Both the first-touch scheme in 9.x and the round-robin scheme in 10.x fall back to allocating from a different domain until some page is found.

Alan
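
To make that concrete, here is a minimal user-space sketch in C of the two mechanisms discussed above: the cascade policy order from the commit log (thread policy, then process policy, then the global default) and the cross-domain fallback Alan describes.  Every name in it (resolve_preferred_domain(), alloc_from_domain(), numa_alloc_page(), free_pages[]) is a hypothetical stand-in for illustration only, not the vm_phys code or the new syscall interface.

/*
 * Sketch only: hypothetical names, not the r285387 kernel interfaces.
 */
#include <stdio.h>

#define NDOMAIN		2
#define POLICY_NONE	(-1)		/* "no policy set at this level" */

static int free_pages[NDOMAIN] = { 0, 3 };	/* domain 0 is exhausted */

/* Cascade: thread policy wins, then process policy, then the global default. */
static int
resolve_preferred_domain(int thread_policy, int proc_policy, int default_policy)
{
	if (thread_policy != POLICY_NONE)
		return (thread_policy);
	if (proc_policy != POLICY_NONE)
		return (proc_policy);
	return (default_policy);
}

/* Pretend per-domain allocator: 1 on success, 0 if the domain is empty. */
static int
alloc_from_domain(int dom)
{
	if (free_pages[dom] == 0)
		return (0);
	free_pages[dom]--;
	return (1);
}

/* Try the preferred domain first, then every other domain in turn. */
static int
numa_alloc_page(int preferred)
{
	if (alloc_from_domain(preferred))
		return (preferred);
	for (int dom = 0; dom < NDOMAIN; dom++)
		if (dom != preferred && alloc_from_domain(dom))
			return (dom);
	return (-1);			/* truly out of pages everywhere */
}

int
main(void)
{
	/* No thread policy; the process policy pins domain 0; default is 1. */
	int preferred = resolve_preferred_domain(POLICY_NONE, 0, 1);

	/* Domain 0 is empty, so the page comes from domain 1 instead. */
	printf("allocated from domain %d\n", numa_alloc_page(preferred));
	return (0);
}

Built and run as-is, this prints "allocated from domain 1": the process policy prefers domain 0, but that domain is exhausted, so the allocator falls back to domain 1 rather than failing.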