Date: Wed, 25 Sep 2019 13:02:55 -0400
From: Mark Johnston <markj@freebsd.org>
To: Mark Millard <marklmi@yahoo.com>
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-ID: <20190925170255.GA43643@raichu>
In-Reply-To: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com>
On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
> Note: I have access to only one FreeBSD amd64 context, and
> it is also my only access to a NUMA context: 2 memory
> domains. A Threadripper 1950X context. Also: I have only
> a head FreeBSD context on any architecture, not 12.x or
> before. So I have limited compare/contrast material.
>
> I present the below basically to ask if the NUMA handling
> has been validated, or if it is going to be, at least for
> contexts that might apply to ThreadRipper 1950X and
> analogous contexts. My results suggest it has not been (or
> libc++'s now() times get messed up such that it looks like
> NUMA mishandling, since this is based on odd benchmark
> results that involve mean time for laps, using a median
> of such across multiple trials).
>
> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
> 1950X and got expected results on Fedora but odd ones on
> FreeBSD. The benchmark is a variation on the old HINT
> benchmark, including the old multi-threading variation. I
> later tried Fedora because the FreeBSD results looked odd.
> The other architectures I tried FreeBSD benchmarking on
> did not look odd like this. (powerpc64 on an old PowerMac,
> 2 sockets with 2 cores per socket; aarch64 Cortex-A57
> Overdrive 1000; Cortex-A53 Pine64+ 2GB; armv7 Cortex-A7
> Orange Pi+ 2nd Ed. For these I used 4 threads, not more.)
>
> I tend to write in terms of plots made from the data instead
> of the raw benchmark data.
>
> FreeBSD testing based on:
> cpuset -l0-15 -n prefer:1
> cpuset -l16-31 -n prefer:1
>
> Fedora 30 testing based on:
> numactl --preferred 1 --cpunodebind 0
> numactl --preferred 1 --cpunodebind 1
>
> While I have more results, I reference primarily DSIZE
> and ISIZE being unsigned long long and also both being
> unsigned long as examples. Variations in results are not
> from the type differences for any LP64 architectures.
> (But they give an idea of benchmark variability in the
> test context.)
>
> The Fedora results solidly show the bandwidth limitation
> of using one memory controller. They also show the latency
> consequences for the remote memory domain case vs. the
> local memory domain case. There is not a lot of
> variability between the examples of the 2 type-pairs used
> for Fedora.
>
> Not true for FreeBSD on the 1950X:
>
> A) The latency-constrained part of the graph looks to
> normally be using the local memory domain when
> -l0-15 is in use for 8 threads.
>
> B) Both the -l0-15 and the -l16-31 parts of the
> graph for 8 threads that should be bandwidth
> limited show mostly examples that would have to
> involve both memory controllers for the bandwidth
> to get the results shown, as far as I can tell.
> There is also wide variability, ranging between the
> expected 1-controller result and, say, what a
> 2-controller round-robin would be expected to produce.
>
> C) Even the single-threaded result shows a higher
> result for larger total bytes for the kernel
> vectors. Fedora does not.
>
> I think that (B) is the most solid evidence for
> something being odd.

The implication seems to be that your benchmark program is using pages
from both domains despite a policy which preferentially allocates pages
from domain 1, so you would first want to determine whether this is
actually what's happening. As far as I know we currently don't have a
good way of characterizing per-domain memory usage within a process.
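As a first sanity check, it may be worth verifying the domain mask and
policy the benchmark process actually inherits from cpuset(1).  An
untested sketch using cpuset_getdomain(2), which queries them for the
current process:

/*
 * Print the memory domain mask and allocation policy inherited by
 * this process, via cpuset_getdomain(2).  Untested sketch.
 */
#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/domainset.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
	domainset_t mask;
	int i, policy;

	/* id -1 means the current process at CPU_LEVEL_WHICH. */
	if (cpuset_getdomain(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
	    sizeof(mask), &mask, &policy) != 0)
		err(1, "cpuset_getdomain");

	printf("policy: %d (DOMAINSET_POLICY_PREFER is %d)\n",
	    policy, DOMAINSET_POLICY_PREFER);
	printf("domains:");
	for (i = 0; i < DOMAINSET_SETSIZE; i++)
		if (DOMAINSET_ISSET(i, &mask))
			printf(" %d", i);
	printf("\n");
	return (0);
}

I'd expect that running this under "cpuset -n prefer:1" reports the
prefer policy with domain 1 in the mask; that only confirms the policy
is in place, though, not where pages actually come from.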
If your benchmark uses a large fraction of the system's memory, you
could use the vm.phys_free sysctl to get a sense of how much memory
from each domain is free.

Another possibility is to use DTrace to trace the requested domain in
vm_page_alloc_domain_after().  For example, the following DTrace
one-liner counts the number of pages allocated per domain by ls(1):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
...
        0               71
        1               72
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
...
        1              143
# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
...
        0              143

This approach might not work for various reasons depending on how
exactly your benchmark program works.
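To make the vm.phys_free comparison concrete, something along these
lines (untested, and only meaningful if the allocation is large
relative to other memory activity on the system) snapshots the sysctl
output before and after faulting in a big buffer, so you can eyeball
the per-domain deltas; the 512MB size is just illustrative:

/*
 * Rough illustration: dump vm.phys_free, fault in a large buffer,
 * and dump it again.  The output is a human-readable table of free
 * page counts per domain, so compare the two snapshots by hand.
 */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void
dump_phys_free(const char *tag)
{
	char buf[65536];
	size_t len = sizeof(buf);

	if (sysctlbyname("vm.phys_free", buf, &len, NULL, 0) != 0)
		err(1, "sysctlbyname(vm.phys_free)");
	printf("--- %s ---\n%.*s\n", tag, (int)len, buf);
}

int
main(void)
{
	size_t sz = 512UL * 1024 * 1024;	/* illustrative size */
	char *p;

	dump_phys_free("before");
	if ((p = malloc(sz)) == NULL)
		err(1, "malloc");
	memset(p, 1, sz);	/* force the pages to be allocated */
	dump_phys_free("after");
	free(p);
	return (0);
}

The DTrace one-liners above are more direct, since they attribute
allocations to the process rather than inferring them from system-wide
counters.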