Date:      Wed, 25 Sep 2019 22:03:14 -0700
From:      Mark Millard <marklmi@yahoo.com>
To:        Mark Johnston <markj@FreeBSD.org>
Cc:        freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Subject:   Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-ID:  <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com>
In-Reply-To: <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>
References:  <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu> <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com> <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>



On 2019-Sep-25, at 20:27, Mark Millard <marklmi at yahoo.com> wrote:

> On 2019-Sep-25, at 19:26, Mark Millard <marklmi at yahoo.com> wrote:
>
>> On 2019-Sep-25, at 10:02, Mark Johnston <markj at FreeBSD.org> wrote:
>>
>>>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>>>> Note: I have access to only one FreeBSD amd64 context, and
>>>> it is also my only access to a NUMA context: 2 memory
>>>> domains. A Threadripper 1950X context. Also: I have only
>>>> a head FreeBSD context on any architecture, not 12.x or
>>>> before. So I have limited compare/contrast material.
>>>>
>>>> I present the below basically to ask if the NUMA handling
>>>> has been validated, or if it is going to be, at least for
>>>> contexts that might apply to ThreadRipper 1950X and
>>>> analogous contexts. My results suggest it has not been (or
>>>> libc++'s now() times get messed up such that it looks like
>>>> NUMA mishandling, since this is based on odd benchmark
>>>> results that involve mean times for laps, using a median
>>>> of such across multiple trials).
>>>>
>>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>>> 1950X and got the expected results on Fedora but odd ones on
>>>> FreeBSD. The benchmark is a variation on the old HINT
>>>> benchmark, including the old multi-threading variation. I
>>>> later tried Fedora because the FreeBSD results looked odd.
>>>> The other architectures I tried FreeBSD benchmarking on
>>>> did not look odd like this. (powerpc64 on an old 2-socket
>>>> PowerMac with 2 cores per socket; aarch64 Cortex-A57
>>>> Overdrive 1000; Cortex-A53 Pine64+ 2GB; armv7 Cortex-A7
>>>> Orange Pi+ 2nd Ed. For these I used 4 threads, not more.)
>>>>
>>>> I tend to write in terms of plots made from the data instead
>>>> of the raw benchmark data.
>>>>
>>>> FreeBSD testing based on:
>>>> cpuset -l0-15  -n prefer:1
>>>> cpuset -l16-31 -n prefer:1
>>>>
>>>> Fedora 30 testing based on:
>>>> numactl --preferred 1 --cpunodebind 0
>>>> numactl --preferred 1 --cpunodebind 1
>>>>
>>>> While I have more results, I primarily reference the cases
>>>> of DSIZE and ISIZE both being unsigned long long and both
>>>> being unsigned long as examples. Variations in the results
>>>> do not come from the type differences on any LP64
>>>> architecture. (But they do give an idea of benchmark
>>>> variability in the test context.)
>>>>
>>>> The Fedora results solidly show the bandwidth limitation
>>>> of using one memory controller. They also show the latency
>>>> consequences for the remote memory domain case vs. the
>>>> local memory domain case. There is not a lot of
>>>> variability between the examples of the 2 type-pairs used
>>>> for Fedora.
>>>>
>>>> Not true for FreeBSD on the 1950X:
>>>>
>>>> A) The latency-constrained part of the graph looks to
>>>> normally be using the local memory domain when
>>>> -l0-15 is in use for 8 threads.
>>>>
>>>> B) Both the -l0-15 and the -l16-31 parts of the
>>>> graph for 8 threads that should be bandwidth
>>>> limited show mostly examples that would have to
>>>> involve both memory controllers for the bandwidth
>>>> to get the results shown as far as I can tell.
>>>> There is also wide variability, ranging between the
>>>> expected 1-controller result and, say, what a
>>>> 2-controller round-robin would be expected to produce.
>>>>
>>>> C) Even the single threaded result shows a higher
>>>> result for larger total bytes for the kernel
>>>> vectors. Fedora does not.
>>>>
>>>> I think that (B) is the most solid evidence for
>>>> something being odd.
>>>
>>> The implication seems to be that your benchmark program is using pages
>>> from both domains despite a policy which preferentially allocates pages
>>> from domain 1, so you would first want to determine if this is actually
>>> what's happening.  As far as I know we currently don't have a good way
>>> of characterizing per-domain memory usage within a process.
>>>
>>> If your benchmark uses a large fraction of the system's memory, you
>>> could use the vm.phys_free sysctl to get a sense of how much memory from
>>> each domain is free.
>>
>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
>> domain. I've never configured the benchmark such that it even reaches
>> 10 GiBytes on this hardware. (It stops for a time constraint first,
>> based on the values in use for the "adjustable" items.)
>>
>> . . . (much omitted material) . . .
>
>>
>>> Another possibility is to use DTrace to trace the
>>> requested domain in vm_page_alloc_domain_after().  For example, the
>>> following DTrace one-liner counts the number of pages allocated per
>>> domain by ls(1):
>>>
>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>>> ...
>>> 	0               71
>>> 	1               72
>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>>> ...
>>> 	1              143
>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>>> ...
>>> 	0              143
>>
>> I'll think about this, although it would give no
>> information about which CPUs are executing the threads
>> that are allocating or accessing the vectors for
>> the integration kernel. So, for example, if the
>> threads migrate to or start out on CPUs they
>> should not be on, this would not report that.
>>
>> For such "which CPUs" questions one stab would
>> be simply to watch with top while the benchmark
>> is running and see which CPUs end up being busy
>> vs. which do not. I think I'll try this.
>
> Using top did not show evidence of the wrong
> CPUs being actively in use.
>
> My variation of top is unusual in that it also
> tracks some maximum observed figures and reports
> them, here being:
>
> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)
>
> (no swap use was reported). This gives a system
> level view of about how much RAM was put to use
> during the monitoring of the 2 benchmark runs
> (-l0-15 and -l16-31). Nowhere near enough was used
> to require both memory domains to be in use.
>
> Thus, it would appear to be just where the
> allocations are made for -n prefer:1 that
> matters, at least when a (temporary) thread
> does the allocations.
>
>>> This approach might not work for various reasons depending on how
>>> exactly your benchmark program works.
>
> I've not tried dtrace yet.

Well, for an example -l0-15 -n prefer:1 run
of just the 8-thread benchmark case . . .

dtrace: pid 10997 has exited

        0              712
        1          6737529

Something is leading to domain 0
allocations, despite -n prefer:1 .
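
If it would help, I can also try a variant
that records the kernel stacks for just the
domain 0 allocations, something like the
following (with ./benchmark standing in for
however the benchmark is actually invoked;
the command line is just a sketch):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}' -c "cpuset -l0-15 -n prefer:1 ./benchmark"

That should show which kernel code paths are
requesting the domain 0 pages.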

So I tried -l16-31 -n prefer:1 and it got:

dtrace: pid 11037 has exited

        0                2
        1          8055389

(The larger number of allocations is
not a surprise: more work was done in
about the same overall time, given
the generally faster memory access.)
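
For the "which CPUs do the allocating" question,
a variant of the same one-liner that also keys
the aggregation on dtrace's built-in cpu variable
might be enough, for example (again with
./benchmark as a stand-in for the real
invocation):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2], cpu] = count();}' -c "cpuset -l16-31 -n prefer:1 ./benchmark"

That would give per-(domain, CPU) allocation
counts instead of just relying on watching top.
A cross-check of the mask actually applied to
the running process should also be possible via
something like:

# cpuset -g -p <pid of the benchmark process>

which reports the CPU list the process is
restricted to.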

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)



