From owner-freebsd-amd64@freebsd.org Fri Sep 27 00:05:45 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard <marklmi@yahoo.com>
In-Reply-To: <20190926202936.GD5581@raichu>
Date: Thu, 26 Sep 2019 17:05:38 -0700
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Message-Id: <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu> <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com> <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com> <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com> <20190926202936.GD5581@raichu>
To: Mark Johnston
List-Id: Porting FreeBSD to the AMD64 platform

On 2019-Sep-26, at 13:29, Mark Johnston wrote:

> On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote:
>>
>> On 2019-Sep-25, at 20:27, Mark Millard wrote:
>>
>>> On 2019-Sep-25, at 19:26, Mark Millard wrote:
>>>
>>>> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>>>>
>>>>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>>>>>> Note: I have access to only one FreeBSD amd64 context, and
>>>>>> it is also my only access to a NUMA context: 2 memory
>>>>>> domains. A Threadripper 1950X context. Also: I have only
>>>>>> a head FreeBSD context on any architecture, not 12.x or
>>>>>> before. So I have limited compare/contrast material.
>>>>>>
>>>>>> I present the below basically to ask if the NUMA handling
>>>>>> has been validated, or if it is going to be, at least for
>>>>>> contexts that might apply to ThreadRipper 1950X and
>>>>>> analogous contexts. My results suggest it has not been (or
>>>>>> libc++'s now() times get messed up such that it looks like
>>>>>> NUMA mishandling, since this is based on odd benchmark
>>>>>> results that involve mean time for laps, using a median
>>>>>> of such across multiple trials).
>>>>>>
>>>>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>>>>> 1950X and got expected results on Fedora but odd ones on
>>>>>> FreeBSD. The benchmark is a variation on the old HINT
>>>>>> benchmark, spanning the old multi-threading variation. I
>>>>>> later tried Fedora because the FreeBSD results looked odd.
>>>>>> The other architectures I tried FreeBSD benchmarking with
>>>>>> did not look odd like this.
>>>>>> (powerpc64 on an old PowerMac with 2
>>>>>> sockets and 2 cores per socket, aarch64 Cortex-A57 Overdrive
>>>>>> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
>>>>>> Ed. For these I used 4 threads, not more.)
>>>>>>
>>>>>> I tend to write in terms of plots made from the data instead
>>>>>> of the raw benchmark data.
>>>>>>
>>>>>> FreeBSD testing based on:
>>>>>> cpuset -l0-15 -n prefer:1
>>>>>> cpuset -l16-31 -n prefer:1
>>>>>>
>>>>>> Fedora 30 testing based on:
>>>>>> numactl --preferred 1 --cpunodebind 0
>>>>>> numactl --preferred 1 --cpunodebind 1
>>>>>>
>>>>>> While I have more results, I reference primarily DSIZE
>>>>>> and ISIZE being unsigned long long and also both being
>>>>>> unsigned long as examples. Variations in results are not
>>>>>> from the type differences for any LP64 architectures.
>>>>>> (But they give an idea of benchmark variability in the
>>>>>> test context.)
>>>>>>
>>>>>> The Fedora results solidly show the bandwidth limitation
>>>>>> of using one memory controller. They also show the latency
>>>>>> consequences for the remote memory domain case vs. the
>>>>>> local memory domain case. There is not a lot of
>>>>>> variability between the examples of the 2 type-pairs used
>>>>>> for Fedora.
>>>>>>
>>>>>> Not true for FreeBSD on the 1950X:
>>>>>>
>>>>>> A) The latency-constrained part of the graph looks to
>>>>>> normally be using the local memory domain when
>>>>>> -l0-15 is in use for 8 threads.
>>>>>>
>>>>>> B) Both the -l0-15 and the -l16-31 parts of the
>>>>>> graph for 8 threads that should be bandwidth
>>>>>> limited mostly show examples that would have to
>>>>>> involve both memory controllers for the bandwidth
>>>>>> to reach the results shown, as far as I can tell.
>>>>>> There is also wide variability, ranging between the
>>>>>> expected 1-controller result and, say, what a
>>>>>> 2-controller round-robin would be expected to produce.
>>>>>>
>>>>>> C) Even the single-threaded result shows a higher
>>>>>> result for larger total bytes for the kernel
>>>>>> vectors. Fedora does not.
>>>>>>
>>>>>> I think that (B) is the most solid evidence for
>>>>>> something being odd.
>>>>>
>>>>> The implication seems to be that your benchmark program is using pages
>>>>> from both domains despite a policy which preferentially allocates pages
>>>>> from domain 1, so you would first want to determine if this is actually
>>>>> what's happening. As far as I know we currently don't have a good way
>>>>> of characterizing per-domain memory usage within a process.
>>>>>
>>>>> If your benchmark uses a large fraction of the system's memory, you
>>>>> could use the vm.phys_free sysctl to get a sense of how much memory from
>>>>> each domain is free.
>>>>
>>>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
>>>> domain. I've never configured the benchmark such that it even reaches
>>>> 10 GiBytes on this hardware. (It stops for a time constraint first,
>>>> based on the values in use for the "adjustable" items.)
>>>>
>>>> . . . (much omitted material) . . .
>>>
>>>>
>>>>> Another possibility is to use DTrace to trace the
>>>>> requested domain in vm_page_alloc_domain_after(). For example, the
>>>>> following DTrace one-liner counts the number of pages allocated per
>>>>> domain by ls(1):
>>>>>
>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>>>>> ...
>>>>>         0               71
>>>>>         1               72
>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>>>>> ...
>>>>>         1              143
>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>>>>> ...
>>>>>         0              143
>>>>
>>>> I'll think about this, although it would give no
>>>> information about which CPUs are executing the threads
>>>> that are allocating or accessing the vectors for
>>>> the integration kernel. So, for example, if the
>>>> threads migrate to or start out on CPUs they
>>>> should not be on, this would not report such.
>>>>
>>>> For such "which CPUs" questions one stab would
>>>> be simply to watch with top while the benchmark
>>>> is running and see which CPUs end up being busy
>>>> vs. which do not. I think I'll try this.
>>>
>>> Using top did not show evidence of the wrong
>>> CPUs being actively in use.
>>>
>>> My variation of top is unusual in that it also
>>> tracks some maximum observed figures and reports
>>> them, here being:
>>>
>>> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)
>>>
>>> (No swap use was reported.) This gives a system-
>>> level view of about how much RAM was put to use
>>> during the monitoring of the 2 benchmark runs
>>> (-l0-15 and -l16-31). Nowhere near enough was used
>>> to require both memory domains to be in use.
>>>
>>> Thus, it would appear to be just where the
>>> allocations are made for -n prefer:1 that
>>> matters, at least when a (temporary) thread
>>> does the allocations.
>>>
>>>>> This approach might not work for various reasons depending on how
>>>>> exactly your benchmark program works.
>>>
>>> I've not tried dtrace yet.
>>
>> Well, for an example -l0-15 -n prefer:1 run
>> for just the 8-thread benchmark case . . .
>>
>> dtrace: pid 10997 has exited
>>
>>         0              712
>>         1          6737529
>>
>> Something is leading to domain 0
>> allocations, despite -n prefer:1 .
>
> You can get a sense of where these allocations are occurring by changing
> the probe to capture kernel stacks for domain 0 page allocations:
>
> fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}
>
> One possibility is that these are kernel memory allocations occurring in
> the context of the benchmark threads. Such allocations may not respect
> the configured policy since they are not private to the allocating
> thread. For instance, upon opening a file, the kernel may allocate a
> vnode structure for that file. That vnode may be accessed by threads
> from many processes over its lifetime, and may be recycled many times
> before its memory is released back to the allocator.

For -l0-15 -n prefer:1 :

Looks like this reports sys_thr_new activity, sys_cpuset
activity, and 0xffffffff80bc09bd activity (whatever that
is). Mostly sys_thr_new activity, over 1300 of them . . .
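(For reference, a one-liner of roughly this form wraps the probe above
for such a run; "./benchmark" is only a placeholder name in this sketch,
not the actual invocation used:

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}' -c "cpuset -l0-15 -n prefer:1 ./benchmark"
)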
dtrace: pid 13553 has exited

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_init+0x22
  kernel`keg_alloc_slab+0x259
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_alloc+0x23
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`cpuset_setproc+0x65
  kernel`sys_cpuset+0x123
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`uma_zfree_arg+0x36a
  kernel`thread_reap+0x106
  kernel`thread_alloc+0xf
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    6

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`uma_zfree_arg+0x36a
  kernel`vm_map_process_deferred+0x8c
  kernel`vm_map_remove+0x11d
  kernel`vmspace_exit+0xd3
  kernel`exit1+0x5a9
  kernel`0xffffffff80bc09bd
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    6

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_alloc+0x23
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    22

  kernel`vm_page_grab_pages+0x1b4
  kernel`vm_thread_stack_create+0xc0
  kernel`kstack_import+0x52
  kernel`uma_zalloc_arg+0x62b
  kernel`vm_thread_new+0x4d
  kernel`thread_alloc+0x31
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    1324

For -l16-31 -n prefer:1 :

Again, exactly 2 domain 0 allocations, both from sys_cpuset . . .

dtrace: pid 13594 has exited

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`cpuset_setproc+0x65
  kernel`sys_cpuset+0x123
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

>
> Given the low number of domain 0 allocations, I am skeptical that they
> are responsible for the variability in your results.
>
>> So I tried -l16-31 -n prefer:1 and it got:
>>
>> dtrace: pid 11037 has exited
>>
>>         0                2
>>         1          8055389
>>
>> (The larger number of allocations is
>> not a surprise: more work done in
>> about the same overall time based on
>> faster memory access generally.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)