From owner-freebsd-amd64@freebsd.org Sat Sep 28 18:34:20 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard <marklmi@yahoo.com>
Date: Sat, 28 Sep 2019 11:34:15 -0700
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
On 2019-Sep-27, at 15:22, Mark Millard wrote:

> On 2019-Sep-27, at 13:52, Mark Millard wrote:
>
>> On 2019-Sep-27, at 12:24, Mark Johnston wrote:
>>
>>> On Thu, Sep 26, 2019 at 08:37:39PM -0700, Mark Millard wrote:
>>>>
>>>> On 2019-Sep-26, at 17:05, Mark Millard wrote:
>>>>
>>>>> On 2019-Sep-26, at 13:29, Mark Johnston wrote:
>>>>>
>>>>>> One possibility is that these are kernel memory allocations occurring in
>>>>>> the context of the benchmark threads.  Such allocations may not respect
>>>>>> the configured policy since they are not private to the allocating
>>>>>> thread.  For instance, upon opening a file, the kernel may allocate a
>>>>>> vnode structure for that file.  That vnode may be accessed by threads
>>>>>> from many processes over its lifetime, and may be recycled many times
>>>>>> before its memory is released back to the allocator.
>>>>>
>>>>> For -l0-15 -n prefer:1 :
>>>>>
>>>>> Looks like this reports sys_thr_new activity, sys_cpuset
>>>>> activity, and 0xffffffff80bc09bd activity (whatever that
>>>>> is). Mostly sys_thr_new activity, over 1300 of them . . .
>>>>>
>>>>> dtrace: pid 13553 has exited
>>>>>
>>>>>   kernel`uma_small_alloc+0x61
>>>>>   kernel`keg_alloc_slab+0x10b
>>>>>   kernel`zone_import+0x1d2
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`thread_init+0x22
>>>>>   kernel`keg_alloc_slab+0x259
>>>>>   kernel`zone_import+0x1d2
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`thread_alloc+0x23
>>>>>   kernel`thread_create+0x13a
>>>>>   kernel`sys_thr_new+0xd2
>>>>>   kernel`amd64_syscall+0x3ae
>>>>>   kernel`0xffffffff811b7600
>>>>>     2
>>>>>
>>>>>   kernel`uma_small_alloc+0x61
>>>>>   kernel`keg_alloc_slab+0x10b
>>>>>   kernel`zone_import+0x1d2
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`cpuset_setproc+0x65
>>>>>   kernel`sys_cpuset+0x123
>>>>>   kernel`amd64_syscall+0x3ae
>>>>>   kernel`0xffffffff811b7600
>>>>>     2
>>>>>
>>>>>   kernel`uma_small_alloc+0x61
>>>>>   kernel`keg_alloc_slab+0x10b
>>>>>   kernel`zone_import+0x1d2
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`uma_zfree_arg+0x36a
>>>>>   kernel`thread_reap+0x106
>>>>>   kernel`thread_alloc+0xf
>>>>>   kernel`thread_create+0x13a
>>>>>   kernel`sys_thr_new+0xd2
>>>>>   kernel`amd64_syscall+0x3ae
>>>>>   kernel`0xffffffff811b7600
>>>>>     6
>>>>>
>>>>>   kernel`uma_small_alloc+0x61
>>>>>   kernel`keg_alloc_slab+0x10b
>>>>>   kernel`zone_import+0x1d2
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`uma_zfree_arg+0x36a
>>>>>   kernel`vm_map_process_deferred+0x8c
>>>>>   kernel`vm_map_remove+0x11d
>>>>>   kernel`vmspace_exit+0xd3
>>>>>   kernel`exit1+0x5a9
>>>>>   kernel`0xffffffff80bc09bd
>>>>>   kernel`amd64_syscall+0x3ae
>>>>>   kernel`0xffffffff811b7600
>>>>>     6
>>>>>
>>>>>   kernel`uma_small_alloc+0x61
>>>>>   kernel`keg_alloc_slab+0x10b
>>>>>   kernel`zone_import+0x1d2
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`thread_alloc+0x23
>>>>>   kernel`thread_create+0x13a
>>>>>   kernel`sys_thr_new+0xd2
>>>>>   kernel`amd64_syscall+0x3ae
>>>>>   kernel`0xffffffff811b7600
>>>>>     22
>>>>>
>>>>>   kernel`vm_page_grab_pages+0x1b4
>>>>>   kernel`vm_thread_stack_create+0xc0
>>>>>   kernel`kstack_import+0x52
>>>>>   kernel`uma_zalloc_arg+0x62b
>>>>>   kernel`vm_thread_new+0x4d
>>>>>   kernel`thread_alloc+0x31
>>>>>   kernel`thread_create+0x13a
>>>>>   kernel`sys_thr_new+0xd2
>>>>>   kernel`amd64_syscall+0x3ae
>>>>>   kernel`0xffffffff811b7600
>>>>>     1324
>>>>
>>>> With sys_thr_new not respecting -n prefer:1 for
>>>> -l0-15 (especially for the thread stacks), I
>>>> looked some at the generated integration kernel
>>>> code and it makes significant use of %rsp-based
>>>> memory accesses (read and write).
>>>>
>>>> That would get both memory controllers going in
>>>> parallel (kernel vectors accesses to the preferred
>>>> memory domain), so not slowing down as expected.
>>>>
>>>> If round-robin is not respected for thread stacks,
>>>> and if threads migrate cpus across memory domains
>>>> at times, there could be considerable variability
>>>> for that context as well. (This may not be the
>>>> only way to have different/extra variability for
>>>> this context.)
>>>>
>>>> Overall: I'd be surprised if this was not
>>>> contributing to what I thought was odd about
>>>> the benchmark results.
>>>
>>> Your tracing refers to kernel thread stacks though, not the stacks used
>>> by threads when executing in user mode.  My understanding is that a HINT
>>> implementation would spend virtually all of its time in user mode, so it
>>> shouldn't matter much or at all if kernel thread stacks are backed by
>>> memory from the "wrong" domain.
>>
>> Looks like I was trying to think about it when I should have been sleeping.
>> You are correct.
>>
>>> This also doesn't really explain some of the disparities in the plots
>>> you sent me.  For instance, you get a much higher peak QUIPS on FreeBSD
>>> than on Fedora with 16 threads and an interleave/round-robin domain
>>> selection policy.
>>
>> True.
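For reference, kernel-stack aggregations like the ones quoted above can be
produced with a DTrace one-liner along these lines. This is a sketch only:
it assumes root on FreeBSD, and BENCH_PID is a hypothetical placeholder for
the benchmark's pid (13553 in the quoted run); `-p` makes dtrace exit when
the target process does, matching the "dtrace: pid 13553 has exited" line.

```shell
# Sketch (FreeBSD, run as root): aggregate kernel stacks seen at
# uma_small_alloc entry while the traced process runs, printing the
# stack/count table when the process exits.
dtrace -p "$BENCH_PID" -n '
    fbt::uma_small_alloc:entry { @[stack()] = count(); }
'
```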
>> I suppose that there is the possibility that steady_clock's now() results
>> are odd for some reason in this type of context, leading to the durations
>> between calls being on the short side where things look different.
>>
>> But the left hand side of the single-thread results (smaller memory sizes
>> for the vectors for the integration kernel's use) do not show such a
>> rescaling. (The single-thread time measurements are strictly inside the
>> thread of execution; no thread creation or such is counted for any size.)
>> The right hand side of the single-thread results (larger memory use,
>> making the smaller cache levels fairly ineffective) does generally show
>> some rescaling, but not as drastic as multi-threaded.
>>
>> Both round-robin and prefer:1 showed such for single-threaded.
>
> Just to be explicit about what would be executed in the FreeBSD
> kernel . . .
>
> One difference between single-threaded vs. multi-threaded for
> the benchmark code is that the multi-threaded variant calls
> steady_clock's now() from the main thread, counting time that thread
> creations contribute. Single-threaded calls steady_clock's now() from
> inside the same thread that executes the integration kernel, not
> counting thread creation.
>
> steady_clock's now() uses system calls requesting CLOCK_MONOTONIC,
> from what I've seen with truss.
>
> This would be code involved from the FreeBSD kernel that could
> contribute some to the measured time.
>
> Having the kernel stack for this on the memory domain where the
> time-measuring CPU is, vs. on a remote memory domain, might
> make some difference in duration results. (But I've no clue
> specifically what to expect for the differences in my context, so it
> may well not explain much of anything.)

In case anyone else is following along: gradually exploring different
contexts is isolating the plot characteristics.

The scale difference vs. Fedora 30 seems to always exist.
But it turns out that the messy right hand side of the plots (widely
variable QUality Improvements Per Second figures, compared to the expected
structure for QUIPS results) for prefer:N is specific to prefer:1, for
example. (Only 2 memory domains are available in my testing context.)

I've sent Mark Johnston 3 more plots because (not in order of discovery):

A) I discovered that a non-NUMA kernel does not show the variability
issue for either -l0-15 or -l16-31 for cpuset: both get fairly clean
results, showing a clear difference between local vs. remote memory
being involved as well.

B) For the NUMA kernel, prefer:0 is like (A) above: again not widely
variable. This is unlike the prefer:1 result. So prefer:0 and prefer:1
are not near being symmetric (swapping -l0-15 vs. -l16-31 status as
well).

C) The non-NUMA kernel context without CPU restrictions is messy on the
right hand side of the plot, like the round-robin results were. Both
this and round-robin have a subset of the CPU activity that is analogous
to prefer:1 above, so this may not be surprising, given the prefer:1
results.

So, for now, the primary question is why prefer:0 vs. prefer:1 is not
(nearly) symmetric in the benchmark results on the right hand side of
the plots. ("prefer" with cpu restrictions provides a means of
controlling the behavior and seeing a comparison/contrast.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)