From owner-freebsd-amd64@freebsd.org Fri Sep 27 00:05:45 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard <marklmi@yahoo.com>
In-Reply-To: <20190926202936.GD5581@raichu>
Date: Thu, 26 Sep 2019 17:05:38 -0700
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Message-Id: <2DE123BE-B0F8-43F6-B950-F41CF0DEC8AD@yahoo.com>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu> <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com> <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com> <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com> <20190926202936.GD5581@raichu>
To: Mark Johnston
List-Id: Porting FreeBSD to the AMD64 platform

On 2019-Sep-26, at 13:29, Mark Johnston wrote:

> On Wed, Sep 25, 2019 at 10:03:14PM -0700, Mark Millard wrote:
>>
>> On 2019-Sep-25, at 20:27, Mark Millard wrote:
>>
>>> On 2019-Sep-25, at 19:26, Mark Millard wrote:
>>>
>>>> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>>>>
>>>>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>>>>>> Note: I have access to only one FreeBSD amd64 context, and
>>>>>> it is also my only access to a NUMA context: 2 memory
>>>>>> domains. A Threadripper 1950X context. Also: I have only
>>>>>> a head FreeBSD context on any architecture, not 12.x or
>>>>>> before. So I have limited compare/contrast material.
>>>>>>
>>>>>> I present the below basically to ask if the NUMA handling
>>>>>> has been validated, or if it is going to be, at least for
>>>>>> contexts that might apply to ThreadRipper 1950X and
>>>>>> analogous contexts. My results suggest it has not been (or
>>>>>> libc++'s now() times get messed up such that it looks like
>>>>>> NUMA mishandling, since this is based on odd benchmark
>>>>>> results that involve mean time for laps, using a median
>>>>>> of such across multiple trials).
>>>>>>
>>>>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>>>>> 1950X and got expected results on Fedora but odd ones on
>>>>>> FreeBSD. The benchmark is a variation on the old HINT
>>>>>> benchmark, spanning the old multi-threading variation. I
>>>>>> later tried Fedora because the FreeBSD results looked odd.
>>>>>> The other architectures I tried FreeBSD benchmarking with
>>>>>> did not look odd like this.
>>>>>> (powerpc64 on an old PowerMac with 2
>>>>>> sockets and 2 cores per socket, aarch64 Cortex-A57 Overdrive
>>>>>> 1000, Cortex-A53 Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd
>>>>>> Ed. For these I used 4 threads, not more.)
>>>>>>
>>>>>> I tend to write in terms of plots made from the data instead
>>>>>> of the raw benchmark data.
>>>>>>
>>>>>> FreeBSD testing based on:
>>>>>> cpuset -l0-15 -n prefer:1
>>>>>> cpuset -l16-31 -n prefer:1
>>>>>>
>>>>>> Fedora 30 testing based on:
>>>>>> numactl --preferred 1 --cpunodebind 0
>>>>>> numactl --preferred 1 --cpunodebind 1
>>>>>>
>>>>>> While I have more results, I reference primarily DSIZE
>>>>>> and ISIZE being unsigned long long and also both being
>>>>>> unsigned long as examples. Variations in results are not
>>>>>> from the type differences for any LP64 architectures.
>>>>>> (But they give an idea of benchmark variability in the
>>>>>> test context.)
>>>>>>
>>>>>> The Fedora results solidly show the bandwidth limitation
>>>>>> of using one memory controller. They also show the latency
>>>>>> consequences for the remote memory domain case vs. the
>>>>>> local memory domain case. There is not a lot of
>>>>>> variability between the examples of the 2 type-pairs used
>>>>>> for Fedora.
>>>>>>
>>>>>> Not true for FreeBSD on the 1950X:
>>>>>>
>>>>>> A) The latency-constrained part of the graph looks to
>>>>>> normally be using the local memory domain when
>>>>>> -l0-15 is in use for 8 threads.
>>>>>>
>>>>>> B) Both the -l0-15 and the -l16-31 parts of the
>>>>>> graph for 8 threads that should be bandwidth
>>>>>> limited mostly show examples that would have to
>>>>>> involve both memory controllers for the bandwidth
>>>>>> to reach the results shown, as far as I can tell.
>>>>>> There is also wide variability, ranging between the
>>>>>> expected 1-controller result and, say, what a
>>>>>> 2-controller round-robin would be expected to produce.
>>>>>>
>>>>>> C) Even the single-threaded result shows a higher
>>>>>> result for larger total bytes for the kernel
>>>>>> vectors. Fedora does not.
>>>>>>
>>>>>> I think that (B) is the most solid evidence for
>>>>>> something being odd.
>>>>>
>>>>> The implication seems to be that your benchmark program is using pages
>>>>> from both domains despite a policy which preferentially allocates pages
>>>>> from domain 1, so you would first want to determine if this is actually
>>>>> what's happening. As far as I know we currently don't have a good way
>>>>> of characterizing per-domain memory usage within a process.
>>>>>
>>>>> If your benchmark uses a large fraction of the system's memory, you
>>>>> could use the vm.phys_free sysctl to get a sense of how much memory from
>>>>> each domain is free.
>>>>
>>>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
>>>> domain. I've never configured the benchmark such that it even reaches
>>>> 10 GiBytes on this hardware. (It stops for a time constraint first,
>>>> based on the values in use for the "adjustable" items.)
>>>>
>>>> . . . (much omitted material) . . .
>>>
>>>>
>>>>> Another possibility is to use DTrace to trace the
>>>>> requested domain in vm_page_alloc_domain_after(). For example, the
>>>>> following DTrace one-liner counts the number of pages allocated per
>>>>> domain by ls(1):
>>>>>
>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>>>>> ...
>>>>>         0               71
>>>>>         1               72
>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>>>>> ...
>>>>>         1              143
>>>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>>>>> ...
>>>>>         0              143
>>>>
>>>> I'll think about this, although it would give no
>>>> information about which CPUs are executing the threads
>>>> that are allocating or accessing the vectors for
>>>> the integration kernel. So, for example, if the
>>>> threads migrate to or start out on CPUs they
>>>> should not be on, this would not report such.
>>>>
>>>> For such "which CPUs" questions one stab would
>>>> be simply to watch with top while the benchmark
>>>> is running and see which CPUs end up being busy
>>>> vs. which do not. I think I'll try this.
>>>
>>> Using top did not show evidence of the wrong
>>> CPUs being actively in use.
>>>
>>> My variation of top is unusual in that it also
>>> tracks some maximum observed figures and reports
>>> them, here being:
>>>
>>> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)
>>>
>>> (No swap use was reported.) This gives a system-
>>> level view of about how much RAM was put to use
>>> during the monitoring of the 2 benchmark runs
>>> (-l0-15 and -l16-31). Nowhere near enough was used
>>> to require both memory domains to be in use.
>>>
>>> Thus, it would appear to be just where the
>>> allocations are made for -n prefer:1 that
>>> matters, at least when a (temporary) thread
>>> does the allocations.
>>>
>>>>> This approach might not work for various reasons depending on how
>>>>> exactly your benchmark program works.
>>>
>>> I've not tried dtrace yet.
>>
>> Well, for an example -l0-15 -n prefer:1 run
>> for just the 8-thread benchmark case . . .
>>
>> dtrace: pid 10997 has exited
>>
>>         0              712
>>         1          6737529
>>
>> Something is leading to domain 0
>> allocations, despite -n prefer:1 .
>
> You can get a sense of where these allocations are occurring by changing
> the probe to capture kernel stacks for domain 0 page allocations:
>
> fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}
>
> One possibility is that these are kernel memory allocations occurring in
> the context of the benchmark threads. Such allocations may not respect
> the configured policy since they are not private to the allocating
> thread. For instance, upon opening a file, the kernel may allocate a
> vnode structure for that file. That vnode may be accessed by threads
> from many processes over its lifetime, and may be recycled many times
> before its memory is released back to the allocator.

For -l0-15 -n prefer:1 :

Looks like this reports sys_thr_new activity, sys_cpuset
activity, and 0xffffffff80bc09bd activity (whatever that
is). Mostly sys_thr_new activity, over 1300 of them . . .
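(For reference, a one-liner of roughly this form wraps the probe above
for such a run; "./benchmark" is only a placeholder name in this sketch,
not the actual invocation used:

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target) && args[2] == 0/{@[stack()] = count();}' -c "cpuset -l0-15 -n prefer:1 ./benchmark"
)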
dtrace: pid 13553 has exited

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_init+0x22
  kernel`keg_alloc_slab+0x259
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_alloc+0x23
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`cpuset_setproc+0x65
  kernel`sys_cpuset+0x123
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`uma_zfree_arg+0x36a
  kernel`thread_reap+0x106
  kernel`thread_alloc+0xf
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    6

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`uma_zfree_arg+0x36a
  kernel`vm_map_process_deferred+0x8c
  kernel`vm_map_remove+0x11d
  kernel`vmspace_exit+0xd3
  kernel`exit1+0x5a9
  kernel`0xffffffff80bc09bd
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    6

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`thread_alloc+0x23
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    22

  kernel`vm_page_grab_pages+0x1b4
  kernel`vm_thread_stack_create+0xc0
  kernel`kstack_import+0x52
  kernel`uma_zalloc_arg+0x62b
  kernel`vm_thread_new+0x4d
  kernel`thread_alloc+0x31
  kernel`thread_create+0x13a
  kernel`sys_thr_new+0xd2
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    1324

For -l16-31 -n prefer:1 :

Again, exactly 2 domain 0 allocations, both from sys_cpuset . . .

dtrace: pid 13594 has exited

  kernel`uma_small_alloc+0x61
  kernel`keg_alloc_slab+0x10b
  kernel`zone_import+0x1d2
  kernel`uma_zalloc_arg+0x62b
  kernel`cpuset_setproc+0x65
  kernel`sys_cpuset+0x123
  kernel`amd64_syscall+0x3ae
  kernel`0xffffffff811b7600
    2

>
> Given the low number of domain 0 allocations, I am skeptical that they
> are responsible for the variability in your results.
>
>> So I tried -l16-31 -n prefer:1 and it got:
>>
>> dtrace: pid 11037 has exited
>>
>>         0                2
>>         1          8055389
>>
>> (The larger number of allocations is
>> not a surprise: more work done in
>> about the same overall time based on
>> faster memory access generally.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)