From owner-freebsd-amd64@freebsd.org Thu Sep 26 05:03:23 2019
From: Mark Millard <marklmi@yahoo.com>
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Date: Wed, 25 Sep 2019 22:03:14 -0700
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
Message-Id: <26B47782-033B-40C8-B8F8-4C731B167243@yahoo.com>
In-Reply-To: <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu> <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com> <78A4D18C-89E6-48D8-8A99-5FAC4602AE19@yahoo.com>
List-Id: Porting FreeBSD to the AMD64 platform

On 2019-Sep-25, at 20:27, Mark Millard wrote:

> On 2019-Sep-25, at 19:26, Mark Millard wrote:
>
>> On 2019-Sep-25, at 10:02, Mark Johnston wrote:
>>
>>> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>>>> Note: I have access to only one FreeBSD amd64 context, and
>>>> it is also my only access to a NUMA context: 2 memory
>>>> domains, a ThreadRipper 1950X. Also: I have only a head
>>>> FreeBSD context on any architecture, not 12.x or before.
>>>> So I have limited compare/contrast material.
>>>>
>>>> I present the below basically to ask if the NUMA handling
>>>> has been validated, or if it is going to be, at least for
>>>> contexts that might apply to the ThreadRipper 1950X and
>>>> analogous contexts. My results suggest it has not been
>>>> (or libc++'s now() times get messed up such that it looks
>>>> like NUMA mishandling, since this is based on odd benchmark
>>>> results that involve the mean time for laps, using a median
>>>> of such across multiple trials).
>>>>
>>>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>>>> 1950X and got expected results on Fedora but odd ones on
>>>> FreeBSD. The benchmark is a variation on the old HINT
>>>> benchmark, spanning the old multi-threading variation. I
>>>> later tried Fedora because the FreeBSD results looked odd.
>>>> The other architectures I tried FreeBSD benchmarking with
>>>> did not look odd like this.
>>>> (powerpc64 on an old 2-socket PowerMac with 2 cores per
>>>> socket, aarch64 Cortex-A57 Overdrive 1000, Cortex-A53
>>>> Pine64+ 2GB, armv7 Cortex-A7 Orange Pi+ 2nd Ed. For these
>>>> I used 4 threads, not more.)
>>>>
>>>> I tend to write in terms of plots made from the data instead
>>>> of the raw benchmark data.
>>>>
>>>> FreeBSD testing based on:
>>>> cpuset -l0-15 -n prefer:1
>>>> cpuset -l16-31 -n prefer:1
>>>>
>>>> Fedora 30 testing based on:
>>>> numactl --preferred 1 --cpunodebind 0
>>>> numactl --preferred 1 --cpunodebind 1
>>>>
>>>> While I have more results, I reference primarily DSIZE
>>>> and ISIZE being unsigned long long and also both being
>>>> unsigned long as examples. Variations in results are not
>>>> from the type differences for any LP64 architectures.
>>>> (But they give an idea of benchmark variability in the
>>>> test context.)
>>>>
>>>> The Fedora results solidly show the bandwidth limitation
>>>> of using one memory controller. They also show the latency
>>>> consequences for the remote-memory-domain case vs. the
>>>> local-memory-domain case. There is not a lot of
>>>> variability between the examples of the 2 type-pairs used
>>>> for Fedora.
>>>>
>>>> Not true for FreeBSD on the 1950X:
>>>>
>>>> A) The latency-constrained part of the graph looks to
>>>> normally be using the local memory domain when
>>>> -l0-15 is in use for 8 threads.
>>>>
>>>> B) Both the -l0-15 and the -l16-31 parts of the
>>>> graph for 8 threads that should be bandwidth
>>>> limited show mostly examples that would have to
>>>> involve both memory controllers for the bandwidth
>>>> to get the results shown, as far as I can tell.
>>>> There is also wide variability, ranging between the
>>>> expected 1-controller result and, say, what a
>>>> 2-controller round-robin would be expected to produce.
>>>>
>>>> C) Even the single-threaded result shows a higher
>>>> result for larger total bytes for the kernel
>>>> vectors. Fedora does not.
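[Editorial aside, not part of the original thread: the reasoning in (B) can be made concrete with a rough linear interpolation between the two reference bandwidths. This is only an illustrative sketch; the bandwidth figures used below are placeholders, not measurements from this thread.]

```python
def implied_remote_mix(observed, one_ctrl, two_ctrl):
    """Interpolate where an observed bandwidth falls between the
    expected 1-controller figure and the 2-controller round-robin
    figure.  Returns 0.0 at the 1-controller bandwidth and 1.0 at
    the 2-controller bandwidth, clamped to [0, 1]."""
    if two_ctrl == one_ctrl:
        raise ValueError("need distinct reference bandwidths")
    frac = (observed - one_ctrl) / (two_ctrl - one_ctrl)
    return min(1.0, max(0.0, frac))

# Placeholder figures (GB/s) purely for illustration:
print(implied_remote_mix(30.0, one_ctrl=20.0, two_ctrl=40.0))  # 0.5
```

An observed value near 1.0 under a prefer:1 policy would be consistent with both controllers being involved, which is what (B) describes.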
>>>>
>>>> I think that (B) is the most solid evidence for
>>>> something being odd.
>>>
>>> The implication seems to be that your benchmark program is using pages
>>> from both domains despite a policy which preferentially allocates pages
>>> from domain 1, so you would first want to determine if this is actually
>>> what's happening. As far as I know we currently don't have a good way
>>> of characterizing per-domain memory usage within a process.
>>>
>>> If your benchmark uses a large fraction of the system's memory, you
>>> could use the vm.phys_free sysctl to get a sense of how much memory from
>>> each domain is free.
>>
>> The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per memory
>> domain. I've never configured the benchmark such that it even reaches
>> 10 GiBytes on this hardware. (It stops for a time constraint first,
>> based on the values in use for the "adjustable" items.)
>>
>> . . . (much omitted material) . . .
>
>>
>>> Another possibility is to use DTrace to trace the
>>> requested domain in vm_page_alloc_domain_after(). For example, the
>>> following DTrace one-liner counts the number of pages allocated per
>>> domain by ls(1):
>>>
>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
>>> ...
>>>        0               71
>>>        1               72
>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
>>> ...
>>>        1              143
>>> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
>>> ...
>>>        0              143
>>
>> I'll think about this, although it would give no
>> information about which CPUs are executing the threads
>> that are allocating or accessing the vectors for
>> the integration kernel. So, for example, if the
>> threads migrate to or start out on CPUs they
>> should not be on, this would not report such.
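[Editorial aside, not part of the original thread: the two-column aggregation output printed by the dtrace one-liners above (domain number, page count) can be summarized with a small helper. The helper below is hypothetical glue, not anything from the thread; it only parses "domain count" lines and reports the fraction of allocations that landed outside the preferred domain.]

```python
def offdomain_fraction(agg_lines, preferred):
    """Parse 'domain count' pairs from dtrace aggregation output and
    return the fraction of pages allocated outside `preferred`."""
    counts = {}
    for line in agg_lines:
        parts = line.split()
        # Keep only lines that are exactly two integers; skip "..." etc.
        if len(parts) == 2 and all(p.isdigit() for p in parts):
            domain, count = int(parts[0]), int(parts[1])
            counts[domain] = counts.get(domain, 0) + count
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts.get(preferred, 0)) / total

# Counts like those reported for the -l0-15 -n prefer:1 run:
sample = ["0              712", "1          6737529"]
print(offdomain_fraction(sample, preferred=1))  # about 1e-4
```

A tiny but nonzero fraction, as here, still shows that some allocations ignored the prefer:1 policy.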
>>
>> For such "which CPUs" questions, one stab would
>> be simply to watch with top while the benchmark
>> is running and see which CPUs end up being busy
>> vs. which do not. I think I'll try this.
>
> Using top did not show evidence of the wrong
> CPUs being actively in use.
>
> My variation of top is unusual in that it also
> tracks some maximum observed figures and reports
> them, here being:
>
> 8804M MaxObsActive, 4228M MaxObsWired, 13G MaxObs(Act+Wir)
>
> (No swap use was reported.) This gives a system-level
> view of about how much RAM was put to use
> during the monitoring of the 2 benchmark runs
> (-l0-15 and -l16-31). Nowhere near enough was used
> to require both memory domains to be in use.
>
> Thus, it would appear to be just where the
> allocations are made for -n prefer:1 that
> matters, at least when a (temporary) thread
> does the allocations.
>
>>> This approach might not work for various reasons depending on how
>>> exactly your benchmark program works.
>
> I've not tried dtrace yet.

Well, for an example -l0-15 -n prefer:1 run for just the 8-threads benchmark case . . .

dtrace: pid 10997 has exited
        0              712
        1          6737529

Something is leading to domain 0 allocations, despite -n prefer:1.

So I tried -l16-31 -n prefer:1 and it got:

dtrace: pid 11037 has exited
        0                2
        1          8055389

(The larger number of allocations is not a surprise: more work done
in about the same overall time, based on generally faster memory
access.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went away in early 2018-Mar)