From owner-freebsd-amd64@freebsd.org Thu Sep 26 02:26:54 2019
Subject: Re: head -r352341 example context on ThreadRipper 1950X: cpuset -n prefer:1 with -l 0-15 vs. -l 16-31 odd performance?
From: Mark Millard <marklmi@yahoo.com>
Date: Wed, 25 Sep 2019 19:26:46 -0700
To: Mark Johnston
Cc: freebsd-amd64@freebsd.org, freebsd-hackers@freebsd.org
Message-Id: <4F565B02-DC0D-4011-8266-D38E02788DD5@yahoo.com>
In-Reply-To: <20190925170255.GA43643@raichu>
References: <704D4CE4-865E-4C3C-A64E-9562F4D9FC4E@yahoo.com> <20190925170255.GA43643@raichu>

On 2019-Sep-25, at 10:02, Mark Johnston wrote:

> On Mon, Sep 23, 2019 at 01:28:15PM -0700, Mark Millard via freebsd-amd64 wrote:
>> Note: I have access to only one FreeBSD amd64 context, and
>> it is also my only access to a NUMA context: 2 memory
>> domains (a Threadripper 1950X context). Also: I have only
>> a head FreeBSD context on any architecture, not 12.x or
>> before. So I have limited compare/contrast material.
>>
>> I present the below basically to ask if the NUMA handling
>> has been validated, or if it is going to be, at least for
>> contexts that might apply to the ThreadRipper 1950X and
>> analogous contexts. My results suggest it has not been (or
>> libc++'s now() times get messed up such that it looks like
>> NUMA mishandling, since this is based on odd benchmark
>> results that involve the mean time for laps, using a median
>> of such across multiple trials).
>>
>> I ran a benchmark on both Fedora 30 and FreeBSD 13 on this
>> 1950X and got expected results on Fedora but odd ones on
>> FreeBSD. The benchmark is a variation on the old HINT
>> benchmark, spanning the old multi-threading variation. I
>> tried Fedora later because the FreeBSD results looked odd.
>> The other architectures I tried FreeBSD benchmarking on
>> did not look odd like this. (powerpc64 on an old 2-socket
>> PowerMac with 2 cores per socket, aarch64 Cortex-A57
>> Overdrive 1000, Cortex-A53 Pine64+ 2GB, and armv7 Cortex-A7
>> Orange Pi+ 2nd Ed. For these I used 4 threads, not more.)
>>
>> I tend to write in terms of plots made from the data instead
>> of the raw benchmark data.
>>
>> FreeBSD testing based on:
>> cpuset -l0-15  -n prefer:1
>> cpuset -l16-31 -n prefer:1
>>
>> Fedora 30 testing based on:
>> numactl --preferred 1 --cpunodebind 0
>> numactl --preferred 1 --cpunodebind 1
>>
>> While I have more results, I reference primarily DSIZE
>> and ISIZE being unsigned long long and also both being
>> unsigned long as examples. Variations in results are not
>> from the type differences for any LP64 architectures.
>> (But they give an idea of benchmark variability in the
>> test context.)
>>
>> The Fedora results solidly show the bandwidth limitation
>> of using one memory controller. They also show the latency
>> consequences for the remote-memory-domain case vs. the
>> local-memory-domain case. There is not a lot of
>> variability between the examples of the 2 type-pairs used
>> for Fedora.
>>
>> Not true for FreeBSD on the 1950X:
>>
>> A) The latency-constrained part of the graph looks to
>>    normally be using the local memory domain when
>>    -l0-15 is in use for 8 threads.
>>
>> B) Both the -l0-15 and the -l16-31 parts of the
>>    graph for 8 threads that should be bandwidth
>>    limited show mostly examples that would have to
>>    involve both memory controllers for the bandwidth
>>    to get the results shown, as far as I can tell.
>>    There is also wide variability, ranging between the
>>    expected 1-controller result and, say, what a
>>    2-controller round-robin would be expected to produce.
>>
>> C) Even the single-threaded result shows a higher
>>    result for larger total bytes for the kernel
>>    vectors. Fedora does not.
>>
>> I think that (B) is the most solid evidence for
>> something being odd.
>
> The implication seems to be that your benchmark program is using pages
> from both domains despite a policy which preferentially allocates pages
> from domain 1, so you would first want to determine if this is actually
> what's happening. As far as I know we currently don't have a good way
> of characterizing per-domain memory usage within a process.
>
> If your benchmark uses a large fraction of the system's memory, you
> could use the vm.phys_free sysctl to get a sense of how much memory from
> each domain is free.

The ThreadRipper 1950X has 96 GiBytes of ECC RAM, so 48 GiBytes per
memory domain. I've never configured the benchmark such that it even
reaches 10 GiBytes on this hardware. (It stops for a time constraint
first, based on the values in use for the "adjustable" items.)

The benchmark runs the Hierarchical INTegration kernel for a sequence
of larger and larger numbers of cells in the grid that it uses. Each
size is run in isolation before the next is tried, and each gets its
own timings. Each size gets its own kernel vector allocations (and
deallocations), with the trials, and the laps within a trial, reusing
the same memory. Each lap in each trial gets its own thread creations
(and completions). The main thread combines the results when there
are multiple threads involved. (So I'm not sure of the main thread's
behavior relative to the cpuset commands.)

Thus, there are lots of thread creations overall, as well as lots of
allocations of vectors for use in the integration kernel code.

What it looks like to me is that std::async's internal thread
creations are not respecting the cpuset command settings: in a sense,
not inheriting the cpuset information correctly (or such is being
ignored).
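One cross-check from outside the process would be to look at the
masks the threads actually end up with while a run is in progress.
A rough sketch of what I have in mind (the benchmark name below is
just a placeholder, and the std::async-created threads are
short-lived, so catching one takes some timing or repetition):

# Benchmark already started via, for example:
#   cpuset -l16-31 -n prefer:1 <benchmark command>
pid=$(pgrep -n <benchmark>)  # <benchmark> is a placeholder name
procstat -t $pid             # list the threads and their TIDs
cpuset -g -p $pid            # CPU mask for the process overall
cpuset -g -t <tid>           # CPU mask for one created thread

If the created threads were to show the full 0-31 mask while the
initial thread shows only the 16-31 (or 0-15) restriction, that would
match what the odd figures suggest.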
For reference, the following shows the std::async use for the
multi-threaded case. Normal builds plug in no-op code for:

RestrictThreadToCpu(. . ., . . .); // if built for such

(intended as a hook for potential experiments that cpuset cannot set
up for multi-threaded use). I make this point because the call shows
up below but it is not doing anything here.

One std::async use is where the kernel vector memory allocations are
done:

for (HwConcurrencyCount thread{0u}; thread<threads; thread++)
{
    auto alloc_thread
        { std::async( std::launch::async
                    , [&]()
                      {
                          RestrictThreadToCpu(. . ., . . .); // no-op here
                          threads_kvs.emplace_back(KernelVectors<DSIZE,ISIZE>{memry});
                      }
                    )
        };
    alloc_thread.wait();
}

So the main thread is not doing the allocations: created, temporary
threads are.

As for running the trials and laps of the integration kernel for a
given size of grid, each lap creates its own threads:

for ( auto trials_left{NTRIAL}; 0u<trials_left; trials_left-- )
{
    . . .
    KernelResults result{};
    for ( auto lap_count_down{laps}; 0u<lap_count_down; lap_count_down-- )
    {
        std::vector<std::future<KernelResults>> in_parallel;
        for (HwConcurrencyCount thread{0u}; thread<threads; thread++)
        {
            in_parallel.emplace_back
                ( std::async( std::launch::async
                            , [](. . .) // runs the integration kernel
                              {
                                  RestrictThreadToCpu(. . ., . . .); // no-op here
                                  . . .
                              }
                            , thread
                            , ki
                            , threads_kvs[thread]
                            )
                );
        }

        KernelResults lap_result{};

        for(auto& thread : in_parallel) { lap_result.Merge(thread.get()); }

        result= lap_result; // Using the last lap's result
    }

    auto const finish{clock_info.Now()};

    . . . (process the measurement, no threading) . . .
}

Based on such and each cpuset command that I reported, I'd not expect
any variability in which domains the memory allocations are made
from, or in which domain's CPUs are doing the memory accesses.

For reference, this is how the kernel vectors are structured:

template<typename DSIZE, typename ISIZE>
struct KernelVectors
{
    using RECTVector = std::vector<RECT<DSIZE>>;
    using ErrsVector = std::vector<DSIZE>;
    using IxesVector = std::vector<ISIZE>; // Holds indexes into rect and errs.

    RECTVector rect;
    ErrsVector errs;
    IxesVector ixes; // indexes into rect and errs.

    KernelVectors() = delete;

    KernelVectors(ISIZE memry) : rect(memry), errs(memry*2), ixes(memry*2) {}

    . . . (irrelevant methods omitted) . . .
}; // KernelVectors

with (irrelevant comment lines eliminated):

template<typename DSIZE>
struct RECT
{
    DSIZE ahi, // Upper bound via rectangle areas for scx by scy breakdown
          alo, // Lower bound via rectangle areas for scx by scy breakdown
          dx,  // Interval widths, SEE NOTES BELOW.
          flh, // Function values of left coordinates, high
          fll, // Function values of left coordinates, low
          frh, // Function values of right coordinates, high
          frl, // Function values of right coordinates, low
          xl,  // Left x-coordinates of subintervals
          xr;  // Right x-coordinates of subintervals
}; // RECT

Even the single-threaded integration kernel case executes the kernel
vector memory allocation step and the trials and laps via std::async
instead of using the main thread for such.

Note: The original HINT's copyright holder, Iowa State University
Research Foundation, Inc., indicated that HINT was intended to be
licensed via GPLv2 (not earlier and not later), despite how it was
(inappropriately) distributed without indicating which GPL version
back then. Thus, overall, this variation on HINT is also GPLv2-only
in order to respect the original intent.

> Another possibility is to use DTrace to trace the
> requested domain in vm_page_alloc_domain_after(). For example, the
> following DTrace one-liner counts the number of pages allocated per
> domain by ls(1):
>
> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n rr ls"
> ...
>         0               71
>         1               72
> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:1 ls"
> ...
>         1              143
> # dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' -c "cpuset -n prefer:0 ls"
> ...
>         0              143

I'll think about this, although it would give no information about
which CPUs are executing the threads that are allocating or accessing
the vectors for the integration kernel. So, for example, if the
threads migrate to, or start out on, CPUs they should not be on, this
would not report such.
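If I do try the DTrace route, presumably it would be the same
one-liner with the benchmark run substituted for "cpuset -n rr ls",
something like the following (the benchmark command is a placeholder,
not the exact invocation):

# dtrace -n 'fbt::vm_page_alloc_domain_after:entry /progenyof($target)/{@[args[2]] = count();}' \
    -c "cpuset -l16-31 -n prefer:1 <benchmark command>"

A large domain 0 count from that, for a prefer:1 run, would point at
the allocations not following the policy.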
For such "which CPUs" questions, one stab would be simply to watch
with top while the benchmark is running and see which CPUs end up
being busy vs. which do not. I think I'll try this.

> This approach might not work for various reasons depending on how
> exactly your benchmark program works.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)