Subject: Re: Periodic rant about SCHED_ULE
From: Mark Millard <marklmi@yahoo.com>
Date: Fri, 24 Mar 2023 14:17:13 -0700
To: sgk@troutmask.apl.washington.edu
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Message-Id: <374296F5-892E-48F4-858D-20E15B494AE6@yahoo.com>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Mar 24, 2023, at 13:25, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:

> On Fri, Mar 24, 2023 at 12:47:08PM -0700, Mark Millard wrote:
>> Steve Kargl wrote on
>> Date: Wed, 22 Mar 2023 19:04:06 UTC:
>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue. The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills. The path of least resistance is to use
>>> 4BSD.
>>>
>>> % cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>>    implicit none
>>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>>    integer i
>>>    real(dp) x
>>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>>    call random_init(.true., .true.)
>>>    allocate(a(n,n), b(n,n))
>>>    do i = 1, m
>>>       call random_number(a)
>>>       call random_number(b)
>>>       c = matmul(a,b)
>>>       x = sum(c)
>>>       if (x < 0) stop 'Whoops'
>>>    end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>>       42.16 real        42.04 user         0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>>    ./z &
>>>    sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid:  1709;  load averages:  4.90,  1.61,  0.79   up 0+00:56:46  11:43:01
>>> 74 processes: 10 running, 64 sleeping
>>> CPU: 99.9% user, 0.0% nice, 0.1% system, 0.0% interrupt, 0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>>   PID USERNAME  THR PRI NICE  SIZE   RES STATE  C  TIME    CPU COMMAND
>>>  1699 kargl       1  56    0   68M   35M RUN    3  0:41 92.60% z
>>>  1701 kargl       1  56    0   68M   35M RUN    0  0:41 92.33% z
>>>  1689 kargl       1  56    0   68M   35M CPU5   5  0:47 91.63% z
>>>  1691 kargl       1  56    0   68M   35M CPU0   0  0:45 89.91% z
>>>  1695 kargl       1  56    0   68M   35M CPU2   2  0:43 88.56% z
>>>  1697 kargl       1  56    0   68M   35M CPU6   6  0:42 88.48% z
>>>  1705 kargl       1  55    0   68M   35M CPU1   1  0:39 88.12% z
>>>  1703 kargl       1  56    0   68M   35M CPU4   4  0:39 87.86% z
>>>  1693 kargl       1  56    0   68M   35M CPU7   7  0:45 78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU.  All the ./z's exit
>>> after 55-ish seconds.  If you try this experiment on ULE, you'll get NCPU-1
>>> ./z's with nearly 99% CPU and 2 ./z's with something like 45-ish% as the
>>> two images ping-pong on one cpu.  Back when I was testing ULE vs 4BSD,
>>> this was/is due to ULE's cpu affinity, where processes never migrate to
>>> another cpu.  Admittedly, this was several years ago.  Maybe ULE has
>>> gotten better, but George's rant seems to suggest otherwise.
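[Sizing note, not from the original message, assuming the usual 8-byte
real(dp): each 1000x1000 matrix is 8 MB, so a, b, and c together are
about 24 MB per image, several times the 8 MB of shared L3 on the
FX-8350 mentioned below.  Each matmul is roughly 2*n**3 = 2e9
floating-point operations, so m = 200 passes is about 4e11 flops per
image, which at the 42.16 s measured above works out to roughly 9.5
Gflop/s per process.]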
>>
>> Note: I'm only beginning to explore your report/case.
>>
>> There is a significant difference between your report and
>> George's: his is tied to nice use (and I've replicated there
>> being SCHED_4BSD vs. SCHED_ULE consequences in the same
>> direction George reports, but with much larger process counts
>> involved).  In those types of experiments, without the nice
>> use I did not find notable differences.  But that is a rather
>> different context than your examples.  Thus the below as a
>> start on separate experiments closer to what you report using.
>
> Yes, I recognize George's case is different.  However,
> the common problem is ULE.  My test case shows/suggests
> that ULE is unsuitable for an HPC platform.
>
>> Not (yet) having a Fortran set up, I did some simple
>> experiments with stress --cpu N (N processes looping
>> sqrt calculations) and top.  In top I sorted by pid
>> to make it obvious if a fixed process was getting a
>> fixed CPU or WCPU.  (I tried looking at both CPU and
>> WCPU, varying the time between samples as well.  I
>> also varied stress's --backoff N.)  This was on a
>> ThreadRipper 1950X (32 hardware threads, so 16 cores)
>> that was running:
>
> You only need a numerically intensive program that runs
> for 30-45 seconds.

Well, with 32 hardware threads instead of 8, the time frames
likely need to be longer proportionally: 33 processes created
and run, with overlapping time needed.

> I use Fortran every day and wrote the
> silly example in 5 minutes.  The matrix multiplication
> of two 1000x1000 double precision matrices has two
> benefits for this synthetic benchmark.  It takes 40-ish
> seconds on my hardware (AMD FX-8350) and it blows out the
> cpu cache.

I've not checked on the caching issue for what I've done
below.  Let me know if you expect it is important to check.

>> This seems at least suggestive that, in my context, the
>> specific old behavior that you report does not show up,
>> at least on the timescales that I was observing at.  It
>> still might not be something you would find appropriate,
>> but it does appear to at least be different.
>>
>> There is the possibility that stress --cpu N leads to
>> more being involved than I expect and that such is
>> contributing to the behavior that I've observed.
>
> I can repeat the openmpi testing, but it will have to
> wait for a few weeks as I have a pressing deadline.

I'll be curious to learn what you then find.

> The openmpi program is a classic boss-worker scenario
> (and an almost perfectly parallel application with little
> communication overhead).  The boss starts and initializes the
> environment and then launches numerically intensive
> workers.  If boss + n workers > ncpu, you get a boss and
> a worker pinned to a cpu.  If the boss and worker ping-pong,
> it stalls the entire program.
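[For illustration only; this is not Steve's actual program.  Below is a
minimal sketch of the boss-worker shape he describes, assuming Open
MPI's Fortran bindings (mpif90/mpiexec), a hypothetical file name
bw.f90, and the same matmul kernel as a.f90.  Rank 0 is the boss; it
only hands out pass counts and collects one number per worker, hence
the "little communication overhead":

! bw.f90 -- hypothetical boss-worker sketch, not from the thread.
! Needs nprocs >= 2; splits m passes evenly (remainder ignored).
program bw
   use mpi
   implicit none
   integer, parameter :: n = 1000, m = 200, dp = kind(1.d0)
   integer :: rank, nprocs, ierr, w, passes
   integer :: status(MPI_STATUS_SIZE)
   real(dp) :: x
   real(dp), allocatable :: a(:,:), b(:,:), c(:,:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   if (rank == 0) then
      ! Boss: send each worker its share of the m passes ...
      do w = 1, nprocs - 1
         passes = m / (nprocs - 1)
         call MPI_Send(passes, 1, MPI_INTEGER, w, 0, MPI_COMM_WORLD, ierr)
      end do
      ! ... then collect one result per worker (the only other
      ! communication in the whole run).
      do w = 1, nprocs - 1
         call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, w, 0, &
                       MPI_COMM_WORLD, status, ierr)
      end do
   else
      ! Worker: numerically intensive, same kernel as a.f90.
      call MPI_Recv(passes, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, &
                    status, ierr)
      allocate(a(n,n), b(n,n))
      x = 0
      do w = 1, passes
         call random_number(a)
         call random_number(b)
         c = matmul(a,b)
         x = x + sum(c)
      end do
      call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
   end if
   call MPI_Finalize(ierr)
end program bw

Run as, say, "mpiexec -n 9 ./bw" on an 8-cpu box: that is the
boss + 8 workers > ncpu case above, where the mostly idle boss and one
compute-bound worker end up sharing a cpu.]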
From what I've seen, boss + 1 worker doing ping-pong at times
would not be prevented from happening for a while, but it would
not be sustained indefinitely.

> Admittedly, I last tested years ago.  ULE may have had
> improvements.

Actually, I do have a Fortran compiler: gfortran12 (automatically).
(My original search had a typo.)

I'll have to adjust the parameters for your example:

# gfortran12 -o z -O3 -march=native a.f90
# time ./z
       27.91 real        27.85 user         0.06 sys

but I've 32 hardware threads, so a loop waiting 1 sec between
launches of 33 examples would have the first ones exit before
the last ones start.  Looks like n=2000 would be sufficient:

# gfortran12 -o z -O3 -march=native a.f90
# time ./z
      211.25 real       211.06 user         0.18 sys

For 33 processes, things are as I described when I look with
the likes of:

# top -a -opid -s5

Varying the time scale to shorter allows seeing process WCPU
figures move around more between the processes.  Longer shows
less of the WCPU variability across the processes.  (As I
remember, -s defaults to 3 seconds and has a minimum of 1
second.)

Given the 32 hardware threads, I used 33 processes via:

# more runz
#! /bin/csh
#
# Launch NCPU+1 images with a 1 second delay
#
foreach d (1 2 3)
   foreach i (1 2 3 4 5 6 7 8 9 10)
      ./z &
      sleep 1
   end
end
foreach j (1 2 3)
   ./z &
   sleep 1
end

My guess is that if you end up seeing what you originally
described, some environmental difference would be involved in
why I see different behavior, something to then be tracked
down for what is different in the two contexts.

===
Mark Millard
marklmi at yahoo.com
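[Arithmetic cross-check, not from the original message, assuming the
only change to a.f90 for the n=2000 run was its parameter line (m left
at 200):

   integer, parameter :: m = 200, n = 2000, dp = kind(1.d0)

Doubling n multiplies matmul's O(n**3) work per pass by 8, and
8 x 27.91 s = 223 s, close to the 211.25 s measured, so the two timings
above are consistent with that single change.]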