Subject: Re: Periodic rant about SCHED_ULE
From: Mark Millard <marklmi@yahoo.com>
Date: Fri, 24 Mar 2023 14:17:13 -0700
To: sgk@troutmask.apl.washington.edu
Cc: FreeBSD Hackers <freebsd-hackers@freebsd.org>
Message-Id: <374296F5-892E-48F4-858D-20E15B494AE6@yahoo.com>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On Mar 24, 2023, at 13:25, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote:

> On Fri, Mar 24, 2023 at 12:47:08PM -0700, Mark Millard wrote:
>> Steve Kargl wrote on
>> Date: Wed, 22 Mar 2023 19:04:06 UTC:
>>
>>> I reported the issue with ULE some 15 to 20 years ago.
>>> I gave up reporting the issue. The individuals with the
>>> requisite skills to hack on ULE did not; and yes, I lack
>>> those skills. The path of least resistance is to use
>>> 4BSD.
>>>
>>> % cat a.f90
>>> !
>>> ! Silly numerically intensive computation.
>>> !
>>> program foo
>>>    implicit none
>>>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>>>    integer i
>>>    real(dp) x
>>>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>>>    call random_init(.true., .true.)
>>>    allocate(a(n,n), b(n,n))
>>>    do i = 1, m
>>>       call random_number(a)
>>>       call random_number(b)
>>>       c = matmul(a,b)
>>>       x = sum(c)
>>>       if (x < 0) stop 'Whoops'
>>>    end do
>>> end program foo
>>> % gfortran11 -o z -O3 -march=native a.f90
>>> % time ./z
>>>       42.16 real        42.04 user         0.09 sys
>>> % cat foo
>>> #! /bin/csh
>>> #
>>> # Launch NCPU+1 images with a 1 second delay
>>> #
>>> foreach i (1 2 3 4 5 6 7 8 9)
>>>    ./z &
>>>    sleep 1
>>> end
>>> % ./foo
>>>
>>> In another xterm, you can watch the 9 images.
>>>
>>> % top
>>> last pid:  1709;  load averages:  4.90,  1.61,  0.79   up 0+00:56:46  11:43:01
>>> 74 processes: 10 running, 64 sleeping
>>> CPU: 99.9% user, 0.0% nice, 0.1% system, 0.0% interrupt, 0.0% idle
>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
>>> Swap: 16G Total, 16G Free
>>>
>>>   PID USERNAME  THR PRI NICE  SIZE   RES STATE  C  TIME    CPU COMMAND
>>>  1699 kargl       1  56    0   68M   35M RUN    3  0:41 92.60% z
>>>  1701 kargl       1  56    0   68M   35M RUN    0  0:41 92.33% z
>>>  1689 kargl       1  56    0   68M   35M CPU5   5  0:47 91.63% z
>>>  1691 kargl       1  56    0   68M   35M CPU0   0  0:45 89.91% z
>>>  1695 kargl       1  56    0   68M   35M CPU2   2  0:43 88.56% z
>>>  1697 kargl       1  56    0   68M   35M CPU6   6  0:42 88.48% z
>>>  1705 kargl       1  55    0   68M   35M CPU1   1  0:39 88.12% z
>>>  1703 kargl       1  56    0   68M   35M CPU4   4  0:39 87.86% z
>>>  1693 kargl       1  56    0   68M   35M CPU7   7  0:45 78.12% z
>>>
>>> With 4BSD, you see the ./z's with 80% or greater CPU.  All the ./z's exit
>>> after 55-ish seconds.  If you try this experiment on ULE, you'll get NCPU-1
>>> ./z's with nearly 99% CPU and 2 ./z's with something like 45-ish% as the
>>> two images ping-pong on one cpu.  Back when I was testing ULE vs 4BSD,
>>> this was/is due to ULE's cpu affinity, where processes never migrate to
>>> another cpu.  Admittedly, this was several years ago.  Maybe ULE has
>>> gotten better, but George's rant seems to suggest otherwise.
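[Sizing note, not from the original message, assuming the usual 8-byte
real(dp): each 1000x1000 matrix is 8 MB, so a, b, and c together are
about 24 MB per image, several times the 8 MB of shared L3 on the
FX-8350 mentioned below.  Each matmul is roughly 2*n**3 = 2e9
floating-point operations, so m = 200 passes is about 4e11 flops per
image, which at the 42.16 s measured above works out to roughly 9.5
Gflop/s per process.]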
>>
>> Note: I'm only beginning to explore your report/case.
>>
>> There is a significant difference between your report and
>> George's: his is tied to nice use (and I've replicated there
>> being SCHED_4BSD vs. SCHED_ULE consequences in the same
>> direction George reports, but with much larger process counts
>> involved).  In those types of experiments, without the nice
>> use I did not find notable differences.  But that is a rather
>> different context than your examples.  Thus the below as a
>> start on separate experiments closer to what you report using.
>
> Yes, I recognize George's case is different.  However,
> the common problem is ULE.  My test case shows/suggests
> that ULE is unsuitable for an HPC platform.
>
>> Not (yet) having a Fortran set up, I did some simple
>> experiments with stress --cpu N (N processes looping
>> sqrt calculations) and top.  In top I sorted by pid
>> to make it obvious if a fixed process was getting a
>> fixed CPU or WCPU.  (I tried looking at both CPU and
>> WCPU, varying the time between samples as well.  I
>> also varied stress's --backoff N.)  This was on a
>> ThreadRipper 1950X (32 hardware threads, so 16 cores)
>> that was running:
>
> You only need a numerically intensive program that runs
> for 30-45 seconds.

Well, with 32 hardware threads instead of 8, the time frames
likely need to be longer proportionally: 33 processes created
and run, with overlapping time needed.

> I use Fortran every day and wrote the
> silly example in 5 minutes.  The matrix multiplication
> of two 1000x1000 double precision matrices has two
> benefits for this synthetic benchmark.  It takes 40-ish
> seconds on my hardware (AMD FX-8350) and it blows out the
> cpu cache.

I've not checked on the caching issue for what I've done
below.  Let me know if you expect it is important to check.

>> This seems at least suggestive that, in my context, the
>> specific old behavior that you report does not show up,
>> at least on the timescales that I was observing at.  It
>> still might not be something you would find appropriate,
>> but it does appear to at least be different.
>>
>> There is the possibility that stress --cpu N leads to
>> more being involved than I expect and that such is
>> contributing to the behavior that I've observed.
>
> I can repeat the openmpi testing, but it will have to
> wait for a few weeks as I have a pressing deadline.

I'll be curious to learn what you then find.

> The openmpi program is a classic boss-worker scenario
> (and an almost perfectly parallel application with little
> communication overhead).  The boss starts and initializes the
> environment and then launches numerically intensive
> workers.  If boss + n workers > ncpu, you get a boss and
> a worker pinned to a cpu.  If the boss and worker ping-pong,
> it stalls the entire program.
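[For illustration only; this is not Steve's actual program.  Below is a
minimal sketch of the boss-worker shape he describes, assuming Open
MPI's Fortran bindings (mpif90/mpiexec), a hypothetical file name
bw.f90, and the same matmul kernel as a.f90.  Rank 0 is the boss; it
only hands out pass counts and collects one number per worker, hence
the "little communication overhead":

! bw.f90 -- hypothetical boss-worker sketch, not from the thread.
! Needs nprocs >= 2; splits m passes evenly (remainder ignored).
program bw
   use mpi
   implicit none
   integer, parameter :: n = 1000, m = 200, dp = kind(1.d0)
   integer :: rank, nprocs, ierr, w, passes
   integer :: status(MPI_STATUS_SIZE)
   real(dp) :: x
   real(dp), allocatable :: a(:,:), b(:,:), c(:,:)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
   if (rank == 0) then
      ! Boss: send each worker its share of the m passes ...
      do w = 1, nprocs - 1
         passes = m / (nprocs - 1)
         call MPI_Send(passes, 1, MPI_INTEGER, w, 0, MPI_COMM_WORLD, ierr)
      end do
      ! ... then collect one result per worker (the only other
      ! communication in the whole run).
      do w = 1, nprocs - 1
         call MPI_Recv(x, 1, MPI_DOUBLE_PRECISION, w, 0, &
                       MPI_COMM_WORLD, status, ierr)
      end do
   else
      ! Worker: numerically intensive, same kernel as a.f90.
      call MPI_Recv(passes, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, &
                    status, ierr)
      allocate(a(n,n), b(n,n))
      x = 0
      do w = 1, passes
         call random_number(a)
         call random_number(b)
         c = matmul(a,b)
         x = x + sum(c)
      end do
      call MPI_Send(x, 1, MPI_DOUBLE_PRECISION, 0, 0, MPI_COMM_WORLD, ierr)
   end if
   call MPI_Finalize(ierr)
end program bw

Run as, say, "mpiexec -n 9 ./bw" on an 8-cpu box: that is the
boss + 8 workers > ncpu case above, where the mostly idle boss and one
compute-bound worker end up sharing a cpu.]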
From what I've seen, boss + 1 worker doing ping-pong at times
would not be prevented from happening for a while, but it would
not be sustained indefinitely.

> Admittedly, I last tested years ago.  ULE may have had
> improvements.

Actually, I do have a Fortran compiler: gfortran12 (automatically).
(My original search had a typo.)

I'll have to adjust the parameters for your example:

# gfortran12 -o z -O3 -march=native a.f90
# time ./z
       27.91 real        27.85 user         0.06 sys

but I've 32 hardware threads, so a loop waiting 1 sec between
launches of 33 examples would have the first ones exit before
the last ones start.  Looks like n=2000 would be sufficient:

# gfortran12 -o z -O3 -march=native a.f90
# time ./z
      211.25 real       211.06 user         0.18 sys

For 33 processes, things are as I described when I look with
the likes of:

# top -a -opid -s5

Varying the time scale to shorter allows seeing process WCPU
figures move around more between the processes.  Longer shows
less of the WCPU variability across the processes.  (As I
remember, -s defaults to 3 seconds and has a minimum of 1
second.)

Given the 32 hardware threads, I used 33 processes via:

# more runz
#! /bin/csh
#
# Launch NCPU+1 images with a 1 second delay
#
foreach d (1 2 3)
   foreach i (1 2 3 4 5 6 7 8 9 10)
      ./z &
      sleep 1
   end
end
foreach j (1 2 3)
   ./z &
   sleep 1
end

My guess is that if you end up seeing what you originally
described, some environmental difference would be involved in
why I see different behavior, something to then be tracked
down for what is different in the two contexts.

===
Mark Millard
marklmi at yahoo.com
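[Arithmetic cross-check, not from the original message, assuming the
only change to a.f90 for the n=2000 run was its parameter line (m left
at 200):

   integer, parameter :: m = 200, n = 2000, dp = kind(1.d0)

Doubling n multiplies matmul's O(n**3) work per pass by 8, and
8 x 27.91 s = 223 s, close to the 211.25 s measured, so the two timings
above are consistent with that single change.]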