From: Mateusz Guzik <mjguzik@gmail.com>
Date: Wed, 22 Mar 2023 20:23:42 +0100
Subject: Re: Periodic rant about SCHED_ULE
To: sgk@troutmask.apl.washington.edu
Cc: Matthias Andree, freebsd-hackers@freebsd.org
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

On 3/22/23, Steve Kargl wrote:
> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote:
>>
>> Yes, there are reports that FreeBSD is not responsive by default - but
>> this may make it get overall better throughput at the expense of
>> responsiveness, because it might be doing fewer context switches.  So
>> just complaining about a longer buildworld without seeing how much
>> dnetc did in the same wallclock time period is useless.  Periodic
>> rants don't fix this lack of information.
>>
> I reported the issue with ULE some 15 to 20 years ago.
> I gave up reporting the issue.  The individuals with the
> requisite skills to hack on ULE did not; and yes, I lack
> those skills.  The path of least resistance is to use
> 4BSD.
>
> % cat a.f90
> !
> ! Silly numerically intensive computation.
> !
> program foo
>    implicit none
>    integer, parameter :: m = 200, n = 1000, dp = kind(1.d0)
>    integer i
>    real(dp) x
>    real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
>    call random_init(.true., .true.)
>    allocate(a(n,n), b(n,n))
>    do i = 1, m
>       call random_number(a)
>       call random_number(b)
>       c = matmul(a,b)
>       x = sum(c)
>       if (x < 0) stop 'Whoops'
>    end do
> end program foo
> % gfortran11 -o z -O3 -march=native a.f90
> % time ./z
>       42.16 real        42.04 user         0.09 sys
> % cat foo
> #! /bin/csh
> #
> # Launch NCPU+1 images with a 1 second delay
> #
> foreach i (1 2 3 4 5 6 7 8 9)
>    ./z &
>    sleep 1
> end
> % ./foo
>
> In another xterm, you can watch the 9 images.
>
> % top
> last pid:  1709;  load averages:  4.90,  1.61,  0.79   up 0+00:56:46  11:43:01
> 74 processes: 10 running, 64 sleeping
> CPU: 99.9% user,  0.0% nice,  0.1% system,  0.0% interrupt,  0.0% idle
> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G Free
> Swap: 16G Total, 16G Free
>
>   PID USERNAME  THR PRI NICE   SIZE   RES STATE  C   TIME    CPU COMMAND
>  1699 kargl       1  56    0    68M   35M RUN    3   0:41 92.60% z
>  1701 kargl       1  56    0    68M   35M RUN    0   0:41 92.33% z
>  1689 kargl       1  56    0    68M   35M CPU5   5   0:47 91.63% z
>  1691 kargl       1  56    0    68M   35M CPU0   0   0:45 89.91% z
>  1695 kargl       1  56    0    68M   35M CPU2   2   0:43 88.56% z
>  1697 kargl       1  56    0    68M   35M CPU6   6   0:42 88.48% z
>  1705 kargl       1  55    0    68M   35M CPU1   1   0:39 88.12% z
>  1703 kargl       1  56    0    68M   35M CPU4   4   0:39 87.86% z
>  1693 kargl       1  56    0    68M   35M CPU7   7   0:45 78.12% z
>
> With 4BSD, you see the ./z's with 80% or greater CPU.  All the ./z's
> exit after 55-ish seconds.  If you try this experiment on ULE, you'll
> get NCPU-1 ./z's with nearly 99% CPU and 2 ./z's with something like
> 45-ish% as the two images ping-pong on one cpu.  Back when I was
> testing ULE vs 4BSD, this was/is due to ULE's cpu affinity where
> processes never migrate to another cpu.  Admittedly, this was several
> years ago.  Maybe ULE has gotten better, but George's rant seems to
> suggest otherwise.
>

While I have not tried openmpi yet, I can confirm there is a problem
here, but the situation is not as clear cut as one might think.

sched_4bsd *round robins* all workers across all CPUs, which comes at a
performance *hit* compared to ULE when the number of workers is <= the
CPU count -- there ULE maintains affinity, avoiding cache busting.  If
you slap in $cpu_count + 1 workers, 4BSD keeps round-robinning and
penalizes everyone equally, while ULE mostly penalizes a select victim.
By the end of it you get lower total CPU time spent, but higher total
real time.  See below for an example.
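As a rough back-of-the-envelope illustration of that tradeoff (T here is
an assumed per-worker CPU requirement, not a number from the test):
with 9 workers sharing 8 cores under perfectly fair round-robin, each
worker gets 8/9 of a core, so every worker finishes at about 9T/8,
roughly 12% later than T.  Under strict pinning, 7 workers finish at
about T while the 2 stuck sharing one core take up to about 2T, so the
batch as a whole ends later even though the unmigrated workers keep
their caches warm and burn a bit less CPU in total.  The measured
numbers below land between these two extremes, which suggests ULE is
not pinning strictly in this test.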
Two massive problems with 4BSD, apart from the mandatory round robin
which also happens to help by accident:

1. it has one *global* lock, meaning the scheduler itself just does not
   scale, and this is visible at modest contemporary scales
2. it does not understand topology -- no accounting is done for HT nor
   NUMA, but I concede the latter won't be a factor for most people

That said, ULE definitely has performance bugs which need to be fixed.
At least for the case below, 4BSD just "lucked" into sucking less simply
because it is so primitive.

I wrote a CPU-burning program (memset 1 MB in a loop, with enough
iterations to take ~20 seconds on its own).  I booted an 8 core bhyve
VM, where I made sure to cpuset it to 8 distinct cores.  The test runs
*9* workers; here is a sample run:

4bsd:
       23.18 real        20.81 user         0.00 sys
       23.26 real        20.81 user         0.00 sys
       23.30 real        20.81 user         0.00 sys
       23.34 real        20.82 user         0.00 sys
       23.41 real        20.81 user         0.00 sys
       23.41 real        20.80 user         0.00 sys
       23.42 real        20.80 user         0.00 sys
       23.53 real        20.81 user         0.00 sys
       23.60 real        20.80 user         0.00 sys
187.31s user 0.02s system 793% cpu 23.606 total

ule:
       20.67 real        20.04 user         0.00 sys
       20.97 real        20.00 user         0.00 sys
       21.45 real        20.29 user         0.00 sys
       21.51 real        20.22 user         0.00 sys
       22.77 real        20.04 user         0.00 sys
       22.78 real        20.26 user         0.00 sys
       23.42 real        20.04 user         0.00 sys
       24.07 real        20.30 user         0.00 sys
       24.46 real        20.16 user         0.00 sys
181.41s user 0.07s system 741% cpu 24.465 total

It reliably uses 187s of user time on 4BSD and 181s on ULE.  At the same
time, there is reliably a massive imbalance in total real time between
the fastest and slowest workers, *and* ULE reliably uses more total real
time to finish the entire thing.  In other words this is a tradeoff, but
most likely a bad one.

-- 
Mateusz Guzik
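The burner program itself is not included in the mail; a minimal sketch
of the kind of program described (memset over a 1 MB buffer in a loop)
might look like the following.  The buffer size matches the description;
the ITERATIONS value is a placeholder that would have to be tuned so a
single copy runs for roughly 20 seconds on the machine under test, and
the file name burn.c is purely illustrative.

/* burn.c -- sketch of a CPU burner: memset a 1 MB buffer in a loop. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE   (1024UL * 1024UL)   /* 1 MB working set, as described */
#define ITERATIONS 400000UL            /* placeholder: tune for ~20 s/run */

int
main(void)
{
	unsigned char *buf;
	unsigned long i;

	buf = malloc(BUF_SIZE);
	if (buf == NULL)
		return (1);
	for (i = 0; i < ITERATIONS; i++)
		memset(buf, (int)(i & 0xff), BUF_SIZE);
	/* Touch the buffer so the stores cannot be optimized away. */
	printf("%u\n", (unsigned)buf[0]);
	free(buf);
	return (0);
}

Nine copies could then be built with cc -O2 -o burn burn.c and launched
in the background much like the csh script quoted above, each under
time(1); confining the VM to eight distinct host cores would be done
along the lines of cpuset -l 0-7 applied to the bhyve process, though
the exact invocation used is not shown in the mail.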