From nobody Tue Apr 18 07:25:47 2023
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4Q0wZ140z3z45ttW
	for <freebsd-hackers@mlmmj.nyi.freebsd.org>; Tue, 18 Apr 2023 07:30:33 +0000 (UTC)
	(envelope-from fernando.apesteguia@gmail.com)
Received: from mail-lj1-x231.google.com (mail-lj1-x231.google.com [IPv6:2a00:1450:4864:20::231])
	(using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (2048 bits) client-digest SHA256)
	(Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4Q0wZ01M7gz3wcr;
	Tue, 18 Apr 2023 07:30:32 +0000 (UTC)
	(envelope-from fernando.apesteguia@gmail.com)
Authentication-Results: mx1.freebsd.org;
	none
Received: by mail-lj1-x231.google.com with SMTP id 38308e7fff4ca-2a8b3ecf59fso18257691fa.0;
        Tue, 18 Apr 2023 00:30:32 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20221208; t=1681803030; x=1684395030;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:from:to:cc:subject:date:message-id:reply-to;
        bh=TzFq95rTeNbNXfwnenlgx43+6cyjHBiD9Ek0/8yVYHQ=;
        b=p44gcZDVsSaZ9g8RJ66QPebERo7QWD53YJofjepFkWpNuajvpHTOA12cFOizY8oWsb
         vuSsPnZxNFPYtaTCpe0IspWOEzc5eVdwio4Gjp9l+2CYvZAadVpE0bKfkrQkKUyjIecW
         wx9Kzi8VLUeQfqV8GI+kPJxV9Mr/lvI/2KLLsqN5ZSS0vpuF9BQM0dJ0D6r7ucPs3O+C
         nJ4vtP0coYFNfzYCW+yz/kFC5/kkJEOAL3fo5xh7322QzfKoZzrbTQ4qyGTdFFZRDXcO
         awwIcxu/Xz2rsRjBRy++qp/2TqMmIZp+5r8Kpiphy5gb/W0niCm9HJq53RsL2Vhfxm+/
         Qgaw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1681803030; x=1684395030;
        h=cc:to:subject:message-id:date:from:in-reply-to:references
         :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id
         :reply-to;
        bh=TzFq95rTeNbNXfwnenlgx43+6cyjHBiD9Ek0/8yVYHQ=;
        b=VDRu0SPAh8Vc3+nCfGyT4eEIR1WozRhEW37dvy47AJgGAFBxESv6cVgS55BFRSlhYg
         jbgkqD0u+p+zxEiOsN7KYy6++ypewQDxPrvL60tThBMLRxJ9i05vIvDsYAVZKxv2qpkP
         oNhcxIsXDN3f4YCWkpxhUsNphfE/BXj4CEg4JIbdAjFQ9gr8hyF4AdTJPnS6Bbhy+ev7
         vbpP/uZ3l6nJsmK37alFYwXIAd5sDhscaouELVDL/d3fCtrMFQnHsytW0Bp8hQGeaolo
         Au9PzH1raKOV7Nr4YIbnDABat/xmnZBgkZczJrpp33x7Nn4n/um0I3DuDRqEzShl0rq4
         95BQ==
X-Gm-Message-State: AAQBX9esNUrN2j46r0mpXRVWgoRcioeue9KW8gxCm/Zu6dotLwTb+0ek
	rRoAy3PV0HcJRgv6laYT+wAu6YEiRdnAt3DuIjEnaTwb
X-Google-Smtp-Source: AKy350auzd8lDupW7KjmuKnYEWfU1+6zhdwzGcT0eAt0ECmmYqbz7T+TwvOtg8LV9EuXW1n9sPJWbF2H/FjD0NStitY=
X-Received: by 2002:a19:c20d:0:b0:4ed:b22a:da25 with SMTP id
 l13-20020a19c20d000000b004edb22ada25mr2884902lfc.11.1681803029747; Tue, 18
 Apr 2023 00:30:29 -0700 (PDT)
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
References: <c3f5f667-ba0b-c40c-b8a6-19d1c9c63c5f@FreeBSD.org>
 <ZBtRJhNHluj5Nzyk@troutmask.apl.washington.edu> <CAGudoHEj+koaYhkjzDE5KX9OsCno=X5M_E3z9uwg6Pg7dtqTsA@mail.gmail.com>
 <CAGudoHHxTT-Cn11zcFB3ZwF76UcRUv=QS28RLgzd=hVehTy0Kg@mail.gmail.com>
 <CAGudoHGoh30O-3O0jjwevDvP43-ykUt6JUDiwRNW918VZfybhA@mail.gmail.com>
 <CAGudoHEWfy61XSMhXdYOrKWVotuC0Kc6NSWiaaZCy6aQhbvXoQ@mail.gmail.com>
 <CAGudoHFPqz_LtsVNnz4P2gyKXz5Z8hU+v6QYGizm4+DtZRn8Yg@mail.gmail.com>
 <CAGudoHGzBjXjXZFs+qZJUS-M6VeX5=LB2ifRLP7hFBZXPvqP7g@mail.gmail.com>
 <ZCXsPWyIVmxvvHjE@nuc> <CAGudoHGaQxseby2Nc2_57HZ1ZLOwWSyrmZ_eUx15jLCm7znnsw@mail.gmail.com>
 <ZCcunLvPPPwhRjpe@framework> <CAGudoHF40zDwhaeO6-G7BHSzxJJ5ej3G490gpb06yd=OZ2do6A@mail.gmail.com>
In-Reply-To: <CAGudoHF40zDwhaeO6-G7BHSzxJJ5ej3G490gpb06yd=OZ2do6A@mail.gmail.com>
From: =?UTF-8?Q?Fernando_Apestegu=C3=ADa?= <fernando.apesteguia@gmail.com>
Date: Tue, 18 Apr 2023 09:25:47 +0200
Message-ID: <CAGwOe2apOhDJj_WYfkq5osohtkFrBgdks0B2J0wdiNFBYwSGsA@mail.gmail.com>
Subject: Re: Periodic rant about SCHED_ULE
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Mark Johnston <markj@freebsd.org>, freebsd-hackers@freebsd.org
Content-Type: multipart/alternative; boundary="0000000000002b9b9505f99748f9"
X-Rspamd-Queue-Id: 4Q0wZ01M7gz3wcr
X-Spamd-Bar: ----
X-Spamd-Result: default: False [-4.00 / 15.00];
	REPLY(-4.00)[];
	TAGGED_FROM(0.00)[];
	ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US]
X-Rspamd-Pre-Result: action=no action;
	module=replies;
	Message is reply to one we originated
X-ThisMailContainsUnwantedMimeParts: N

--0000000000002b9b9505f99748f9
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, Apr 17, 2023 at 7:35=E2=80=AFPM Mateusz Guzik <mjguzik@gmail.com> w=
rote:

> Oops, this fell through the cracks, apologies for such a late reply.
>
> On 3/31/23, Mark Johnston <markj@freebsd.org> wrote:
> > On Fri, Mar 31, 2023 at 08:41:41PM +0200, Mateusz Guzik wrote:
> >> On 3/30/23, Mark Johnston <markj@freebsd.org> wrote:
> >> > On Thu, Mar 30, 2023 at 05:36:54PM +0200, Mateusz Guzik wrote:
> >> >> I looked into it a little more, below you can find summary and step=
s
> >> >> forward.
> >> >>
> >> >> First a general statement: while ULE does have performance bugs, it
> >> >> has better basis than 4BSD to make scheduling decisions. Most notab=
ly
> >> >> it understands CPU topology, at least for cases which don't involve
> >> >> big.LITTLE. For any non-freak case where 4BSD performs better, it i=
s
> a
> >> >> bug in ULE if this is for any reason other than a tradeoff which ca=
n
> >> >> be tweaked to line them up. Or more to the point, there should not =
be
> >> >> any legitimate reason to use 4BSD these days and modulo the bugs
> >> >> below, you are probably losing on performance for doing so.
> >> >>
> >> >> Bugs reported in this thread by others and confirmed by me:
> >> >> 1. failure to load-balance when having n CPUs and n + 1 workers --
> the
> >> >> excess one stays on the same CPU thread, continuously penalizing
> >> >> the same victim. as a result total real time to execute a finite
> >> >> computation is longer than in the case of 4BSD
> >> >> 2. unfairness of nice -n 20 threads vs threads going frequently off
> >> >> CPU (e.g., due to I/O) -- after using only a fraction of the slice
> the
> >> >> victim has to wait for the cpu hog to use up its entire slice, rins=
e
> >> >> and repeat. This extends a 7+ minute buildkernel to over 67 minutes=
,
> >> >> not an issue on 4BSD
> >> >>
> >> >> I did not put almost any effort into investigating no 1. There is
> code
> >> >> which is supposed to rebalance load across CPUs, someone(tm) will
> have
> >> >> to sit through it -- for all I know the fix is trivial.
> >> >>
> >> >> Fixing number 2 makes *another* bug more acute and it complicates t=
he
> >> >> whole ordeal.
> >> >>
> >> >> Thus, bug reported by me:
> >> >> 3. interactivity scoring is bogus -- originally introduced to detec=
t
> >> >> "interactive" behavior by equating being off CPU with waiting for
> user
> >> >> input. One part of the problem is that it puts *all* non-preempted
> off
> >> >> CPU time into one bag: a voluntary sleep. This includes suffering
> from
> >> >> lock contention in the kernel, lock contention in the program itsel=
f,
> >> >
> >> > Note that time spent off-CPU on a turnstile is not counted as sleepi=
ng
> >> > for the purpose of interactivity scoring, so this observation applie=
s
> >> > only to sx, lockmgr and sleepable rm locks.  That's not to say that
> >> > this
> >> > behaviour is correct, but it doesn't apply to some of the most
> >> > contended
> >> > locks unless I'm missing something.
> >> >
> >>
> >> page busy (massively contested for fork/exec), pipe_lock and even
> >> not-locks like waitpid(!)
> >
> > A program that spends most of its time blocked in waitpid, like a shell=
,
> > interactive or not, should indeed have a higher scheduling priority...
> >
>
> Maybe it should, but perhaps not at the expense of a more
> latency-sensitive program like a video player.
>
> The very notion that off cpu =3D=3D interactive dates back to the 80s
> where it probably made sense, as the unix systems at the time were
> mostly just terminal-only and the shell would indeed fit here very
> nicely.
>
> >> >> file I/O and so on, none of which has bearing on how interactive or
> >> >> not the program might happen to be. A bigger part of the problem is
> >> >> that at least today, the graphical programs don't even act this way
> to
> >> >> begin with -- they stay on CPU *a lot*.
> >> >
> >> > I think this statement deserves more nuance.  I was on a video call
> >> > just
> >> > now and firefox was consuming about the equivalent of 20-30% of a CP=
U
> >> > across all threads.  What kind of graphical programs are you talking
> >> > about specifically?
> >> >
> >>
> >> you don't consider 20-30% a lot?
> >
> > I would expect a program consuming 20-30% of a CPU to be prioritized
> > higher than a CPU hog.  And in my experience, running builds while on a
> > call doesn't hurt anything (usually).  Again, there is room for
> > improvement, I don't claim the scheduler is perfect.
> >
>
> As noted one of the performance bugs is that the scheduler
> *unintentionally* penalizes threads which go off cpu a lot for short
> periods. If scheduler keeps them in the batch range and there is a hog
> in the area, they are getting disproportionately less cpu.
> kernel build is one example I noted -- several times in increase in
> total real time vs cpu hogs, while struggling to get any time. For all
> I know this bug is why it works fine for you.
>
> >> >> I asked people to provide me with the output of: dtrace -n
> >> >> 'sched:::on-cpu { @[execname] =3D lquantize(curthread->td_priority,=
 0,
> >> >> 224, 1); }' from their laptops/desktops.
> >> >>
> >> >> One finding is that most people (at least those who reported) use
> >> >> firefox.
> >> >>
> >> >> Another finding is that the browser is above the threshold which
> would
> >> >> be considered "interactive" for vast majority of the time in all
> >> >> reported cases.
> >> >
> >> > That is not true of the output that I sent.  There, most of the
> firefox
> >> > thread samples are in the interactive range [88-135].  Some show an
> >> > even
> >> > higher priority, maybe due to priority propagation.
> >> >
> >>
> >> That's not the interactive range. 88 is PRI_MIN_BATCH
> >
> > 88 is PRI_MIN_TIMESHARE (on main, stable/13 ranges are different I
> > think).  PRI_MIN_BATCH is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE =3D 88=
 +
> > 48 =3D 136.  Everything in [88-135] goes into the realtime queue.
> >
>
> You are right, I misread the code. static_boost setting prio to 72
> solidified my misread.
>
> Interestingly this does not change the crux of the matter -- that not
> interactive processes cluster in terms of priorities with ones which
> are interactive. You can see it in your own report.
>
> >> >> I booted a 2 thread vm with xfce and decided to click around. Spawn=
ed
> >> >> firefox, opened a file manager (Thunar) and from there I opened a
> >> >> movie to play with mpv. As root I spawned make -j 2 buildkernel. it
> >> >> was not particularly good :)
> >> >>
> >> >> I found that mpv spawns a bunch of threads, most notably 2 distinct
> >> >> threads for audio and video output. The one for video got a priorit=
y
> >> >> of 175, while the rest had either 88 or 89 -- the lowest for
> >> >> timesharing not considered interactive [note lower is considered
> >> >> better].
> >> >
> >> > Presumably all of the video decoding was done in software, since
> you're
> >> > running in a VM?  On my desktop, mpv does not consume much CPU and i=
s
> >> > entirely interactive.  Your test suggests that you expect ULE to
> >> > prioritize a CPU hog, which doesn't seem realistic absent some
> >> > scheduling hints from the user or the program itself.  Problem 2 is
> the
> >> > opposite problem: timesharing CPU hogs are allowed to starve other
> >> > timesharing threads.
> >> >
> >>
> >> Now that I pointed out anything >=3D 88 is *NOT* interactive, are you
> >> sure your mpv was considered interactive anyway?
> >
> > Yes.
> >
>
> See above :)
>
> >> I don't expect ULE to prioritize CPU hogs. I'm pointing out how a hog
> >> which was a part of an interactive program got shafted, further
> >> showing how the method based on counting off cpu time does not work.
> >
> > You're saying that interactivity scoring should take into account
> > overall process behaviour instead of just thread behaviour?  Sure, that
> > could be reasonable.
> >
>
> That's part of it, yes.
>
> >> >> At the same time the file manager who was left in the background ke=
pt
> >> >> doing evil syscall usage, as a result bouncing between a
> regular
> >> >> timesharing priority and one which made it "interactive", even thou=
gh
> >> >> the program was not touched for minutes.
> >> >>
> >> >> Or to put it differently, the scheduler failed to recognize that mp=
v
> >> >> is the program to prioritize, all while thinking the background tim=
e
> >> >> waster is the thing to look after (so to speak).
> >> >>
> >> >> This brings us to fixing problem 2: currently, due to the existence
> of
> >> >> said problem, the interactivity scoring woes are less acute -- the
> >> >> venerable make -j example is struggling to get CPU time, as a resul=
t
> >> >> messing with real interactive programs to a lesser extent. If that
> >> >> gets fixed, we are in a different boat altogether.
> >> >>
> >> >> I don't see a clean solution.
> >> >>
> >> >> Right now I'm toying with the idea of either:
> >> >> 1. having programs explicitly tell the kernel they are interactive
> >> >
> >> > I don't see how this can work.  It's not just traditional
> "interactive"
> >> > programs that benefit from this scoring, it applies also to network
> >> > servers and other programs which spend most of their time sleeping b=
ut
> >> > want to handle requests with low latency.
> >> >
> >> > Such an interface would also let any program request soft realtime
> >> > scheduling without giving up the ability to monopolize CPU time, whi=
ch
> >> > goes against ULE's fairness goals.
> >> >
> >>
> >> Clearly it would be gated with some permission, so only available on a
> >> desktop for example.
> >>
> >> Then again see my response else in the thread: x server could be
> >> patched to mark threads.
> >
> > To do what?
> >
>
> To tell the kernel they are interactive clients so that it does not
> have to speculate.
>
> Same with pulseaudio and whatever direct /dev/dsp consumer.
>

I'm a bit concerned about the scalability of this approach. Wouldn't that
be like *a lot* of patching?

Somehow it feels to me that the scheduler should be the one correctly
discerning if a process is interactive or not instead of the client
application defining itself as such.
If the scheduler does the job, then it should be able to update the
"interactivity status" of a client if the client behavior changes.
Otherwise I suppose that change in behavior should be implemented in the
patch itself which might not be easy to do (or could introduce poor
performance if done improperly).
I'm thinking of graphical applications that might be considered interactive
but that at some point consume a lot of cpu like shotcut or openshot, or
probably these days even web browsers.
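
To make that concrete, here is a rough sketch of how I understand a
run-time vs. sleep-time score to react when behavior changes. It is a
simplified model loosely based on my reading of sched_interact_score()
in sched_ule.c, not the actual kernel code: the Python function names,
the half value of 50 and the threshold of 30 are assumptions made for
illustration, and history decay is not modeled at all.

```python
def interact_score(runtime, slptime, half):
    """Score in [0, 2 * half]; lower means "more interactive".

    runtime and slptime are recent on-CPU and voluntary sleep time in
    the same (arbitrary) unit. Hypothetical helper, not a kernel API.
    Times should be much larger than half, otherwise integer division
    makes the result very coarse.
    """
    if runtime > slptime:
        # Mostly running: the score lands in the upper half.
        return half + (half - slptime // max(1, runtime // half))
    if slptime > runtime:
        # Mostly sleeping: the score lands in the lower half.
        return runtime // max(1, slptime // half)
    # Equal times: the midpoint, or 0 if the thread has no history yet.
    return half if runtime else 0


def is_interactive(runtime, slptime):
    # Assumed values: half of 50 and an interactivity threshold of 30.
    return interact_score(runtime, slptime, 50) < 30
```

With these assumptions, a thread that ran 100 units and slept 900 scores
5 and counts as interactive. If the same program starts rendering and
the totals become 900 run vs. 100 slept, the score rises to 95 and it
drops out of the interactive class without any hint from the program.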

Also, since the scheduler is such a critical piece of software, I agree
with Jeff Roberson that a bigger test suite, including regression tests,
is necessary to ensure nothing breaks and we don't trade some use cases
for others.

P.S: This is a really informative thread. Thanks for working on this!


>
> >> And it does not go against any fairness goals -- it very much can be
> >> achieved, but one has information who can be put off cpu for a longer
> >> time without introducing issues.
> >>
> >> >> 2. adding a scheduler hook to /dev/dsp -- the observation is that i=
f
> a
> >> >> program is producing sound it probably should get some cpu time in =
a
> >> >> timely manner. this would cover audio/video players and web browser=
s,
> >> >
> >> > On my system at least firefox doesn't open /dev/dsp, it sends audio
> >> > streams to pulseaudio.
> >> >
> >>
> >> I think I noted elsewhere in the thread that pulseaudio may need the
> >> same treatment as the x server.
> >>
> >> >> but would not cover other programs (say a pdf reader). it may be it
> is
> >> >> good enough though
> >> >
> >> > I think some more thorough analysis, using tools like schedgraph or
> >> > KUtrace[1], is needed to characterize the problems you are reporting
> >> > with interactivity scoring.  It's also not clear how any of this wou=
ld
> >> > address the problem that started this thread, wherein two competing
> >> > timesharing (i.e., non-interactive) workloads get uneven amounts of
> CPU
> >> > time.
> >> >
> >>
> >> I explicitly stated I have not looked into this bit.
> >>
> >> > There is absolutely room for improvement in ULE's scheduling
> decisions.
> >> > It seems to be common practice to tune various ULE parameters to get
> >> > better interactive performance, but in general I see no analysis
> >> > explaining /why/ exactly they help and what goes wrong with the
> default
> >> > parameter values in specific workloads.  schedgraph is a very useful
> >> > tool for this sort of thing.
> >> >
> >>
> >> I tried schedgraph in the past to look at buildkernel and found it
> >> does not cope with the amount of threads, at least on my laptop.
> >>
> >> > Such tools are also required to rule out bugs in ULE itself, when
> >> > looking at
> >> > abnormal scheduling behaviour.  Last year some scheduler races[2] we=
re
> >> > fixed that apparently hurt system performance on EPYC quite a bit.  =
I
> >> > was told privately that applying those patches to 13.1 improved IPSe=
c
> >> > throughput by ~25% on EPYC, and I wouldn't be surprised if there are
> >> > more improvements to be had which don't involve modifying core
> >> > heuristics of the scheduler.  Either way this requires deeper analys=
is
> >> > of ULE's micro-level behaviour; I don't think "interactivity scoring
> is
> >> > bogus" is a useful starting point.
> >> >
> >>
> >> I provided explicit examples how it marked a background thread as
> >> interactive, while the real hard worker (if you will) as not
> >> interactive, because said worker was not acting the way ULE expects.
> >>
> >> A bandaid for the time being will stop shafting processes giving up
> >> their time slice early in the batch queue, along with some fairness
> >> for the rest who does not (like firefox). I'll hack it up for testing.
> >>
> >> --
> >> Mateusz Guzik <mjguzik gmail.com>
> >
>
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
>

--0000000000002b9b9505f99748f9
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote">=
<div dir=3D"ltr" class=3D"gmail_attr">On Mon, Apr 17, 2023 at 7:35=E2=80=AF=
PM Mateusz Guzik &lt;<a href=3D"mailto:mjguzik@gmail.com" target=3D"_blank"=
>mjguzik@gmail.com</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quote=
" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);=
padding-left:1ex">Ops, this fell through the cracks, apologies for such a l=
ate reply.<br>
<br>
On 3/31/23, Mark Johnston &lt;<a href=3D"mailto:markj@freebsd.org" target=
=3D"_blank">markj@freebsd.org</a>&gt; wrote:<br>
&gt; On Fri, Mar 31, 2023 at 08:41:41PM +0200, Mateusz Guzik wrote:<br>
&gt;&gt; On 3/30/23, Mark Johnston &lt;<a href=3D"mailto:markj@freebsd.org"=
 target=3D"_blank">markj@freebsd.org</a>&gt; wrote:<br>
&gt;&gt; &gt; On Thu, Mar 30, 2023 at 05:36:54PM +0200, Mateusz Guzik wrote=
:<br>
&gt;&gt; &gt;&gt; I looked into it a little more, below you can find summar=
y and steps<br>
&gt;&gt; &gt;&gt; forward.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; First a general statement: while ULE does have performanc=
e bugs, it<br>
&gt;&gt; &gt;&gt; has better basis than 4BSD to make scheduling decisions. =
Most notably<br>
&gt;&gt; &gt;&gt; it understands CPU topology, at least for cases which don=
&#39;t involve<br>
&gt;&gt; &gt;&gt; big.LITTLE. For any non-freak case where 4BSD performs be=
tter, it is a<br>
&gt;&gt; &gt;&gt; bug in ULE if this is for any reason other than a tradeof=
f which can<br>
&gt;&gt; &gt;&gt; be tweaked to line them up. Or more to the point, there s=
hould not be<br>
&gt;&gt; &gt;&gt; any legitimate reason to use 4BSD these days and modulo t=
he bugs<br>
&gt;&gt; &gt;&gt; below, you are probably losing on performance for doing s=
o.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; Bugs reported in this thread by others and confirmed by m=
e:<br>
&gt;&gt; &gt;&gt; 1. failure to load-balance when having n CPUs and n + 1 w=
orkers -- the<br>
&gt;&gt; &gt;&gt; excess one stays on one the same CPU thread continuously =
penalizing<br>
&gt;&gt; &gt;&gt; the same victim. as a result total real time to execute a=
 finite<br>
&gt;&gt; &gt;&gt; computation is longer than in the case of 4BSD<br>
&gt;&gt; &gt;&gt; 2. unfairness of nice -n 20 threads vs threads going freq=
uently off<br>
&gt;&gt; &gt;&gt; CPU (e.g., due to I/O) -- after using only a fraction of =
the slice the<br>
&gt;&gt; &gt;&gt; victim has to wait for the cpu hog to use up its entire s=
lice, rinse<br>
&gt;&gt; &gt;&gt; and repeat. This extends a 7+ minute buildkernel to over =
67 minutes,<br>
&gt;&gt; &gt;&gt; not an issue on 4BSD<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; I did not put almost any effort into investigating no 1. =
There is code<br>
&gt;&gt; &gt;&gt; which is supposed to rebalance load across CPUs, someone(=
tm) will have<br>
&gt;&gt; &gt;&gt; to sit through it -- for all I know the fix is trivial.<b=
r>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; Fixing number 2 makes *another* bug more acute and it com=
plicates the<br>
&gt;&gt; &gt;&gt; whole ordeal.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; Thus, bug reported by me:<br>
&gt;&gt; &gt;&gt; 3. interactivity scoring is bogus -- originally introduce=
d to detect<br>
&gt;&gt; &gt;&gt; &quot;interactive&quot; behavior by equating being off CP=
U with waiting for user<br>
&gt;&gt; &gt;&gt; input. One part of the problem is that it puts *all* non-=
preempted off<br>
&gt;&gt; &gt;&gt; CPU time into one bag: a voluntary sleep. This includes s=
uffering from<br>
&gt;&gt; &gt;&gt; lock contention in the kernel, lock contention in the pro=
gram itself,<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; Note that time spent off-CPU on a turnstile is not counted as=
 sleeping<br>
&gt;&gt; &gt; for the purpose of interactivity scoring, so this observation=
 applies<br>
&gt;&gt; &gt; only to sx, lockmgr and sleepable rm locks.=C2=A0 That&#39;s =
not to say that<br>
&gt;&gt; &gt; this<br>
&gt;&gt; &gt; behaviour is correct, but it doesn&#39;t apply to some of the=
 most<br>
&gt;&gt; &gt; contended<br>
&gt;&gt; &gt; locks unless I&#39;m missing something.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; page busy (massively contested for fork/exec), pipe_lock and even<=
br>
&gt;&gt; not-locks like waitpid(!)<br>
&gt;<br>
&gt; A program that spends most of its time blocked in waitpid, like a shel=
l,<br>
&gt; interactive or not, should indeed have a higher scheduling priority...=
<br>
&gt;<br>
<br>
Maybe it should, but perhaps not at the expense of a more<br>
latency-sensitive program like a video player.<br>
<br>
The very notion that off cpu =3D=3D interactive dates back to the 80s<br>
where it probably made sense, as the unix systems at the time were<br>
mostly just terminal-only and the shell would indeed fit here very<br>
nicely.<br>
<br>
&gt;&gt; &gt;&gt; file I/O and so on, none of which has bearing on how inte=
ractive or<br>
&gt;&gt; &gt;&gt; not the program might happen to be. A bigger part of the =
problem is<br>
&gt;&gt; &gt;&gt; that at least today, the graphical programs don&#39;t eve=
n act this way to<br>
&gt;&gt; &gt;&gt; begin with -- they stay on CPU *a lot*.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; I think this statement deserves more nuance.=C2=A0 I was on a=
 video call<br>
&gt;&gt; &gt; just<br>
&gt;&gt; &gt; now and firefox was consuming about the equivalent of 20-30% =
of a CPU<br>
&gt;&gt; &gt; across all threads.=C2=A0 What kind of graphical programs are=
 you talking<br>
&gt;&gt; &gt; about specifically?<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; you don&#39;t consider 20-30% a lot?<br>
&gt;<br>
&gt; I would expect a program consuming 20-30% of a CPU to be prioritized<b=
r>
&gt; higher than a CPU hog.=C2=A0 And in my experience, running builds whil=
e on a<br>
&gt; call doesn&#39;t hurt anything (usually).=C2=A0 Again, there is room f=
or<br>
&gt; improvement, I don&#39;t claim the scheduler is perfect.<br>
&gt;<br>
<br>
As noted one of the performance bugs is that the scheduler<br>
*unintentionally* penalizes threads which go off cpu a lot for short<br>
periods. If scheduler keeps them in the batch range and there is a hog<br>
in the area, they are getting disproportionately less cpu.<br>
kernel build is one example I noted -- several times in increase in<br>
total real time vs cpu hogs, while struggling to get any time. For all<br>
I know this bug is why it works fine for you.<br>
<br>
&gt;&gt; &gt;&gt; I asked people to provide me with the output of: dtrace -=
n<br>
&gt;&gt; &gt;&gt; &#39;sched:::on-cpu { @[execname] =3D lquantize(curthread=
-&gt;td_priority, 0,<br>
&gt;&gt; &gt;&gt; 224, 1); }&#39; from their laptops/desktops.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; One finding is that most people (at least those who repor=
ted) use<br>
&gt;&gt; &gt;&gt; firefox.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; Another finding is that the browser is above the threshol=
d which would<br>
&gt;&gt; &gt;&gt; be considered &quot;interactive&quot; for vast majority o=
f the time in all<br>
&gt;&gt; &gt;&gt; reported cases.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; That is not true of the output that I sent.=C2=A0 There, most=
 of the firefox<br>
&gt;&gt; &gt; thread samples are in the interactive range [88-135].=C2=A0 S=
ome show an<br>
&gt;&gt; &gt; even<br>
&gt;&gt; &gt; higher priority, maybe due to priority propagation.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; That&#39;s not the interactive range. 88 is PRI_MIN_BATCH<br>
&gt;<br>
&gt; 88 is PRI_MIN_TIMESHARE (on main, stable/13 ranges are different I<br>
&gt; think).=C2=A0 PRI_MIN_BATCH is PRI_MIN_TIMESHARE + PRI_INTERACT_RANGE =
=3D 88 +<br>
&gt; 48 =3D 136.=C2=A0 Everything in [88-135] goes into the realtime queue.=
<br>
&gt;<br>
<br>
You are right, I misread the code. static_boost setting prio to 72<br>
solidified my misread.<br>
<br>
Interestingly this does not change the crux of the matter -- that not<br>
interactive processes cluster in terms of priorities with one which<br>
are interactive. You can see it in your own report.<br>
<br>
&gt;&gt; &gt;&gt; I booted a 2 thread vm with xfce and decided to click aro=
und. Spawned<br>
&gt;&gt; &gt;&gt; firefox, opened a file manager (Thunar) and from there I =
opened a<br>
&gt;&gt; &gt;&gt; movie to play with mpv. As root I spawned make -j 2 build=
kernel. it<br>
&gt;&gt; &gt;&gt; was not particularly good :)<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; I found that mpv spawns a bunch of threads, most notably =
2 distinct<br>
&gt;&gt; &gt;&gt; threads for audio and video output. The one for video got=
 a priority<br>
&gt;&gt; &gt;&gt; of 175, while the rest had either 88 or 89 -- the lowest =
for<br>
&gt;&gt; &gt;&gt; timesharing not considered interactive [note lower is con=
sidered<br>
&gt;&gt; &gt;&gt; better].<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; Presumably all of the video decoding was done in software, si=
nce you&#39;re<br>
&gt;&gt; &gt; running in a VM?=C2=A0 On my desktop, mpv does not consume mu=
ch CPU and is<br>
&gt;&gt; &gt; entirely interactive.=C2=A0 Your test suggests that you expec=
t ULE to<br>
&gt;&gt; &gt; prioritize a CPU hog, which doesn&#39;t seem realistic absent=
 some<br>
&gt;&gt; &gt; scheduling hints from the user or the program itself.=C2=A0 P=
roblem 2 is the<br>
&gt;&gt; &gt; opposite problem: timesharing CPU hogs are allowed to starve =
other<br>
&gt;&gt; &gt; timesharing threads.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; Now that I pointed out anything &gt;=3D 88 is *NOT* interactive, a=
re you<br>
&gt;&gt; sure your mpv was considered interactive anyway?<br>
&gt;<br>
&gt; Yes.<br>
&gt;<br>
<br>
See above :)<br>
<br>
&gt;&gt; I don&#39;t expect ULE to prioritize CPU hogs. I&#39;m pointing ou=
t how a hog<br>
&gt;&gt; which was a part of an interactive program got shafted, further<br=
>
&gt;&gt; showing how the method based on counting off cpu time does not wor=
k.<br>
&gt;<br>
&gt; You&#39;re saying that interactivity scoring should take into account<=
br>
&gt; overall process behaviour instead of just thread behviour?=C2=A0 Sure,=
 that<br>
&gt; could be reasonable.<br>
&gt;<br>
<br>
That&#39;s part of it, yes.<br>
<br>
&gt;&gt; &gt;&gt; At the same time the file manager who was left in the bac=
kground kept<br>
&gt;&gt; &gt;&gt; doing evil syscall usage, which as a result bouncing betw=
een a regular<br>
&gt;&gt; &gt;&gt; timesharing priority and one which made it &quot;interact=
ive&quot;, even though<br>
&gt;&gt; &gt;&gt; the program was not touched for minutes.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; Or to put it differently, the scheduler failed to recogni=
ze that mpv<br>
&gt;&gt; &gt;&gt; is the program to prioritize, all while thinking the back=
ground time<br>
&gt;&gt; &gt;&gt; waster is the thing to look after (so to speak).<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; This brings us to fixing problem 2: currently, due to the=
 existence of<br>
&gt;&gt; &gt;&gt; said problem, the interactivity scoring woes are less acu=
te -- the<br>
&gt;&gt; &gt;&gt; venerable make -j example is struggling to get CPU time, =
as a result<br>
&gt;&gt; &gt;&gt; messing with real interactive programs to a lesser extent=
. If that<br>
&gt;&gt; &gt;&gt; gets fixed, we are in a different boat altogether.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; I don&#39;t see a clean solution.<br>
&gt;&gt; &gt;&gt;<br>
&gt;&gt; &gt;&gt; Right now I&#39;m toying with the idea of either:<br>
&gt;&gt; &gt;&gt; 1. having programs explicitly tell the kernel they are in=
teractive<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; I don&#39;t see how this can work.=C2=A0 It&#39;s not just tr=
aditional &quot;interactive&quot;<br>
&gt;&gt; &gt; programs that benefit from this scoring, it applies also to n=
etwork<br>
&gt;&gt; &gt; servers and other programs which spend most of their time sle=
eping but<br>
&gt;&gt; &gt; want to handle requests with low latency.<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; Such an interface would also let any program request soft rea=
ltime<br>
&gt;&gt; &gt; scheduling without giving up the ability to monopolize CPU ti=
me, which<br>
&gt;&gt; &gt; goes against ULE&#39;s fairness goals.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; Clearly it would be gated with some permission, so only available =
on a<br>
&gt;&gt; desktop for example.<br>
&gt;&gt;<br>
&gt;&gt; Then again see my response else in the thread: x server could be<b=
r>
&gt;&gt; patched to mark threads.<br>
&gt;<br>
&gt; To do what?<br>
&gt;<br>
<br>
To tell the kernel they are interactive clients so that it does not<br>
have to speculate.<br>
<br>
Same with pulseaudio and whatever direct /dev/dsp consumer.<br></blockquote=
><div><br></div><div>I&#39;m a bit concerned about the scalability of this =
approach. Wouldn&#39;t that be like *a lot* of patching?</div><div><br></di=
v><div>Somehow it feels to me that the scheduler should be the one correctl=
y discerning if a process is interactive or not instead of the client appli=
cation defining itself as such.</div><div>If the scheduler does the job, th=
en it should be able to update the &quot;interactivity status&quot; of a cl=
ient if the client behavior changes.</div><div>Otherwise I suppose that cha=
nge in behavior should be implemented in the patch itself which might not b=
e easy to do (or could introduce poor performance if done improperly).</div=
><div>I&#39;m thinking of graphical applications that might be considered i=
nteractive but that at some point consume a lot of CPU like Shotcut or Open=
Shot, or probably these days even web browsers.<br></div><div><br></div><di=
v>Also, since the scheduler is such a critical piece of software, I agree w=
ith Jeff Roberson that a bigger test suite, including regression tests, is=
 necessary to ensure nothing breaks and we don&#39;t trade some use cases f=
or others.<br></div><div><br></div><div>P.S: This is a really informative t=
hread. Thanks for working on this!<br></div><div>=C2=A0</div><blockquote cl=
ass=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid=
 rgb(204,204,204);padding-left:1ex">
<br>
&gt;&gt; And it does not go against any fairness goals -- it very much can =
be<br>
&gt;&gt; achieved, but one has information who can be put off cpu for a lon=
ger<br>
&gt;&gt; time without introducing issues.<br>
&gt;&gt;<br>
&gt;&gt; &gt;&gt; 2. adding a scheduler hook to /dev/dsp -- the observation=
 is that if a<br>
&gt;&gt; &gt;&gt; program is producing sound it probably should get some cp=
u time in a<br>
&gt;&gt; &gt;&gt; timely manner. this would cover audio/video players and w=
eb browsers,<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; On my system at least firefox doesn&#39;t open /dev/dsp, it s=
ends audio<br>
&gt;&gt; &gt; streams to pulseaudio.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; I think I noted elsewhere in the thread that pulseaudio may need t=
he<br>
&gt;&gt; same treatment as the x server.<br>
&gt;&gt;<br>
&gt;&gt; &gt;&gt; but would not cover other programs (say a pdf reader). it=
 may be it is<br>
&gt;&gt; &gt;&gt; good enough though<br>
&gt;&gt; &gt;<br>
&gt;&gt; &gt; I think some more thorough analysis, using tools like schedgr=
aph or<br>
&gt;&gt; &gt; KUtrace[1], is needed to characterize the problems you are re=
porting<br>
&gt;&gt; &gt; with interactivity scoring.=C2=A0 It&#39;s also not clear how=
 any of this would<br>
&gt;&gt; &gt; address the problem that started this thread, wherein two com=
peting<br>
&gt;&gt; &gt; timesharing (i.e., non-interactive) workloads get uneven amou=
nts of CPU<br>
&gt;&gt; &gt; time.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; I explicitly stated I have not looked into this bit.<br>
&gt;&gt;<br>
&gt;&gt; &gt; There is absolutely room for improvement in ULE&#39;s schedul=
ing decisions.<br>
&gt;&gt; &gt; It seems to be common practice to tune various ULE parameters=
 to get<br>
&gt;&gt; &gt; better interactive performance, but in general I see no analy=
sis<br>
&gt;&gt; &gt; explaining /why/ exactly they help and what goes wrong with t=
he default<br>
&gt;&gt; &gt; parameter values in specific workloads.=C2=A0 schedgraph is a=
 very useful<br>
&gt;&gt; &gt; tool for this sort of thing.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; I tried schedgraph in the past to look at buildkernel and found it=
<br>
&gt;&gt; does not cope with the amount of threads, at least on my laptop.<b=
r>
&gt;&gt;<br>
&gt;&gt; &gt; Such tools are also required to rule out bugs in ULE itself,=
 when looking<br>
&gt;&gt; &gt; at<br>
&gt;&gt; &gt; abnormal scheduling behaviour.=C2=A0 Last year some scheduler=
 races[2] were<br>
&gt;&gt; &gt; fixed that apparently hurt system performance on EPYC quite a=
 bit.=C2=A0 I<br>
&gt;&gt; &gt; was told privately that applying those patches to 13.1 improv=
ed IPSec<br>
&gt;&gt; &gt; throughput by ~25% on EPYC, and I wouldn&#39;t be surprised i=
f there are<br>
&gt;&gt; &gt; more improvements to be had which don&#39;t involve modifying=
 core<br>
&gt;&gt; &gt; heuristics of the scheduler.=C2=A0 Either way this requires d=
eeper analysis<br>
&gt;&gt; &gt; of ULE&#39;s micro-level behaviour; I don&#39;t think &quot;i=
nteractivity scoring is<br>
&gt;&gt; &gt; bogus&quot; is a useful starting point.<br>
&gt;&gt; &gt;<br>
&gt;&gt;<br>
&gt;&gt; I provided explicit examples how it marked a background thread as<=
br>
&gt;&gt; interactive, while the real hard worker (if you will) as not<br>
&gt;&gt; interactive, because said worker was not acting the way ULE expect=
s.<br>
&gt;&gt;<br>
&gt;&gt; A bandaid for the time being will stop shafting processes giving u=
p<br>
&gt;&gt; their time slice early in the batch queue, along with some fairnes=
s<br>
&gt;&gt; for the rest who do not (like firefox). I&#39;ll hack it up for =
testing.<br>
&gt;&gt;<br>
&gt;&gt; --<br>
&gt;&gt; Mateusz Guzik &lt;mjguzik <a href=3D"http://gmail.com" rel=3D"nore=
ferrer" target=3D"_blank">gmail.com</a>&gt;<br>
&gt;<br>
<br>
<br>
-- <br>
Mateusz Guzik &lt;mjguzik <a href=3D"http://gmail.com" rel=3D"noreferrer" t=
arget=3D"_blank">gmail.com</a>&gt;<br>
<br>
</blockquote></div></div>

--0000000000002b9b9505f99748f9--