Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 31 Mar 2023 20:41:41 +0200
From:      Mateusz Guzik <mjguzik@gmail.com>
To:        Mark Johnston <markj@freebsd.org>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: Periodic rant about SCHED_ULE
Message-ID:  <CAGudoHGaQxseby2Nc2_57HZ1ZLOwWSyrmZ_eUx15jLCm7znnsw@mail.gmail.com>
In-Reply-To: <ZCXsPWyIVmxvvHjE@nuc>
References:  <8173cc7e-e934-dd5c-312a-1dfa886941aa@FreeBSD.org> <8cfdb951-9b1f-ecd3-2291-7a528e1b042c@m5p.com> <c3f5f667-ba0b-c40c-b8a6-19d1c9c63c5f@FreeBSD.org> <ZBtRJhNHluj5Nzyk@troutmask.apl.washington.edu> <CAGudoHEj%2BkoaYhkjzDE5KX9OsCno=X5M_E3z9uwg6Pg7dtqTsA@mail.gmail.com> <CAGudoHHxTT-Cn11zcFB3ZwF76UcRUv=QS28RLgzd=hVehTy0Kg@mail.gmail.com> <CAGudoHGoh30O-3O0jjwevDvP43-ykUt6JUDiwRNW918VZfybhA@mail.gmail.com> <CAGudoHEWfy61XSMhXdYOrKWVotuC0Kc6NSWiaaZCy6aQhbvXoQ@mail.gmail.com> <CAGudoHFPqz_LtsVNnz4P2gyKXz5Z8hU%2Bv6QYGizm4%2BDtZRn8Yg@mail.gmail.com> <CAGudoHGzBjXjXZFs%2BqZJUS-M6VeX5=LB2ifRLP7hFBZXPvqP7g@mail.gmail.com> <ZCXsPWyIVmxvvHjE@nuc>

next in thread | previous in thread | raw e-mail | index | archive | help
On 3/30/23, Mark Johnston <markj@freebsd.org> wrote:
> On Thu, Mar 30, 2023 at 05:36:54PM +0200, Mateusz Guzik wrote:
>> I looked into it a little more, below you can find summary and steps
>> forward.
>>
>> First a general statement: while ULE does have performance bugs, it
>> has better basis than 4BSD to make scheduling decisions. Most notably
>> it understands CPU topology, at least for cases which don't involve
>> big.LITTLE. For any non-freak case where 4BSD performs better, it is a
>> bug in ULE if this is for any reason other than a tradeoff which can
>> be tweaked to line them up. Or more to the point, there should not be
>> any legitimate reason to use 4BSD these days and modulo the bugs
>> below, you are probably losing on performance for doing so.
>>
>> Bugs reported in this thread by others and confirmed by me:
>> 1. failure to load-balance when having n CPUs and n + 1 workers -- the
>> excess one stays on one the same CPU thread continuously penalizing
>> the same victim. as a result total real time to execute a finite
>> computation is longer than in the case of 4BSD
>> 2. unfairness of nice -n 20 threads vs threads going frequently off
>> CPU (e.g., due to I/O) -- after using only a fraction of the slice the
>> victim has to wait for the cpu hog to use up its entire slice, rinse
>> and repeat. This extends a 7+ minute buildkernel to over 67 minutes,
>> not an issue on 4BSD
>>
>> I did not put almost any effort into investigating no 1. There is code
>> which is supposed to rebalance load across CPUs, someone(tm) will have
>> to sit through it -- for all I know the fix is trivial.
>>
>> Fixing number 2 makes *another* bug more acute and it complicates the
>> whole ordeal.
>>
>> Thus, bug reported by me:
>> 3. interactivity scoring is bogus -- originally introduced to detect
>> "interactive" behavior by equating being off CPU with waiting for user
>> input. One part of the problem is that it puts *all* non-preempted off
>> CPU time into one bag: a voluntary sleep. This includes suffering from
>> lock contention in the kernel, lock contention in the program itself,
>
> Note that time spent off-CPU on a turnstile is not counted as sleeping
> for the purpose of interactivity scoring, so this observation applies
> only to sx, lockmgr and sleepable rm locks.  That's not to say that this
> behaviour is correct, but it doesn't apply to some of the most contended
> locks unless I'm missing something.
>

page busy (massively contested for fork/exec), pipe_lock and even
not-locks like waitpid(!)

>> file I/O and so on, none of which has bearing on how interactive or
>> not the program might happen to be. A bigger part of the problem is
>> that at least today, the graphical programs don't even act this way to
>> begin with -- they stay on CPU *a lot*.
>
> I think this statement deserves more nuance.  I was on a video call just
> now and firefox was consuming about the equivalent of 20-30% of a CPU
> across all threads.  What kind of graphical programs are you talking
> about specifically?
>

you don't consider 20-30% a lot?

>> I asked people to provide me with the output of: dtrace -n
>> 'sched:::on-cpu { @[execname] = lquantize(curthread->td_priority, 0,
>> 224, 1); }' from their laptops/desktops.
>>
>> One finding is that most people (at least those who reported) use
>> firefox.
>>
>> Another finding is that the browser is above the threshold which would
>> be considered "interactive" for vast majority of the time in all
>> reported cases.
>
> That is not true of the output that I sent.  There, most of the firefox
> thread samples are in the interactive range [88-135].  Some show an even
> higher priority, maybe due to priority propagation.
>

That's not the interactive range. 88 is PRI_MIN_BATCH

Then in tdq_runq_add you can see:

       if (pri < PRI_MIN_BATCH) { <-- aka 88
                ts->ts_runq = &tdq->tdq_realtime; <-- interactive
threads go here
        } else if (pri <= PRI_MAX_BATCH) { <-- this is where firefox
is along with make et al
                ts->ts_runq = &tdq->tdq_timeshare;
                KASSERT(pri <= PRI_MAX_BATCH && pri >= PRI_MIN_BATCH,
                        ("Invalid priority %d on timeshare runq", pri));
                /*
                 * This queue contains only priorities between MIN and MAX
                 * batch.  Use the whole queue to represent these values.
                 */
                if ((flags & (SRQ_BORROWING|SRQ_PREEMPTED)) == 0) {
                        pri = RQ_NQS * (pri - PRI_MIN_BATCH) / PRI_BATCH_RANGE;
                        pri = (pri + tdq->tdq_idx) % RQ_NQS;
                        /*
                         * This effectively shortens the queue by one so we
                         * can have a one slot difference between idx and
                         * ridx while we wait for threads to drain.
                         */
                        if (tdq->tdq_ridx != tdq->tdq_idx &&
                            pri == tdq->tdq_ridx)
                                pri = (unsigned char)(pri - 1) % RQ_NQS;
                } else
                        pri = tdq->tdq_ridx;
                runq_add_pri(ts->ts_runq, td, pri, flags);
                return;
        } else
                ts->ts_runq = &tdq->tdq_idle;

>> I booted a 2 thread vm with xfce and decided to click around. Spawned
>> firefox, opened a file manager (Thunar) and from there I opened a
>> movie to play with mpv. As root I spawned make -j 2 buildkernel. it
>> was not particularly good :)
>>
>> I found that mpv spawns a bunch of threads, most notably 2 distinct
>> threads for audio and video output. The one for video got a priority
>> of 175, while the rest had either 88 or 89 -- the lowest for
>> timesharing not considered interactive [note lower is considered
>> better].
>
> Presumably all of the video decoding was done in software, since you're
> running in a VM?  On my desktop, mpv does not consume much CPU and is
> entirely interactive.  Your test suggests that you expect ULE to
> prioritize a CPU hog, which doesn't seem realistic absent some
> scheduling hints from the user or the program itself.  Problem 2 is the
> opposite problem: timesharing CPU hogs are allowed to starve other
> timesharing threads.
>

Now that I pointed out anything >= 88 is *NOT* interactive, are you
sure your mpv was considered interactive anyway?

I don't expect ULE to prioritize CPU hogs. I'm pointing out how a hog
which was a part of an interactive program got shafted, further
showing how the method based on counting off cpu time does not work.

>> At the same time the file manager who was left in the background kept
>> doing evil syscall usage, which as a result bouncing between a regular
>> timesharing priority and one which made it "interactive", even though
>> the program was not touched for minutes.
>>
>> Or to put it differently, the scheduler failed to recognize that mpv
>> is the program to prioritize, all while thinking the background time
>> waster is the thing to look after (so to speak).
>>
>> This brings us to fixing problem 2: currently, due to the existence of
>> said problem, the interactivity scoring woes are less acute -- the
>> venerable make -j example is struggling to get CPU time, as a result
>> messing with real interactive programs to a lesser extent. If that
>> gets fixed, we are in a different boat altogether.
>>
>> I don't see a clean solution.
>>
>> Right now I'm toying with the idea of either:
>> 1. having programs explicitly tell the kernel they are interactive
>
> I don't see how this can work.  It's not just traditional "interactive"
> programs that benefit from this scoring, it applies also to network
> servers and other programs which spend most of their time sleeping but
> want to handle requests with low latency.
>
> Such an interface would also let any program request soft realtime
> scheduling without giving up the ability to monopolize CPU time, which
> goes against ULE's fairness goals.
>

Clearly it would be gated with some permission, so only available on a
desktop for example.

Then again see my response else in the thread: x server could be
patched to mark threads.

And it does not go against any fairness goals -- it very much can be
achieved, but one has information who can be put off cpu for a longer
time without introducing issues.

>> 2. adding a scheduler hook to /dev/dsp -- the observation is that if a
>> program is producing sound it probably should get some cpu time in a
>> timely manner. this would cover audio/video players and web browsers,
>
> On my system at least firefox doesn't open /dev/dsp, it sends audio
> streams to pulseaudio.
>

I think I noted elsewhere in the thread that pulseaudio may need the
same treatment as the x server.

>> but would not cover other programs (say a pdf reader). it may be it is
>> good enough though
>
> I think some more thorough analysis, using tools like schedgraph or
> KUtrace[1], is needed to characterize the problems you are reporting
> with interactivity scoring.  It's also not clear how any of this would
> address the problem that started this thread, wherein two competing
> timesharing (i.e., non-interactive) workloads get uneven amounts of CPU
> time.
>

I explicitly stated I have not looked into this bit.

> There is absolutely room for improvement in ULE's scheduling decisions.
> It seems to be common practice to tune various ULE parameters to get
> better interactive performance, but in general I see no analysis
> explaining /why/ exactly they help and what goes wrong with the default
> parameter values in specific workloads.  schedgraph is a very useful
> tool for this sort of thing.
>

I tried schedgraph in the past to look at buildkernel and found it
does not cope with the amount of threads, at least on my laptop.

> Such tools also required to rule out bugs in ULE itself, when looking at
> abnormal scheduling behaviour.  Last year some scheduler races[2] were
> fixed that apparently hurt system performance on EPYC quite a bit.  I
> was told privately that applying those patches to 13.1 improved IPSec
> throughput by ~25% on EPYC, and I wouldn't be surprised if there are
> more improvements to be had which don't involve modifying core
> heuristics of the scheduler.  Either way this requires deeper analysis
> of ULE's micro-level behaviour; I don't think "interactivity scoring is
> bogus" is a useful starting point.
>

I provided explicit examples how it marked a background thread as
interactive, while the real hard worker (if you will) as not
interactive, because said worker was not acting the way ULE expects.

A bandaid for the time being will stop shafting processes giving up
their time slice early in the batch queue, along with some fairness
for the rest who does not (like firefox). I'll hack it up for testing.

-- 
Mateusz Guzik <mjguzik gmail.com>



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAGudoHGaQxseby2Nc2_57HZ1ZLOwWSyrmZ_eUx15jLCm7znnsw>