Date:      Thu, 20 Apr 2023 23:34:22 +0200
From:      Mateusz Guzik <mjguzik@gmail.com>
To:        Jeff Roberson <jroberson@jroberson.net>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: ULE process to resolution
Message-ID:  <CAGudoHFHb_e=FDSDchr0xszE9ge+knVZZ-0-Y6rkheB6rjX9Ww@mail.gmail.com>
In-Reply-To: <CAGudoHGvX7+Jz+=rVH_tgY1AA9agruhsbPzuTAy2sZ8wuL1JwQ@mail.gmail.com>
References:  <a6066590-0b4d-b332-102a-9c2432cdfec6@jroberson.net> <CAGudoHGvX7+Jz+=rVH_tgY1AA9agruhsbPzuTAy2sZ8wuL1JwQ@mail.gmail.com>

On 4/4/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> Hello,
>
> On 3/31/23, Jeff Roberson <jroberson@jroberson.net> wrote:
>> As I read these threads I can state with a high degree of confidence that
>> many of these tests worked with superior results with ULE at one time.
>> It may be that tradeoffs have changed or exposed weaknesses, it may also
>> be that it's simply been broken over time.  I see a large number of
>> commits intended to address point issues and wonder whether we adequately
>> explored the consequences.  Indeed I see solutions involving tunables
>> proposed here that will definitively break other cases.
>>
>
> One of the reporters claims the bug they complain about has been there
> since the early days. This made me curious how many of the problems
> reproduce on something like 7.1 (dated 2009); to that end I created an
> 8 core vm on which I ran a bunch of tests, in addition to testing on
> main. All 3 problems reported below reproduced there, no X testing
> though :)
>
> Bugs (one not reported in the other thread):
> 1. threads walking around the machine when spending little time off
> cpu, all while the machine is otherwise idle
>
> The problem with this on bare metal is that the victim cpu may be
> partially powered off, so now there is extra latency stemming from
> poking it back up, other migration costs aside.
>
> I noticed this a few years back when looking at postgres -- both the
> server and pgbench would walk around everywhere, reducing perf. I
> checked that this reproduces on fresh main. The box at hand has 2
> sockets * 10 cores * 2 threads.
>
> I *suspect* this is adequately modeled with a microbenchmark
> https://github.com/antonblanchard/will-it-scale/ named
> context_switch1_processes -- it too experiences all-machine walk
> unless explicitly bound (pass -n to *not* bind it). I verified they
> walk all around on 7.1 as well, but I don't know if postgres also
> would.
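>
> If memory serves, the benchmark boils down to two processes
> ping-ponging a byte over a pair of pipes, forcing a context switch per
> iteration. A rough self-contained sketch of the idea (mine, not the
> will-it-scale source; the iteration count is arbitrary):
>
> #include <err.h>
> #include <unistd.h>
>
> int
> main(void)
> {
>         int ping[2], pong[2];
>         char c = 0;
>         long i, iters = 1000000;
>
>         if (pipe(ping) == -1 || pipe(pong) == -1)
>                 err(1, "pipe");
>         switch (fork()) {
>         case -1:
>                 err(1, "fork");
>         case 0:
>                 /* child: wait for a byte, bounce it back */
>                 for (i = 0; i < iters; i++) {
>                         if (read(ping[0], &c, 1) != 1 ||
>                             write(pong[1], &c, 1) != 1)
>                                 err(1, "child i/o");
>                 }
>                 _exit(0);
>         default:
>                 /* parent: send a byte, wait for it to come back */
>                 for (i = 0; i < iters; i++) {
>                         if (write(ping[1], &c, 1) != 1 ||
>                             read(pong[0], &c, 1) != 1)
>                                 err(1, "parent i/o");
>                 }
>         }
>         return (0);
> }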
>
> how to bench:
> su - postgres
> /usr/local/bin/pg_ctl -D /var/db/postgres/data15 -l logfile start
> pgbench -i -s 10
> pgbench -M prepared -S -T 800000 -c 1 -j 1 -P1 postgres
>
> ... and you are in.
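>
> For comparison, placement can be forced with cpuset(1); the set is
> inherited by children, so binding the server at startup also covers
> its backends. Something like this (cpu lists picked arbitrarily):
> cpuset -l 0-3 /usr/local/bin/pg_ctl -D /var/db/postgres/data15 -l logfile start
> cpuset -l 4 pgbench -M prepared -S -T 800000 -c 1 -j 1 -P1 postgres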
>
> 2. unfairness when oversubscribing with cpu hogs
>
> Steve Kargl claims he has reported this one numerous times since the
> early days of ULE; I confirmed it was a problem on 7.1 and is still a
> problem today.
>
> Take an 8 core vm (making sure its vcpus are pinned to distinct cores
> on the host).
>
> I'm going to copy paste my other message here:
> I wrote a cpu burning program (memset 1 MB in a loop, with enough
> iterations to take ~20 seconds on its own).
>
> I booted an 8 core bhyve vm, where I made sure to cpuset it to 8
> distinct cores.
>
> The test runs *9* workers, here is a sample run:
> [paste]
> 4bsd:
>        23.18 real        20.81 user         0.00 sys
>        23.26 real        20.81 user         0.00 sys
>        23.30 real        20.81 user         0.00 sys
>        23.34 real        20.82 user         0.00 sys
>        23.41 real        20.81 user         0.00 sys
>        23.41 real        20.80 user         0.00 sys
>        23.42 real        20.80 user         0.00 sys
>        23.53 real        20.81 user         0.00 sys
>        23.60 real        20.80 user         0.00 sys
> 187.31s user 0.02s system 793% cpu 23.606 total
>
> ule:
>        20.67 real        20.04 user         0.00 sys
>        20.97 real        20.00 user         0.00 sys
>        21.45 real        20.29 user         0.00 sys
>        21.51 real        20.22 user         0.00 sys
>        22.77 real        20.04 user         0.00 sys
>        22.78 real        20.26 user         0.00 sys
>        23.42 real        20.04 user         0.00 sys
>        24.07 real        20.30 user         0.00 sys
>        24.46 real        20.16 user         0.00 sys
> 181.41s user 0.07s system 741% cpu 24.465 total
> [/paste]
>
> While ule spends fewer *cycles*, it takes more real time and the
> per-worker spread is much bigger, which is *probably* bad.
>
> you can repro with:
> https://people.freebsd.org/~mjg/.junk/cpuburner1.c
> cc -O0 -o cpuburner1 cpuburner1.c
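>
> In case the link goes stale, the burner is roughly this (a sketch
> matching the description above, not necessarily the exact
> cpuburner1.c; it takes the buffer size and iteration count as
> arguments):
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> int
> main(int argc, char **argv)
> {
>         char *buf;
>         size_t size;
>         long i, iters;
>
>         if (argc != 3) {
>                 fprintf(stderr, "usage: %s size iterations\n", argv[0]);
>                 exit(1);
>         }
>         size = strtoul(argv[1], NULL, 10);
>         iters = strtol(argv[2], NULL, 10);
>         if ((buf = malloc(size)) == NULL) {
>                 fprintf(stderr, "malloc failed\n");
>                 exit(1);
>         }
>         /* burn cpu: rewrite the buffer over and over */
>         for (i = 0; i < iters; i++)
>                 memset(buf, i & 0xff, size);
>         free(buf);
>         return (0);
> }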
>
> and a magic script:
> #!/bin/sh
>
> ins=$1
>
> shift
>
> while [ $ins -ne 0 ]; do
>         time ./cpuburner1 $1 $2 &
>         ins=$((ins-1))
> done
>
> wait
>
> run like this, with the first number being the worker count and the
> last one picked so that a single run takes 20-ish seconds on your cpu:
> sh burn.sh 9 1048576 500000
>
> 3. threads struggling to get back on cpu against nice -n 20 hogs
>
> This acutely affects buildkernel.
>
> I once more played around; the bug was already there in 7.1, extending
> total buildkernel time from ~4 minutes to 30.
>
> The problem is introduced with the machinery attempting to provide
> fairness for pri <= PRI_MAX_BATCH. I verified that by straight up
> removing all of it. Then buildkernel managed to finish in sensible
> time, but the cpu hogs were overly negatively affected -- they got
> little cpu time and it was very unfairly distributed between them. The
> key point though is that buildkernel *can* stay close to its base time.
>
> I had seen the patch from https://reviews.freebsd.org/D15985 , it does
> not fix the problem but it does alleviate it to some extent. It is
> weirdly hacky and seems to be targeting just the testcase you had
> instead of the more general problem.
>
> I applied it to a 2018-ish tree so that there are no woes from rebasing.
> stock:          290.95 real 2048.22 user 247.967 sys
> stock+hogs:     883.81 real 2111.34 user 189.42 sys
> patched+hogs:   460.84 real 2055.63 user 232.00 sys
>
> Interestingly, the stock kernel from that period is less affected by
> the general problem, but it is still pretty bad. With the patch things
> improve markedly, but there is still a ~50% increase in real time,
> which is way too much for being paired against -n 20.
>
> https://people.freebsd.org/~mjg/.junk/cpuburner2.c
>
> magic script:
> #!/bin/sh
>
> workers=$1
> n=$2
> size=$3
> bkw=$4
>
> echo workers $workers nice $n buildkernel $bkw
>
> shift
>
> while [ $workers -ne 0 ]; do
>         time nice -n $n ./cpuburner $size &
>         workers=$((workers-1))
> done
>
> time make -C /usr/src -ssss -j $bkw buildkernel > /dev/null
>
> # XXX webdev-style
> pkill cpuburner
>
> wait
>
> sample use: time sh burn+bk.sh 8 20 1048576 8
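>
> For a baseline to compare against, the same buildkernel can be run
> without the hogs:
> time make -C /usr/src -s -j 8 buildkernel > /dev/null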
>
> I figured there would be a regression test suite available, with tests
> checking what happens for known cases with possibly contradictory
> requirements. Got nothing; instead I found people use hackbench (:S)
> or just a workload.
>
> All that said, I'm buggering off the subject. My interest in it was
> limited to the nice problem, since I have pretty good reasons to
> suspect this is what is causing pathological total real time instances
> for package builds.
>

Do you still plan to do anything here? The 14.0 schedule has been
posted and it starts with this:
 head slush/KBI freeze:   April 25, 2023
 [... ALPHA builds ...]:  TBD (as-needed)
 stable/14 branch:        May 12, 2023
 releng/14.0 branch:      May 26, 2023
 BETA1 build starts:      May 26, 2023

iow there is not much time to make any fixes for the release.

That said, I had another look at your patch. It aged out of simple
forward porting:
commit 686bcb5c14aba6e67524be84e125bfdd3514db9e
Author: Jeff Roberson <jeff@FreeBSD.org>
Date:   Sun Dec 15 21:26:50 2019 +0000

    schedlock 4/4

and a follow up fixup:
commit 6d3f74a14a83b867c273c6be2599da182a9b9ec7
Author: Mark Johnston <markj@FreeBSD.org>
Date:   Thu Jul 14 10:21:28 2022 -0400

    sched_ule: Fix racy loads of pc_curthread

which whacked access to data your patch relies on. Not saying this
can't be augmented, just that it is extra churn.

I also looked into why there is still tons of cpu time for the niced
stuff and found the mechanism mostly does not work.

Here are some results from FreeBSD 7.1 (2009 vintage) running full
time cpu hogs with various nice levels:

prio  10 ops      12863
prio   0 ops      12846
prio  20 ops      12794

prio   0 ops      24949
prio  20 ops      13551

prio   0 ops      11327
prio -20 ops      19474
prio  20 ops       7575

As you can see, that release had roughly a 66/33 split between nice 0
and nice 20 (24949 vs 13551 ops), which is already pretty bad, and
funnily enough it gave equal treatment to 0 vs 10 vs 20.

Things further changed down the road and on fresh main it looks like this:

prio  10 ops       4390
prio   0 ops       4963
prio  20 ops       3941

prio   0 ops       7235
prio  20 ops       6059

prio -20 ops       7225
prio   0 ops       3763
prio  20 ops       2547

as in nice 20 is penalized even less vs 0 than it was on 7.1.

tl;dr things were already bad on 7.1.

to repro:
fetch https://people.freebsd.org/~mjg/.junk/cpuburner-prio.c
fetch https://people.freebsd.org/~mjg/.junk/script3.sh
cpuset -l 2 sh script3.sh
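
Roughly what the pair amounts to (a sketch of the idea, not the actual
cpuburner-prio.c or script3.sh): each burner renices itself to the
level given on the command line, hammers memset for a fixed window and
reports how many iterations it got through, while the script just
starts a few of them at different nice levels and waits. The burner
side, more or less:

#include <sys/time.h>
#include <sys/resource.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main(int argc, char **argv)
{
        static char buf[1024 * 1024];
        struct timespec start, now;
        long ops = 0;
        int prio;

        if (argc != 2)
                errx(1, "usage: %s nice-level", argv[0]);
        prio = atoi(argv[1]);
        /* negative levels need root */
        if (setpriority(PRIO_PROCESS, 0, prio) == -1)
                err(1, "setpriority");
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {
                memset(buf, ops & 0xff, sizeof(buf));
                ops++;
                clock_gettime(CLOCK_MONOTONIC, &now);
        } while (now.tv_sec - start.tv_sec < 20);
        printf("prio %3d ops %10ld\n", prio, ops);
        return (0);
}

started along the lines of:
./cpuburner-prio 0 & ./cpuburner-prio 20 & wait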

-- 
Mateusz Guzik <mjguzik gmail.com>


