From nobody Mon Mar 27 12:13:44 2023 X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4PlWv008qQz42C12 for ; Mon, 27 Mar 2023 12:13:48 +0000 (UTC) (envelope-from mjguzik@gmail.com) Received: from mail-ot1-x32d.google.com (mail-ot1-x32d.google.com [IPv6:2607:f8b0:4864:20::32d]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4PlWty6bVBz3hld; Mon, 27 Mar 2023 12:13:46 +0000 (UTC) (envelope-from mjguzik@gmail.com) Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b=VDVMJmC6; spf=pass (mx1.freebsd.org: domain of mjguzik@gmail.com designates 2607:f8b0:4864:20::32d as permitted sender) smtp.mailfrom=mjguzik@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-ot1-x32d.google.com with SMTP id r40-20020a05683044a800b006a14270bc7eso1078576otv.6; Mon, 27 Mar 2023 05:13:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; t=1679919225; h=cc:to:subject:message-id:date:from:references:in-reply-to :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=QO+Ser1AXpRsgldlo2kK9vLuNgEyHN4IF1pXN2VzGyU=; b=VDVMJmC6E2JrXHRsmpGvMJs4sEgIjIQRTdVuFRu5qKqJRAL6XmhrASzeG1OMDZuV/3 rXQjwSsmHLwKvDE4wkD1UiNa+qHKj1gm6G46bLXvInlBtKWIuRsKpjxWXbxA6jMSXx2h ZYjx7a7qT4cYPdpydywz+pr7KQFGGHD3olnr2IdAd5iL6i77ArtPmJ6Bzh6LW3DfmbNI kF1LeDkRjEXH4RBC+5+nApPxIU8en4A033noBSCN1s5/gLMqIh76gpqweZ9jaLTFrc6p WbNDOJUWS0LFW6jzWZZAa/JyZci8npGVPtBswo96c/m+QfzZWafTROk8JAVWJsM4kNlL 4hsQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1679919225; h=cc:to:subject:message-id:date:from:references:in-reply-to :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=QO+Ser1AXpRsgldlo2kK9vLuNgEyHN4IF1pXN2VzGyU=; b=iQO1HhmsXXEwD6Z4kWlJhPe1jaBl1Utxxmt6cijFTUvJx9tfc0CsMDeEY8D9yiC7kN 9GbGPKZeyUG5AJrSBeS9OnkOP8GLwMyI7gxWqwalXG489CjFc5widKNtmPMkOO7MMsxO 2Cn2OMMO7Q1R/mrVSiPYl+thHNHocK5uR00svFjFb31Lxvv05SciUXLsOXfaVrOafVMN YJ0SPXgOXSTaUf+SGKT8VMlC+QojB23Lv1rslPUtqglgY3NRSAMHsV0NCgQMCyilBfjd znicRubrlSLkyKimTGEiIbqWnZHcZE40WjknUopHP+Nmwfg9Z3k7p1wZHhr75wJm3mu9 d4cQ== X-Gm-Message-State: AO0yUKU84V4xLapRd8/h7UfcL8aSfvO0GULqxWxG1ywtBop7Tmts47jL 1NxkLiyvsPkXUXWvoVwfIvxdHT1LkA5vOuUEpl8vb+GD X-Google-Smtp-Source: AK7set/pGbosf55toylK3a9gls3v682ZkjbU6snZUeTglzW/+Dc5NCv3mC27Rley/BMCgSiGikBWe/zrDOuFlVwmMp4= X-Received: by 2002:a9d:63c5:0:b0:698:f988:7c30 with SMTP id e5-20020a9d63c5000000b00698f9887c30mr3777695otl.2.1679919225228; Mon, 27 Mar 2023 05:13:45 -0700 (PDT) List-Id: Technical discussions relating to FreeBSD List-Archive: https://lists.freebsd.org/archives/freebsd-hackers List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-hackers@freebsd.org MIME-Version: 1.0 Received: by 2002:a8a:12c2:0:b0:49c:b071:b1e3 with HTTP; Mon, 27 Mar 2023 05:13:44 -0700 (PDT) In-Reply-To: References: <8173cc7e-e934-dd5c-312a-1dfa886941aa@FreeBSD.org> <8cfdb951-9b1f-ecd3-2291-7a528e1b042c@m5p.com> From: Mateusz Guzik Date: Mon, 27 Mar 2023 14:13:44 +0200 Message-ID: Subject: Re: Periodic rant about SCHED_ULE To: sgk@troutmask.apl.washington.edu Cc: Matthias Andree , freebsd-hackers@freebsd.org Content-Type: text/plain; charset="UTF-8" X-Spamd-Result: default: False [-2.12 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; NEURAL_SPAM_SHORT(0.88)[0.876]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36:c]; MIME_GOOD(-0.10)[text/plain]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::32d:from]; FROM_EQ_ENVFROM(0.00)[]; MLMMJ_DEST(0.00)[freebsd-hackers@freebsd.org]; ARC_NA(0.00)[]; FREEMAIL_ENVFROM(0.00)[gmail.com]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; DKIM_TRACE(0.00)[gmail.com:+]; RCVD_COUNT_THREE(0.00)[3]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCPT_COUNT_THREE(0.00)[3]; FROM_HAS_DN(0.00)[]; MID_RHS_MATCH_FROMTLD(0.00)[]; FREEMAIL_FROM(0.00)[gmail.com]; RCVD_TLS_LAST(0.00)[]; TO_DN_SOME(0.00)[]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim] X-Rspamd-Queue-Id: 4PlWty6bVBz3hld X-Spamd-Bar: -- X-ThisMailContainsUnwantedMimeParts: N On 3/25/23, Mateusz Guzik wrote: > On 3/23/23, Mateusz Guzik wrote: >> On 3/22/23, Mateusz Guzik wrote: >>> On 3/22/23, Steve Kargl wrote: >>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote: >>>>> >>>>> Yes, there are reports that FreeBSD is not responsive by default - but >>>>> this >>>>> may make it get overall better throughput at the expense of >>>>> responsiveness, >>>>> because it might be doing fewer context switches. So just complaining >>>>> about >>>>> a longer buildworld without seeing how much dnetc did in the same >>>>> wallclock >>>>> time period is useless. Periodic rant's don't fix this lack of >>>>> information. >>>>> >>>> >>>> I reported the issue with ULE some 15 to 20 years ago. >>>> I gave up reporting the issue. The individuals with the >>>> requisite skills to hack on ULE did not; and yes, I lack >>>> those skills. The path of least resistance is to use >>>> 4BSD. >>>> >>>> % cat a.f90 >>>> ! >>>> ! Silly numerically intensive computation. >>>> ! >>>> program foo >>>> implicit none >>>> integer, parameter :: m = 200, n = 1000, dp = kind(1.d0) >>>> integer i >>>> real(dp) x >>>> real(dp), allocatable :: a(:,:), b(:,:), c(:,:) >>>> call random_init(.true., .true.) >>>> allocate(a(n,n), b(n,n)) >>>> do i = 1, m >>>> call random_number(a) >>>> call random_number(b) >>>> c = matmul(a,b) >>>> x = sum(c) >>>> if (x < 0) stop 'Whoops' >>>> end do >>>> end program foo >>>> % gfortran11 -o z -O3 -march=native a.f90 >>>> % time ./z >>>> 42.16 real 42.04 user 0.09 sys >>>> % cat foo >>>> #! /bin/csh >>>> # >>>> # Launch NCPU+1 images with a 1 second delay >>>> # >>>> foreach i (1 2 3 4 5 6 7 8 9) >>>> ./z & >>>> sleep 1 >>>> end >>>> % ./foo >>>> >>>> In another xterm, you can watch the 9 images. >>>> >>>> % top >>>> st pid: 1709; load averages: 4.90, 1.61, 0.79 up 0+00:56:46 >>>> 11:43:01 >>>> 74 processes: 10 running, 64 sleeping >>>> CPU: 99.9% user, 0.0% nice, 0.1% system, 0.0% interrupt, 0.0% idle >>>> Mem: 369M Active, 187M Inact, 240K Laundry, 889M Wired, 546M Buf, 14G >>>> Free >>>> Swap: 16G Total, 16G Free >>>> >>>> PID USERNAME THR PRI NICE SIZE RES STATE C TIME CPU >>>> COMMAND >>>> 1699 kargl 1 56 0 68M 35M RUN 3 0:41 92.60% >>>> z >>>> 1701 kargl 1 56 0 68M 35M RUN 0 0:41 92.33% >>>> z >>>> 1689 kargl 1 56 0 68M 35M CPU5 5 0:47 91.63% >>>> z >>>> 1691 kargl 1 56 0 68M 35M CPU0 0 0:45 89.91% >>>> z >>>> 1695 kargl 1 56 0 68M 35M CPU2 2 0:43 88.56% >>>> z >>>> 1697 kargl 1 56 0 68M 35M CPU6 6 0:42 88.48% >>>> z >>>> 1705 kargl 1 55 0 68M 35M CPU1 1 0:39 88.12% >>>> z >>>> 1703 kargl 1 56 0 68M 35M CPU4 4 0:39 87.86% >>>> z >>>> 1693 kargl 1 56 0 68M 35M CPU7 7 0:45 78.12% >>>> z >>>> >>>> With 4BSD, you see the ./z's with 80% or greater CPU. All the ./z's >>>> exit >>>> after 55-ish seconds. If you try this experiment on ULE, you'll get >>>> NCPU-1 >>>> ./z's with nearly 99% CPU and 2 ./z's with something like 45-ish% as >>>> the >>>> two images ping-pong on one cpu. Back when I was testing ULE vs 4BSD, >>>> this was/is due to ULE's cpu affinity where processes never migrate to >>>> another cpu. Admittedly, this was several years ago. Maybe ULE has >>>> gotten better, but George's rant seems to suggest otherwise. >>>> >>> >>> While I have not tried openmpi yet, I can confirm there is a problem >>> here, but the situtation is not as clear cut as one might think. >>> >>> sched_4bsd *round robins* all workers across all CPUs, which comes at >>> a performance *hit* compared to ule if number of workers is <= CPU >>> count -- here ule maintains affinity, avoiding cache busting. If you >>> slap in $cpu_count + 1 workers, 4bsd keeps the round robin equally >>> penalizing everyone, while ule mostly penalizes a select victim. By >>> the end of it, you get lower total cpu time spent, but higher total >>> real time. See below for an example. >>> >>> Two massive problems with 4bsd, apart from mandatory round robin which >>> also happens to help by accident: >>> 1. it has one *global* lock, meaning the scheduler itself just does >>> not scale and this is visible at modest contemporary scales >>> 2. it does not understand topology -- no accounting done for ht nor >>> numa, but i concede the latter wont be a factor for most people >>> >>> That said, ULE definitely has performance bugs which need to be fixed. >>> At least for the case below, 4BSD just "lucked" into sucking less >>> simply because it is so primitive. >>> >>> I wrote a cpu burning program (memset 1 MB in a loop, with enough >>> iterations to take ~20 seconds on its own). >>> >>> I booted an 8 core bhyve vm, where I made sure to cpuset is to 8 >>> distinct >>> cores. >>> >>> The test runs *9* workers, here is a sample run: >>> >>> 4bsd: >>> 23.18 real 20.81 user 0.00 sys >>> 23.26 real 20.81 user 0.00 sys >>> 23.30 real 20.81 user 0.00 sys >>> 23.34 real 20.82 user 0.00 sys >>> 23.41 real 20.81 user 0.00 sys >>> 23.41 real 20.80 user 0.00 sys >>> 23.42 real 20.80 user 0.00 sys >>> 23.53 real 20.81 user 0.00 sys >>> 23.60 real 20.80 user 0.00 sys >>> 187.31s user 0.02s system 793% cpu 23.606 total >>> >>> ule: >>> 20.67 real 20.04 user 0.00 sys >>> 20.97 real 20.00 user 0.00 sys >>> 21.45 real 20.29 user 0.00 sys >>> 21.51 real 20.22 user 0.00 sys >>> 22.77 real 20.04 user 0.00 sys >>> 22.78 real 20.26 user 0.00 sys >>> 23.42 real 20.04 user 0.00 sys >>> 24.07 real 20.30 user 0.00 sys >>> 24.46 real 20.16 user 0.00 sys >>> 181.41s user 0.07s system 741% cpu 24.465 total >>> >>> It reliably uses 187s user time on 4BSD and 181s on ULE. At the same >>> time it also reliably has massive time imblance between >>> fastest/slowest in terms of total real time between workers *and* ULE >>> reliably uses more total real time to finish the entire thing. >>> >>> iow this is a tradeoff, but most likely a bad one >>> >> >> So I also ran the following setup: 8 core vm doing -j 8 buildkernel, >> while 8 nice -n 20 processes are cpu-bound. After the build ends >> workers report how many ops they did in that time. >> >> ule is way off the reservation here. >> >> unimpeded build takes: 441.39 real 3135.63 user 266.92, similar on >> both schedulers >> >> with cpu hoggers: >> 4bsd: 657.22 real 3152.02 user 253.30 sys [+49%] >> ule: 4427.69 real 3225.33 user 194.86 sys [+903%] >> >> ule spends less time in the kernel, but the time blows up -- over 10 x >> the base line. >> this is clearly a total non-starter. >> >> full stats: >> >> 4bsd: >> hogger pid/ops >> 58315: 5546013 >> 58322: 5557294 >> 58321: 5545052 >> 58313: 5546347 >> 58318: 5537874 >> 58317: 5555303 >> 58323: 5545116 >> 58320: 5548530 >> >> runtimes: >> >> 657.23 real 230.02 user 0.02 sys >> 657.23 real 229.83 user 0.00 sys >> 657.23 real 230.50 user 0.00 sys >> 657.23 real 230.53 user 0.00 sys >> 657.23 real 230.14 user 0.01 sys >> 657.23 real 230.19 user 0.00 sys >> 657.23 real 230.09 user 0.00 sys >> 657.23 real 230.10 user 0.03 sys >> >> kernel build: >> 657.22 real 3152.02 user 253.30 sys >> >> ule: >> hogger pid/ops >> 77794: 95916836 >> 77792: 95909794 >> 77789: 96454064 >> 77796: 95813778 >> 77791: 96728121 >> 77795: 96678291 >> 77798: 97258060 >> 77797: 96347984 >> >> 4427.70 real 4001.94 user 0.10 sys >> 4427.70 real 3980.68 user 0.16 sys >> 4427.70 real 3973.96 user 0.10 sys >> 4427.70 real 3980.11 user 0.13 sys >> 4427.70 real 4012.32 user 0.07 sys >> 4427.70 real 4008.79 user 0.12 sys >> 4427.70 real 4034.77 user 0.09 sys >> 4427.70 real 3995.40 user 0.08 sys >> >> kernel build: >> 4427.69 real 3225.33 user 194.86 sys >> > > added a probe to runq_add*: > diff --git a/sys/kern/kern_switch.c b/sys/kern/kern_switch.c > index aee0c9cbd1ae..db73226547c5 100644 > --- a/sys/kern/kern_switch.c > +++ b/sys/kern/kern_switch.c > @@ -357,6 +357,9 @@ runq_setbit(struct runq *rq, int pri) > rqb->rqb_bits[RQB_WORD(pri)] |= RQB_BIT(pri); > } > > +SDT_PROVIDER_DECLARE(sched); > +SDT_PROBE_DEFINE3(sched, , , runq_add, "struct thread *", "int", "int"); > + > /* > * Add the thread to the queue specified by its priority, and set the > * corresponding status bit. > @@ -378,6 +381,7 @@ runq_add(struct runq *rq, struct thread *td, int flags) > } else { > TAILQ_INSERT_TAIL(rqh, td, td_runq); > } > + SDT_PROBE3(sched, , , runq_add, td, flags, pri); > } > > void > @@ -396,6 +400,7 @@ runq_add_pri(struct runq *rq, struct thread *td, > u_char pri, int flags) > } else { > TAILQ_INSERT_TAIL(rqh, td, td_runq); > } > + SDT_PROBE3(sched, , , runq_add, td, flags, pri); > } > /* > * Return true if there are runnable processes of any priority on the run > > and used this one-liner: > > dtrace -b 16M -x aggsize=32M -x dynvarsize=32M -n 'sched:::runq_add > /args[0]->td_ucred->cr_uid == 2000/ { self->runq_t = timestamp; } > sched:::on-cpu /self->runq_t/ { @runqlat["runq", execname == > "cpuburner-long" ? "cpuburner" : "rest"] = quantize(timestamp - > self->runq_t); self->runq_t = 0; } sched:::on-cpu > /curthread->td_ucred->cr_uid == 2000/ { self->oncpu_t = timestamp; } > sched:::off-cpu /self->oncpu_t/ { @oncpu["oncpu", execname == > "cpuburner-long" ? "cpuburner" : "rest"] = quantize(timestamp - > self->oncpu_t); } tick-300s { exit(0); } > > caped at 5 minutes because dtrace starts running into aggregation drops > > Key takeaway: 4bsd let's the cpu hog stay on cpu for longer than ule, > but when it is taken off, it waits a long time to get back. in > contrast, it gets back on very quick with ule and it is everyone else > waiting big time. > > times in nanoseconds > > 4bsd: > runq cpuburner > value ------------- Distribution ------------- count > 2048 | 0 > 4096 | 2 > 8192 |@@@@@@@@@@ 4343 > 16384 |@@@@@ 2159 > 32768 |@ 363 > 65536 |@@@ 1359 > 131072 | 215 > 262144 | 129 > 524288 | 101 > 1048576 | 132 > 2097152 | 185 > 4194304 |@ 390 > 8388608 |@ 318 > 16777216 |@ 398 > 33554432 |@@ 838 > 67108864 |@@@@@@@ 3025 > 134217728 |@@@@@ 2474 > 268435456 |@@@ 1552 > 536870912 | 153 > 1073741824 | 0 > > runq rest > value ------------- Distribution ------------- count > 2048 | 0 > 4096 | 637 > 8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 364832 > 16384 |@@@ 52982 > 32768 |@ 11613 > 65536 |@@ 34286 > 131072 |@@@ 39734 > 262144 |@@ 23261 > 524288 |@ 21985 > 1048576 |@ 18999 > 2097152 |@ 10789 > 4194304 | 6239 > 8388608 | 4725 > 16777216 | 4598 > 33554432 | 4050 > 67108864 | 5857 > 134217728 | 3979 > 268435456 | 2024 > 536870912 | 2011 > 1073741824 | 1105 > 2147483648 | 841 > 4294967296 | 519 > 8589934592 | 372 > 17179869184 | 133 > 34359738368 | 37 > 68719476736 | 30 > 137438953472 | 1 > 274877906944 | 0 > > oncpu cpuburner > value ------------- Distribution ------------- count > 2048 | 0 > 4096 | 20 > 8192 | 10 > 16384 | 19 > 32768 | 161 > 65536 | 137 > 131072 | 77 > 262144 | 104 > 524288 | 147 > 1048576 |@ 301 > 2097152 |@ 482 > 4194304 |@@@ 1178 > 8388608 |@@@@@@@@@@@@@@@ 6971 > 16777216 |@ 474 > 33554432 |@ 669 > 67108864 |@@@@@@@@@@@@@@@@ 7299 > 134217728 | 14 > 268435456 | 38 > 536870912 | 38 > 1073741824 | 2 > 2147483648 | 2 > 4294967296 | 0 > > oncpu rest > value ------------- Distribution ------------- count > 512 | 0 > 1024 | 102 > 2048 |@@@@@@@@@@@ 146555 > 4096 |@@ 23373 > 8192 |@@@ 45165 > 16384 |@ 14531 > 32768 |@@@@@ 67664 > 65536 |@@@@@@ 80155 > 131072 |@@@@@ 64609 > 262144 |@@ 21509 > 524288 |@@ 23393 > 1048576 |@@ 21058 > 2097152 |@ 7854 > 4194304 |@ 8788 > 8388608 |@ 20242 > 16777216 | 2352 > 33554432 | 2084 > 67108864 | 4388 > 134217728 | 1123 > 268435456 | 755 > 536870912 | 135 > 1073741824 | 2 > 2147483648 | 0 > > ule: > runq cpuburner > value ------------- Distribution ------------- count > 2048 | 0 > 4096 | 2 > 8192 |@@@@@@@@@@@@@@@@@@@@@@@@ 37193 > 16384 |@@@@@ 8612 > 32768 |@ 1481 > 65536 |@@@@@ 8430 > 131072 |@@ 3102 > 262144 |@ 938 > 524288 | 457 > 1048576 |@ 2063 > 2097152 | 257 > 4194304 | 41 > 8388608 | 428 > 16777216 | 12 > 33554432 | 1 > 67108864 | 2 > 134217728 | 0 > > runq rest > value ------------- Distribution ------------- count > 4096 | 0 > 8192 |@ 649 > 16384 |@@@@ 1953 > es > 32768 |@ 539 > 65536 |@@@@@ 2369 > 131072 |@@@@ 1904 > 262144 |@ 471 > 524288 | 131 > 1048576 |@ 458 > 2097152 |@ 443 > 4194304 |@ 547 > 8388608 |@ 632 > 16777216 |@@ 984 > 33554432 |@@@ 1606 > 67108864 |@@@@@@@@ 3894 > 134217728 |@ 511 > 268435456 |@ 489 > 536870912 |@ 445 > 1073741824 |@ 475 > 2147483648 |@ 314 > 4294967296 | 188 > 8589934592 | 136 > 17179869184 | 100 > 34359738368 | 81 > 68719476736 | 39 > 137438953472 | 3 > 274877906944 | 0 > > oncpu cpuburner > value ------------- Distribution ------------- count > 2048 | 0 > 4096 | 13 > 8192 | 311 > 16384 |@@ 2369 > 32768 |@ 853 > 65536 |@ 2302 > 131072 |@ 1302 > 262144 | 518 > 524288 |@ 872 > 1048576 |@ 1172 > 2097152 |@ 1435 > 4194304 |@@ 2706 > 8388608 |@@@@@@@@@@@@@@@@@@@ 29669 > 16777216 |@@ 2733 > 33554432 |@@ 3910 > 67108864 |@@@@@@ 9991 > 134217728 |@ 1800 > 268435456 |@ 840 > 536870912 | 200 > 1073741824 | 15 > 2147483648 | 7 > 4294967296 | 1 > 8589934592 | 0 > > oncpu rest > value ------------- Distribution ------------- count > 512 | 0 > 1024 |@ 419 > 2048 |@@@@@@@@@ 5730 > 4096 |@ 778 > 8192 |@@@@@ 3233 > 16384 |@@ 1400 > 32768 |@@@@ 2739 > 65536 |@@@@@@@@@@@@ 7550 > 131072 |@@@ 1720 > 262144 | 151 > 524288 | 112 > 1048576 |@@@ 1688 > 2097152 | 174 > 4194304 | 86 > 8388608 |@ 411 > 16777216 | 7 > 33554432 | 3 > 67108864 | 0 > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff while niced to 20 vs make -j buildkernel I had a little more look here, slapped in some hacks as a POC and got an improvement from 67 minutes above to 21. Hacks are: 1. limit hog timeslice to 1 tick so that is more eager to bail 2. always preempt if pri < cpri So far I can confidently state the general problem: ULE penalizes non-cpu hogs for blocking (even if it is not their fault, so to speak) and that bumps their prio past preemption threshold, at which point they can't preempt said hogs (despite hogs having a higher priority). At the same time hogs use their full time slices, while non-hogs get off cpu very early and have to wait a long time to get back on, at least in part due to inability to preempt said hogs. As I mentioned elsewhere in the thread, interactivity scoring takes "voluntary off cpu time" into account. As literally anything but getting preempted counts as "voluntary sleep", workers get shafted for going off cpu while waiting on locks in the kernel. If I/O needs to happen and the thread waits for the result, it most likely does it early in its timeslice and once it's all ready it waits for background hogs to get off cpu -- it can't preempt them. All that said: 1. "interactivity scoring" (see sched_interact_score) I don't know if it makes any sense to begin with. Even if it does, it counts stuff it should not by not differentiating between deliberately going off cpu (e.g., actual sleep) vs just waiting for a file being read. Imagine firefox reading a file from disk and being considered less interactive for it. I don't have a solution for this problem. I *suspect* the way to go would be to explicitly mark xorg/wayland/whatever as "interactive" and have it inherited by its offspring. At the same time it should not follow to stuff spawned in terminals. Not claiming this is perfect, but it does eliminate the guessing game. Even so, 4BSD does not have any mechanism of the sort and reportedly remains usable on a desktop just by providing some degree of fairness. Given that, I suspect the short term solution would whack said scoring altogether and focus on fairness (see below). 2. fairness As explained above doing any offcpu-time inducing work instantly shafts threads versus cpu hogs, even if said hogs are niced way above them. Here I *suspect* position to add in the runqueue should be related to how much slice was left when the thread went off cpu, while making sure that hogs get to run eventually. Not that I have a nice way of implementing this -- maybe a separate queue for known hogs and picking them every n turns or similar. -- Mateusz Guzik