Date: Mon, 1 May 2023 03:33:18 +0200 From: Mateusz Guzik <mjguzik@gmail.com> To: Tomoaki AOKI <junchoon@dec.sakura.ne.jp> Cc: freebsd-hackers@freebsd.org Subject: Re: Periodic rant about SCHED_ULE Message-ID: <CAGudoHGA1JkoMoChhNi7kTBu-%2BeUOzcFViF_P0ZC907hkvL5gg@mail.gmail.com> In-Reply-To: <CAGudoHHk7H=FwtmPztmqfrX3uq=ff-ohK4tZMTJod7CfqVTwig@mail.gmail.com> References: <a401e51a-250a-64a0-15cb-ff79bcefbf94@m5p.com> <8173cc7e-e934-dd5c-312a-1dfa886941aa@FreeBSD.org> <8cfdb951-9b1f-ecd3-2291-7a528e1b042c@m5p.com> <c3f5f667-ba0b-c40c-b8a6-19d1c9c63c5f@FreeBSD.org> <ZBtRJhNHluj5Nzyk@troutmask.apl.washington.edu> <CAGudoHEj%2BkoaYhkjzDE5KX9OsCno=X5M_E3z9uwg6Pg7dtqTsA@mail.gmail.com> <CAGudoHHxTT-Cn11zcFB3ZwF76UcRUv=QS28RLgzd=hVehTy0Kg@mail.gmail.com> <CAGudoHGoh30O-3O0jjwevDvP43-ykUt6JUDiwRNW918VZfybhA@mail.gmail.com> <CAGudoHEWfy61XSMhXdYOrKWVotuC0Kc6NSWiaaZCy6aQhbvXoQ@mail.gmail.com> <CAGudoHFPqz_LtsVNnz4P2gyKXz5Z8hU%2Bv6QYGizm4%2BDtZRn8Yg@mail.gmail.com> <20230331215751.166a294f7382c85b545f53a2@dec.sakura.ne.jp> <CAGudoHHk7H=FwtmPztmqfrX3uq=ff-ohK4tZMTJod7CfqVTwig@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
On 5/1/23, Mateusz Guzik <mjguzik@gmail.com> wrote: > On 3/31/23, Tomoaki AOKI <junchoon@dec.sakura.ne.jp> wrote: >> On Mon, 27 Mar 2023 16:47:04 +0200 >> Mateusz Guzik <mjguzik@gmail.com> wrote: >> >>> On 3/27/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >>> > On 3/25/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >> On 3/23/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >>> On 3/22/23, Mateusz Guzik <mjguzik@gmail.com> wrote: >>> >>>> On 3/22/23, Steve Kargl <sgk@troutmask.apl.washington.edu> wrote: >>> >>>>> On Wed, Mar 22, 2023 at 07:31:57PM +0100, Matthias Andree wrote: >>> >>>>>> >> >> (snip) >> >>> > >>> > I repeat the setup: 8 cores, 8 processes doing cpu-bound stuff while >>> > niced to 20 vs make -j buildkernel >>> > >>> > I had a little more look here, slapped in some hacks as a POC and got >>> > an improvement from 67 minutes above to 21. >>> > >>> > Hacks are: >>> > 1. limit hog timeslice to 1 tick so that is more eager to bail >>> > 2. always preempt if pri < cpri >>> > >>> > So far I can confidently state the general problem: ULE penalizes >>> > non-cpu hogs for blocking (even if it is not their fault, so to speak) >>> > and that bumps their prio past preemption threshold, at which point >>> > they can't preempt said hogs (despite hogs having a higher priority). >>> > At the same time hogs use their full time slices, while non-hogs get >>> > off cpu very early and have to wait a long time to get back on, at >>> > least in part due to inability to preempt said hogs. >>> > >>> > As I mentioned elsewhere in the thread, interactivity scoring takes >>> > "voluntary off cpu time" into account. As literally anything but >>> > getting preempted counts as "voluntary sleep", workers get shafted for >>> > going off cpu while waiting on locks in the kernel. >>> > >>> > If I/O needs to happen and the thread waits for the result, it most >>> > likely does it early in its timeslice and once it's all ready it waits >>> > for background hogs to get off cpu -- it can't preempt them. >>> > >>> > All that said: >>> > 1. "interactivity scoring" (see sched_interact_score) >>> > >>> > I don't know if it makes any sense to begin with. Even if it does, it >>> > counts stuff it should not by not differentiating between deliberately >>> > going off cpu (e.g., actual sleep) vs just waiting for a file being >>> > read. Imagine firefox reading a file from disk and being considered >>> > less interactive for it. >>> > >>> > I don't have a solution for this problem. I *suspect* the way to go >>> > would be to explicitly mark xorg/wayland/whatever as "interactive" and >>> > have it inherited by its offspring. At the same time it should not >>> > follow to stuff spawned in terminals. Not claiming this is perfect, >>> > but it does eliminate the guessing game. >>> > >>> > Even so, 4BSD does not have any mechanism of the sort and reportedly >>> > remains usable on a desktop just by providing some degree of fairness. >>> > >>> > Given that, I suspect the short term solution would whack said scoring >>> > altogether and focus on fairness (see below). >>> > >>> > 2. fairness >>> > >>> > As explained above doing any offcpu-time inducing work instantly >>> > shafts threads versus cpu hogs, even if said hogs are niced way above >>> > them. >>> > >>> > Here I *suspect* position to add in the runqueue should be related to >>> > how much slice was left when the thread went off cpu, while making >>> > sure that hogs get to run eventually. Not that I have a nice way of >>> > implementing this -- maybe a separate queue for known hogs and picking >>> > them every n turns or similar. >>> > >>> >>> Aight, now that I had a sober look at the code I think I cracked the >>> case. >>> >>> The runq mechanism used by both 4BSD and ULE provides 64(!) queues, >>> where the priority is divided by said number and that's how you know >>> in which queue to land the thread. >>> >>> When deciding what to run, 4BSD uses runq_choose which iterates all >>> queues from the beginning. This means threads of lower priority keep >>> executing before the rest. In particular cpu hog lands with a high >>> priority, looking worse than make -j 8 buildkernel and only running >>> when there is nothing else ready to get the cpu. While this may sound >>> decent, it is bad -- in principle a steady stream of lower priority >>> threads can starve the hogs indefinitely. >>> >>> The problem was recognized when writing ULE, but improperly fixed -- >>> it ends up distributing all threads within given priority range across >>> the queues and then performing a lookup in a given queue. Here the >>> problem is that while technically everyone does get a chance to run, >>> the threads not using full slices are hosed for the time period as >>> they wait for the hog *a lot*. >>> >>> A hack patch to induce the bogus-but-better 4BSD behavior of draining >>> all runqs before running higher prio threads drops down build time to >>> ~9 minutes, which is shorter than 4BSD. >>> >>> However, the right fix would achieve that *without* introducing >>> starvation potential. >>> >>> I also note the runqs are a massive waste of memory and computing >>> power. I'm going to have to sleep on what to do here. >>> >>> For interested here is the hackery: >>> https://people.freebsd.org/~mjg/.junk/ule-poc-hacks-dont-use.diff >>> >>> sysctl kern.sched.slice_nice=0 >>> sysctl kern.sched.preempt_thresh=400 # arbitrary number higher than any >>> prio >>> >>> -- >>> Mateusz Guzik <mjguzik gmail.com> >> >> Thanks for the patch. >> Applied on top of main, amd64 at commit >> 9d33a9d96f5a2cd88d0955b5b56ef5058b1706c1, setup 2 sysctls as you >> mentioned and tested as below >> >> *Play flac files by multimedia/audacious via audio/virtual_oss >> *Running www/firefox (not touched while testing) >> *Forcibly build lang/rust >> *Play games/aisleriot >> >> at the same time. >> games/aisleriot runs slower than the situation lang/rust is not in >> build, but didn't "freeze" and audacious normally played next music on >> playlist, even on lang/rust is building codes written in rust. >> >> This is GREAT advance! >> Without the patch, compiling rust codes eats up almost 100% of ALL >> cores, and games/aisleriot often FREEZES SEVERAL MINUTES, and >> multimedia/audacious needs to wait for, at worst, next music for FEW >> MINUTES. (Once playback starts, the music is played normally until it >> ends.) >> >> But unfortunately, the patch cannot be applied to stable/13, as some >> prerequisite commits are not MFC'ed. >> Missing commits are at least as below. There should be more, as I >> gave up further tracking and haven't actually merged them to test. >> >> commit 954cffe95de1b9d70ed804daa45b7921f0f5c9da [1] >> ule: Simplistic time-sharing for interrupt threads. >> >> commit fea89a2804ad89f5342268a8546a3f9b515b5e6c [2] >> Add sched_ithread_prio to set the base priority of an interrupt >> thread. >> >> commit 85b46073242d4666e1c9037d52220422449f9584 [3] >> Deduplicate bus_dma bounce code. >> >> >> [1] >> https://cgit.freebsd.org/src/commit/?id=954cffe95de1b9d70ed804daa45b7921f0f5c9da >> >> [2] >> https://cgit.freebsd.org/src/commit/?id=fea89a2804ad89f5342268a8546a3f9b515b5e6c >> >> [3] >> https://cgit.freebsd.org/src/commit/?id=85b46073242d4666e1c9037d52220422449f9584 >> >> -- >> Tomoaki AOKI <junchoon@dec.sakura.ne.jp> >> >> > > Hello everyone. > > I sorted out a patch I consider comittable for the time being. IT IS > NOT A PANACEA by any means, but it does sort out the most acute > problem and should be a win for most people. It also comes with a knob > to turn it off. > > That said, can you test this please: > https://people.freebsd.org/~mjg/ule_pickshort.diff > > works against fresh main. if you are worried about recent zfs woes, > just make sure you don't zpool upgrade and will be fine. > Here is an updated patch: https://people.freebsd.org/~mjg/ule_pickshortv2.diff if you are getting bad results, do: sysctl kern.sched.preempt_bottom=0 and try again. thank you. -- Mateusz Guzik <mjguzik gmail.com>
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAGudoHGA1JkoMoChhNi7kTBu-%2BeUOzcFViF_P0ZC907hkvL5gg>