Date: Wed, 4 Apr 2018 12:39:24 +0200
From: Alban Hertroys <haramrae@gmail.com>
To: Peter <pmc@citylink.dinoex.sub.org>
Cc: freebsd-stable@FreeBSD.ORG
Subject: Re: kern.sched.quantum: Creepy, sadistic scheduler
Message-ID: <9FDC510B-49D0-4722-B695-6CD38CA20D4A@gmail.com>
In-Reply-To: <pa17m7$82t$1@oper.dinoex.de>
References: <pa17m7$82t$1@oper.dinoex.de>

> On 4 Apr 2018, at 2:52, Peter <pmc@citylink.dinoex.sub.org> wrote:
>
> Occasionally I noticed that the system would not quickly process the
> tasks i need done, but instead prefer other, longrunning tasks. I
> figured it must be related to the scheduler, and decided it hates me.

If it hated you, it would behave much worse.

> A closer look shows the behaviour as follows (single CPU):

A single CPU? That's becoming rare! Is that a VM? Old hardware?
Something really specific?

> Lets run an I/O-active task, e.g, postgres VACUUM that would

And you're running a multi-process database server on it no less. That
is going to hurt, no matter how well the scheduler works.

> continuousely read from big files (while doing compute as well [1]):
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G  1.58K      0  12.9M      0
>
> Now start an endless loop:
> # while true; do :; done
>
> And the effect is:
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G      9      0  76.8K      0
>
> The VACUUM gets almost stuck! This figures with WCPU in "top":
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >85583 root       99    0  7044K  1944K RUN      1:06  92.21% bash
> >53005 pgsql      52    0   620M 91856K RUN      5:47   0.50% postgres
>
> Hacking on kern.sched.quantum makes it quite a bit better:
> # sysctl kern.sched.quantum=1
> kern.sched.quantum: 94488 -> 7874
>
> >pool        alloc   free   read  write   read  write
> >cache           -      -      -      -      -      -
> >  ada1s4    7.08G  10.9G    395      0  3.12M      0
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE    TIME    WCPU COMMAND
> >85583 root       94    0  7044K  1944K RUN      4:13  70.80% bash
> >53005 pgsql      52    0   276M 91856K RUN      5:52  11.83% postgres
>
>
> Now, as usual, the "root-cause" questions arise: What exactly does
> this "quantum"? Is this solution a workaround, i.e. actually something
> else is wrong, and has it tradeoff in other situations? Or otherwise,
> why is such a default value chosen, which appears to be ill-deceived?
>
> The docs for the quantum parameter are a bit unsatisfying - they say
> its the max num of ticks a process gets - and what happens when
> they're exhausted? If by default the endless loop is actually allowed
> to continue running for 94k ticks (or 94ms, more likely) uninterrupted,
> then that explains the perceived behaviour - buts thats certainly not
> what a scheduler should do when other procs are ready to run.

I can answer this from the operating systems course I followed recently.
This does not apply to FreeBSD specifically; it is general job
scheduling theory. I still need to read up on SCHED_ULE to see how the
details were implemented there. Or are you using the older SCHED_4BSD?

Anyway...

Jobs that are ready to run are collected on a ready queue. Since you
have a single CPU, there can only be a single job active on the CPU.
When that job is finished, the scheduler takes the next job in the ready
queue and assigns it to the CPU, etc.

Now, that would cause a much worse situation in your example case. The
endless loop would keep running once it gets the CPU and would never
release it. No other process would ever get a turn again. You wouldn't
even be able to get into such a system in that state using remote ssh.

That is why the scheduler has this "quantum", which limits the maximum
time the CPU will be assigned to a specific job. Once the quantum has
expired (with the job unfinished), the scheduler removes the job from
the CPU, puts it back on the ready queue and assigns the next job from
that queue to the CPU.

That's why you seem to get better performance with a smaller value for
the quantum; the endless loop gets forcibly interrupted more often.

This changing of the active job, however, involves a context switch for
the CPU. Registers, memory mappings and other state that the previous
job required need to be put aside and replaced by the corresponding
state of the new job to be run.
That uses up time and does nothing to progress the jobs that are
waiting for the CPU. Hence, you don't want the quantum to be too small
either, or you'll end up spending significant time switching contexts.
That gets worse when the job makes system calls, which are handled by
the kernel and so require a switch into and out of kernel mode (and the
Meltdown mitigations made that more expensive, because more rigorous
clean-up is necessary to prevent peeks into memory that belongs to the
kernel).

The "correct" value for the quantum depends on your type of workload.
PostgreSQL's auto-vacuum is a typical background process that will
probably (I didn't verify) request to be run at a lower priority, giving
other, more important, jobs more chance to get picked from the ready
queue (provided that the OS implements priority for the ready queue).

That is probably why your endless loop gets much more CPU time than the
VACUUM process. It may be that FreeBSD's default value for the quantum
is not suitable for your workload. Finding the one best suited to you is
not particularly easy though - perhaps FreeBSD allows access to average
job times (below quantum) from which a reasonable value can be
calculated.

That said, SCHED_ULE (the default scheduler for quite a while now) was
designed with multi-CPU configurations in mind and there are claims that
SCHED_4BSD works better for single-CPU configurations. You may give that
a try, if you're not already on SCHED_4BSD.

A much better option in your case would be to put the database on a
multi-core machine.

> [1]
> A pure-I/O job without compute load, like "dd", does not show
> this behaviour. Also, when other tasks are running, the unjust
> behaviour is not so stongly pronounced.

That is probably because dd has the decency to give the reins back to
the scheduler at regular intervals.
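
The round-robin behaviour described above can be sketched as a toy
simulation. This is only a sketch of the general theory, not of how
SCHED_ULE or SCHED_4BSD actually work. The function name `simulate`,
its parameters, the 1 ms compute burst, the 2 ms I/O wait and the 10
microsecond switch cost are all assumptions made up for illustration.
Only the two quantum values (94488 and 7874, read here as microseconds)
are taken from the sysctl output quoted earlier.

```python
from collections import deque


def simulate(quantum_us, n_hogs=1, switch_us=0,
             burst_us=1000, io_wait_us=2000, total_us=10_000_000):
    """Toy round-robin on one CPU: one I/O-bound job plus n_hogs endless loops.

    All parameters are hypothetical illustration values, in microseconds.
    Returns (io_requests_completed, time_lost_to_context_switches_us).
    """
    ready = deque(["io"] + [f"hog{i}" for i in range(n_hogs)])
    io_wakes_at = None   # set while the I/O job is blocked on its request
    running = None       # job whose state is currently loaded on the CPU
    now = io_done = overhead_us = 0

    while now < total_us:
        if not ready:
            # Only reachable with n_hogs == 0: the CPU idles until I/O is done.
            now = io_wakes_at
            ready.append("io")
            io_wakes_at = None
            continue

        job = ready.popleft()
        if job != running:
            # Changing the active job costs a context switch.
            now += switch_us
            overhead_us += switch_us
            running = job

        if job == "io":
            # Short compute burst, then the job blocks and gives up the CPU.
            now += burst_us
            io_done += 1
            io_wakes_at = now + io_wait_us
        else:
            # An endless loop never yields: it always burns its full quantum.
            now += quantum_us
            if io_wakes_at is not None and io_wakes_at <= now:
                # I/O completed during this quantum, so the I/O job
                # re-enters the queue ahead of the job that just expired.
                ready.append("io")
                io_wakes_at = None
            ready.append(job)

    return io_done, overhead_us


# Quantum values from the quoted sysctl output, one endless loop, free switches.
for q in (94488, 7874):
    print(q, simulate(q))

# Two endless loops and a hypothetical 10 us switch cost: a tiny quantum
# spends far more time switching than a moderate one.
for q in (7874, 100):
    print(q, simulate(q, n_hogs=2, switch_us=10))
```

Under these assumptions the I/O job completes 105 requests in 10
simulated seconds with the larger quantum and 1127 with the smaller
one, because it has to sit out one full quantum of the endless loop
after every request. With two endless loops and a non-zero switch cost,
shrinking the quantum further mostly increases the time lost to
switching, which is the tradeoff described above.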

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.