From: Alban Hertroys <haramrae@gmail.com>
Subject: Re: kern.sched.quantum: Creepy, sadistic scheduler
Date: Wed, 4 Apr 2018 12:39:24 +0200
To: Peter
Cc: freebsd-stable@FreeBSD.ORG
Message-Id: <9FDC510B-49D0-4722-B695-6CD38CA20D4A@gmail.com>
List-Id: Production branch of FreeBSD source code

> On 4 Apr 2018, at 2:52, Peter wrote:
>
> Occasionally I noticed that the system would not quickly process the
> tasks I need done, but instead prefer other, long-running tasks. I
> figured it must be related to the scheduler, and decided it hates me.
If it hated you, it would behave much worse.

> A closer look shows the behaviour as follows (single CPU):

A single CPU? That's becoming rare! Is that a VM? Old hardware?
Something really specific?

> Lets run an I/O-active task, e.g., postgres VACUUM that would

And you're running a multi-process database server on it, no less.
That is going to hurt, no matter how well the scheduler works.

> continuously read from big files (while doing compute as well [1]):
>
> >pool      alloc   free   read  write   read  write
> >cache         -      -      -      -      -      -
> >  ada1s4  7.08G  10.9G  1.58K      0  12.9M      0
>
> Now start an endless loop:
> # while true; do :; done
>
> And the effect is:
>
> >pool      alloc   free   read  write   read  write
> >cache         -      -      -      -      -      -
> >  ada1s4  7.08G  10.9G      9      0  76.8K      0
>
> The VACUUM gets almost stuck! This figures with WCPU in "top":
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE   TIME    WCPU COMMAND
> >85583 root       99    0  7044K  1944K RUN     1:06  92.21% bash
> >53005 pgsql      52    0   620M 91856K RUN     5:47   0.50% postgres
>
> Hacking on kern.sched.quantum makes it quite a bit better:
> # sysctl kern.sched.quantum=1
> kern.sched.quantum: 94488 -> 7874
>
> >pool      alloc   free   read  write   read  write
> >cache         -      -      -      -      -      -
> >  ada1s4  7.08G  10.9G    395      0  3.12M      0
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE   TIME    WCPU COMMAND
> >85583 root       94    0  7044K  1944K RUN     4:13  70.80% bash
> >53005 pgsql      52    0   276M 91856K RUN     5:52  11.83% postgres
>
> Now, as usual, the "root-cause" questions arise: What exactly does
> this "quantum" do? Is this solution a workaround, i.e. is something
> else actually wrong, and does it have tradeoffs in other situations?
> Or otherwise, why is such a default value chosen, which appears to
> be ill-conceived?
>
> The docs for the quantum parameter are a bit unsatisfying - they say
> it's the max number of ticks a process gets - and what happens when
> they're exhausted? If by default the endless loop is actually
> allowed to continue running for 94k ticks (or 94ms, more likely)
> uninterrupted, then that explains the perceived behaviour - but
> that's certainly not what a scheduler should do when other procs are
> ready to run.

I can answer this from the operating systems course I followed
recently. This does not apply to FreeBSD specifically; it is general
job scheduling theory. I still need to read up on SCHED_ULE to see
how the details were implemented there. Or are you using the older
SCHED_4BSD?

Anyway...

Jobs that are ready to run are collected in a ready queue. Since you
have a single CPU, there can only be a single job active on the CPU.
When that job finishes, the scheduler takes the next job from the
ready queue and assigns it to the CPU, and so on.

Now, that alone would cause a much worse situation in your example
case. The endless loop would keep running once it got the CPU and
would never release it. No other process would ever get a turn again.
You wouldn't even be able to ssh into a system in that state.

That is why the scheduler has this "quantum", which limits the
maximum time the CPU stays assigned to a specific job. Once the
quantum has expired (with the job unfinished), the scheduler removes
the job from the CPU, puts it back on the ready queue and assigns the
next job from that queue to the CPU.

That's why you seem to get better performance with a smaller value
for the quantum: the endless loop gets forcibly interrupted more
often.

This changing of the active job, however, involves a context switch.
The memory, registers, file handles, etc. that were in use by the
previous job need to be put aside and replaced by those of the new
job. That uses up time and does nothing to progress the jobs that are
waiting for the CPU. Hence, you don't want the quantum to be too
small either, or you'll end up spending a significant share of your
time switching contexts. That gets worse when jobs make many system
calls, which are handled by the kernel (and Meltdown made it worse
still, because more rigorous clean-up is needed to prevent peeks into
memory that the kernel was just using).

The "correct" value for the quantum depends on your type of workload.
PostgreSQL's auto-vacuum is a typical background process that will
probably (I didn't verify) request to be run at a lower priority,
giving other, more important, jobs a better chance of being picked
from the ready queue (provided that the OS implements priorities for
the ready queue).

That is probably why your endless loop gets much more CPU time than
the VACUUM process. It may be that FreeBSD's default value for the
quantum is not suitable for your workload. Finding the one best
suited to you is not particularly easy, though - perhaps FreeBSD
exposes average job run times (below the quantum) from which a
reasonable value could be calculated.

That said, SCHED_ULE (the default scheduler for quite a while now)
was designed with multi-CPU configurations in mind, and there are
claims that SCHED_4BSD works better for single-CPU configurations.
You may give that a try, if you're not already on SCHED_4BSD.
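If you want to check which scheduler your kernel was built with,
there is a read-only sysctl for that (it reports ULE or 4BSD):

# sysctl kern.sched.name

Switching to SCHED_4BSD means building a custom kernel. From memory,
and untested here, the usual route is a config like
/usr/src/sys/amd64/conf/MYKERNEL (MYKERNEL is just a placeholder
name):

include    GENERIC
ident      MYKERNEL
nooptions  SCHED_ULE    # drop the default scheduler
options    SCHED_4BSD   # use the traditional one instead

followed by the usual build-and-install dance:

# cd /usr/src
# make buildkernel KERNCONF=MYKERNEL
# make installkernel KERNCONF=MYKERNEL
# shutdown -r now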
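A cheaper experiment, and one that would test the priority theory
above, is to start the CPU hog at a reduced priority and see whether
the VACUUM recovers. For example (hypothetical test commands, I
haven't benchmarked this against your workload):

# nice -n 20 sh -c 'while :; do :; done'

or, more drastically, at idle priority, so it only gets the CPU when
nothing else wants it:

# idprio 31 sh -c 'while :; do :; done'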
A much better option in your case would be to put the database on a
multi-core machine.

> [1]
> A pure-I/O job without compute load, like "dd", does not show
> this behaviour. Also, when other tasks are running, the unjust
> behaviour is not so strongly pronounced.

That is probably because dd has the decency to give the reins back to
the scheduler at regular intervals: it blocks on I/O long before its
quantum expires, so it never needs to be interrupted forcibly.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.