From: Alban Hertroys <haramrae@gmail.com>
Subject: Re: kern.sched.quantum: Creepy, sadistic scheduler
Date: Wed, 4 Apr 2018 12:39:24 +0200
To: Peter
Cc: freebsd-stable@FreeBSD.ORG
Message-Id: <9FDC510B-49D0-4722-B695-6CD38CA20D4A@gmail.com>
List-Id: Production branch of FreeBSD source code

> On 4 Apr 2018, at 2:52, Peter wrote:
>
> Occasionally I noticed that the system would not quickly process the
> tasks I need done, but instead prefer other, long-running tasks. I
> figured it must be related to the scheduler, and decided it hates me.
If it hated you, it would behave much worse.

> A closer look shows the behaviour as follows (single CPU):

A single CPU? That's becoming rare! Is that a VM? Old hardware?
Something really specific?

> Lets run an I/O-active task, e.g., postgres VACUUM that would

And you're running a multi-process database server on it, no less.
That is going to hurt, no matter how well the scheduler works.

> continuously read from big files (while doing compute as well [1]):
>
> >pool      alloc   free   read  write   read  write
> >cache         -      -      -      -      -      -
> >  ada1s4  7.08G  10.9G  1.58K      0  12.9M      0
>
> Now start an endless loop:
> # while true; do :; done
>
> And the effect is:
>
> >pool      alloc   free   read  write   read  write
> >cache         -      -      -      -      -      -
> >  ada1s4  7.08G  10.9G      9      0  76.8K      0
>
> The VACUUM gets almost stuck! This figures with WCPU in "top":
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE   TIME    WCPU COMMAND
> >85583 root       99    0  7044K  1944K RUN     1:06  92.21% bash
> >53005 pgsql      52    0   620M 91856K RUN     5:47   0.50% postgres
>
> Hacking on kern.sched.quantum makes it quite a bit better:
> # sysctl kern.sched.quantum=1
> kern.sched.quantum: 94488 -> 7874
>
> >pool      alloc   free   read  write   read  write
> >cache         -      -      -      -      -      -
> >  ada1s4  7.08G  10.9G    395      0  3.12M      0
>
> >  PID USERNAME  PRI NICE   SIZE    RES STATE   TIME    WCPU COMMAND
> >85583 root       94    0  7044K  1944K RUN     4:13  70.80% bash
> >53005 pgsql      52    0   276M 91856K RUN     5:52  11.83% postgres
>
> Now, as usual, the "root-cause" questions arise: What exactly does
> this "quantum" do? Is this solution a workaround, i.e. is something
> else actually wrong, and does it have tradeoffs in other situations?
> Or otherwise, why is such a default value chosen, which appears to
> be ill-conceived?
>
> The docs for the quantum parameter are a bit unsatisfying - they say
> it's the max number of ticks a process gets - and what happens when
> they're exhausted? If by default the endless loop is actually
> allowed to continue running for 94k ticks (or 94ms, more likely)
> uninterrupted, then that explains the perceived behaviour - but
> that's certainly not what a scheduler should do when other procs are
> ready to run.

I can answer this from the operating systems course I followed
recently. This does not apply to FreeBSD specifically; it is general
job scheduling theory. I still need to read up on SCHED_ULE to see
how the details were implemented there. Or are you using the older
SCHED_4BSD?

Anyway...

Jobs that are ready to run are collected in a ready queue. Since you
have a single CPU, there can only be a single job active on the CPU.
When that job finishes, the scheduler takes the next job from the
ready queue and assigns it to the CPU, and so on.

Now, that alone would cause a much worse situation in your example
case. The endless loop would keep running once it got the CPU and
would never release it. No other process would ever get a turn again.
You wouldn't even be able to ssh into a system in that state.

That is why the scheduler has this "quantum", which limits the
maximum time the CPU stays assigned to a specific job. Once the
quantum has expired (with the job unfinished), the scheduler removes
the job from the CPU, puts it back on the ready queue and assigns the
next job from that queue to the CPU.

That's why you seem to get better performance with a smaller value
for the quantum: the endless loop gets forcibly interrupted more
often.

This changing of the active job, however, involves a context switch.
The memory, registers, file handles, etc. that were in use by the
previous job need to be put aside and replaced by those of the new
job. That uses up time and does nothing to progress the jobs that are
waiting for the CPU. Hence, you don't want the quantum to be too
small either, or you'll end up spending a significant share of your
time switching contexts. That gets worse when jobs make many system
calls, which are handled by the kernel (and Meltdown made it worse
still, because more rigorous clean-up is needed to prevent peeks into
memory that the kernel was just using).

The "correct" value for the quantum depends on your type of workload.
PostgreSQL's auto-vacuum is a typical background process that will
probably (I didn't verify) request to be run at a lower priority,
giving other, more important, jobs a better chance of being picked
from the ready queue (provided that the OS implements priorities for
the ready queue).

That is probably why your endless loop gets much more CPU time than
the VACUUM process. It may be that FreeBSD's default value for the
quantum is not suitable for your workload. Finding the one best
suited to you is not particularly easy, though - perhaps FreeBSD
exposes average job run times (below the quantum) from which a
reasonable value could be calculated.

That said, SCHED_ULE (the default scheduler for quite a while now)
was designed with multi-CPU configurations in mind, and there are
claims that SCHED_4BSD works better for single-CPU configurations.
You may give that a try, if you're not already on SCHED_4BSD.
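If you want to check which scheduler your kernel was built with,
there is a read-only sysctl for that (it reports ULE or 4BSD):

# sysctl kern.sched.name

Switching to SCHED_4BSD means building a custom kernel. From memory,
and untested here, the usual route is a config like
/usr/src/sys/amd64/conf/MYKERNEL (MYKERNEL is just a placeholder
name):

include    GENERIC
ident      MYKERNEL
nooptions  SCHED_ULE    # drop the default scheduler
options    SCHED_4BSD   # use the traditional one instead

followed by the usual build-and-install dance:

# cd /usr/src
# make buildkernel KERNCONF=MYKERNEL
# make installkernel KERNCONF=MYKERNEL
# shutdown -r now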
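A cheaper experiment, and one that would test the priority theory
above, is to start the CPU hog at a reduced priority and see whether
the VACUUM recovers. For example (hypothetical test commands, I
haven't benchmarked this against your workload):

# nice -n 20 sh -c 'while :; do :; done'

or, more drastically, at idle priority, so it only gets the CPU when
nothing else wants it:

# idprio 31 sh -c 'while :; do :; done'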
A much better option in your case would be to put the database on a
multi-core machine.

> [1]
> A pure-I/O job without compute load, like "dd", does not show
> this behaviour. Also, when other tasks are running, the unjust
> behaviour is not so strongly pronounced.

That is probably because dd has the decency to give the reins back to
the scheduler at regular intervals: it blocks on I/O long before its
quantum expires, so it never needs to be interrupted forcibly.

Alban Hertroys
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.