From: Peter
Subject: Found the issue! - SCHED_ULE+PREEMPTION is the problem
Date: Tue, 10 Apr 2018 20:30:35 +0200
To: freebsd-stable@FreeBSD.ORG

Results:

1. The tdq_ridx pointer

The perceived slow advance (of the tdq_ridx pointer into the circular
array) is correct behaviour. McKusick writes:

>The pointer is advanced once per system tick, although it may not
>advance on a tick until the currently selected queue is empty. Since
>each thread is given a maximum time slice and no threads may be added
>to the current position, the queue will drain in a bounded amount of
>time.

Therefore, it is also normal that the process (the piglet in this case)
runs until its time slice (aka quantum) is used up.

2. The influence of preempt_thresh

This can be found in tdq_runq_add().
A simplified description of the logic there is as follows:

    td_priority < 152 ?   -> add to realtime-queue
    td_priority <= 223 ?  -> add to timeshare-queue
        if preempted
            circular_index = tdq_ridx
        else
            circular_index = tdq_idx + td_priority
    else                  -> add to idle-queue

If the thread had been preempted, it is reinserted at the current
working position of the circular array; otherwise the position is
calculated from the thread priority.

3. The quantum

Most of the task switches come from device interrupts. Those are
running at priority intr:8 or intr:12. So, as soon as preempt_thresh
is 12 or bigger, the piglet is almost always reinserted in the runqueue
due to preemption. And, as we see, in that case no real scheduling
decision takes place; the thread is simply resumed!
A real scheduling decision happens only after the quantum is exhausted.
Therefore, reducing the quantum helps.

4. History

This behaviour was deliberately introduced in r171713.
It was fixed in r220198, with a focus on CPU hogs and single-CPU
systems.
In r239157 the fix was undone for performance reasons, with the focus
on rescheduling only at the end of the time slice.

5. Conclusion

The current defaults do not seem well suited for certain CPU-intensive
tasks. Possible solutions are one of:
 * not use SCHED_ULE
 * not use preemption
 * change kern.sched.quantum to a minimal value.

P.
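
For illustration, the placement logic from point 2 can be modelled as a
small C sketch. This is not the kernel code: the thresholds 152 and 223
are the ones quoted above, while the macro names, the struct, the
function place_thread() and the array length of 64 (with wrap-around)
are assumptions made up for this example.

```c
#include <stdbool.h>

/* Thresholds as quoted in the post; the names are hypothetical. */
#define PRI_FIRST_TIMESHARE 152
#define PRI_LAST_TIMESHARE  223
#define NQUEUES             64   /* assumed length of the circular array */

enum queue_kind { Q_REALTIME, Q_TIMESHARE, Q_IDLE };

struct placement {
    enum queue_kind kind;
    int slot;            /* only meaningful for Q_TIMESHARE, else -1 */
};

/* Simplified model of the decision described in point 2. */
static struct placement
place_thread(int td_priority, bool preempted, int tdq_idx, int tdq_ridx)
{
    struct placement p = { Q_IDLE, -1 };

    if (td_priority < PRI_FIRST_TIMESHARE) {
        p.kind = Q_REALTIME;
    } else if (td_priority <= PRI_LAST_TIMESHARE) {
        p.kind = Q_TIMESHARE;
        if (preempted)
            p.slot = tdq_ridx;   /* back into the slot currently being drained */
        else
            p.slot = (tdq_idx + td_priority) % NQUEUES;  /* wrap is an assumption */
    }
    return p;
}
```

In this model a preempted timeshare thread always lands in slot
tdq_ridx, whatever its priority, which is the "simple resume" effect
described in point 3.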