From: Stefan Esser
Date: Mon, 9 Apr 2018 12:07:46 +0200
Subject: Re: more data: SCHED_ULE+PREEMPTION is the problem
To: freebsd-stable@freebsd.org
Cc: Jeff Roberson
Message-ID: <07279919-3b8f-3415-559f-6e7e66cb51c9@freebsd.org>

On 07.04.18 at 16:18, Peter wrote:
> 3. kern.sched.preempt_thresh
>
> I could make the problem disappear by changing kern.sched.preempt_thresh from
> the default 80 to either 11 (i5-3570T) or 7 (p3) or smaller. This seems to
> correspond to the disk interrupt threads, which run at intr:12 (i5-3570T) or
> intr:8 (p3).

[CC added to include Jeff as the author of the ULE scheduler ...]

Since I had somewhat similar problems on my systems (quad-core CPUs with SMT
enabled, i.e. 8 threads of execution), with compute bound processes keeping
I/O intensive processes from running (load average of 12 with 8 "CPUs"), and
these problems were affected by preempt_thresh, I checked how this variable
is used in the scheduler.

The code is in /sys/kern/sched_ule.c. The variable controls whether a thread
that has become runnable (e.g., after waiting for disk I/O to complete) will
preempt the thread currently running on "this" CPU (i.e. the one executing
this test in the kernel).

IMHO, preempt_thresh should default to a much higher number than 80 (e.g.
190), but maybe I misunderstand some of the details ...

    static inline int
    sched_shouldpreempt(int pri, int cpri, int remote)
    {

The parameters are:

    pri:    the priority of the now runnable thread
    cpri:   the priority of the thread that currently runs on "this" CPU
    remote: whether to consider preempting a thread on another CPU

The priority values are those displayed by top or ps -l as "PRI", but with an
offset of 100 applied (i.e. pri=120 is displayed as PRI=20 in top).

If the now runnable thread does not have a higher priority (lower numeric
value) than the currently executing one (cpri), the currently running thread
will not be preempted:

        /*
         * If the new priority is not better than the current priority there is
         * nothing to do.
         */
        if (pri >= cpri)
            return (0);

If the current thread is the idle thread, it will always be preempted by the
now runnable thread:

        /*
         * Always preempt idle.
         */
        if (cpri >= PRI_MIN_IDLE)
            return (1);

A value of preempt_thresh=0 (e.g.
if "options PREEMPTION" is missing in the kernel config) lets the previously
running thread continue (except if it was the idle thread, which has been
dealt with above). The compute bound thread may continue until its quantum
has expired.

        /*
         * If preemption is disabled don't preempt others.
         */
        if (preempt_thresh == 0)
            return (0);

For any other value of preempt_thresh, the priority of the thread that has
just become runnable will be compared to preempt_thresh, and if this priority
is higher than (lower numeric value) or equal to preempt_thresh, the thread
for which (e.g.) disk I/O finished will preempt the current thread:

        /*
         * Preempt if we exceed the threshold.
         */
        if (pri <= preempt_thresh)
            return (1);

===> This is the only condition that depends on preempt_thresh > 0 <===

The flag "remote" controls whether this thread will be scheduled to run if
its priority is higher than or equal to PRI_MAX_INTERACT (numerically less
than or equal to 151) and if the opposite is true for the currently running
thread (cpri). The value of remote will always be 0 on kernels built without
"options SMP".

        /*
         * If we're interactive or better and there is non-interactive
         * or worse running preempt only remote processors.
         */
        if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT)
            return (1);

The critical use of preempt_thresh is marked above. If it is 0, no preemption
will occur. On a single processor system, this should allow the CPU bound
thread to run for as long as its quantum lasts.

A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound
thread to preempt any other thread with lower priority (cpri > pri). But if a
high priority kernel thread (with a numerically low cpri value) is active
during this test, the I/O bound process will not preempt that higher priority
thread.
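
The checks walked through above can be collected into a small stand-alone
model for experimenting with different thresholds in user space. This is only
a sketch, not the kernel code: the function name model_shouldpreempt and the
extra thresh parameter are made up for illustration, the final return (0) is
assumed, and the constant PRI_MIN_IDLE=224 is an assumption about
sys/priority.h of that era (PRI_MAX_INTERACT=151 is the value quoted above).

```c
/*
 * User-space model of the preemption test described above.
 * Assumption: PRI_MIN_IDLE is 224 in the kernel sources discussed here.
 * PRI_MAX_INTERACT=151 is the value mentioned in the text.
 */
#define PRI_MIN_IDLE        224
#define PRI_MAX_INTERACT    151

/*
 * Hypothetical helper: same order of tests as sched_shouldpreempt(),
 * but with the threshold passed in instead of read from a global.
 */
int
model_shouldpreempt(int pri, int cpri, int remote, int thresh)
{
	/* New thread is not better than the running one: nothing to do. */
	if (pri >= cpri)
		return (0);
	/* The idle thread is always preempted. */
	if (cpri >= PRI_MIN_IDLE)
		return (1);
	/* Threshold 0 disables preemption of non-idle threads. */
	if (thresh == 0)
		return (0);
	/* The only test that depends on the threshold value. */
	if (pri <= thresh)
		return (1);
	/* Interactive thread vs. non-interactive thread on a remote CPU. */
	if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT)
		return (1);
	/* Assumed fall-through: no preemption. */
	return (0);
}
```

With these assumptions, model_shouldpreempt(152, 199, 0, 80) yields 0 and
model_shouldpreempt(152, 199, 0, 190) yields 1, i.e. only the larger
threshold lets a thread at pri=152 preempt one at pri=199 on the local CPU.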
Whether the I/O bound thread will run (instead of the compute bound one)
after the higher priority thread has given up the CPU will depend on the
scheduler's decision about which thread to select. And for "timeshare"
threads, this will often not be the higher priority (I/O bound) thread, but
the compute bound thread, which may then execute until next being interrupted
by the I/O bound thread (which will not happen if no new I/O has been
requested).

This might explain why setting preempt_thresh to a very low value (in the
range of real-time kernel threads) enforces preemption of the CPU bound
thread, while any higher (numeric) value of preempt_thresh prevents this and
makes tdq_choose() often select the low priority CPU bound thread over the
higher priority I/O bound thread.

BUT the first test in sched_shouldpreempt() should prevent any user process
from ever preempting a real-time thread: "if (pri >= cpri) return 0;".

For preemption to occur, pri must be numerically lower than cpri, and pri
must be numerically lower than or equal to preempt_thresh.

> a. with kern.sched.preempt_thresh=80
>
> $ lz4 DATABASE_TEST_FILE /dev/null & while true;
>   do ps -o pid,pri,"%cpu",command -p 2119,$!
>   sleep 3
> done
> [1] 6073
[...]
>  PID PRI %CPU COMMAND
> 6073  52  6.5 lz4 DATABASE_TEST_FILE /dev/null
> 2119  99 91.5 -bash (bash)

The I/O bound thread does not preempt the compute bound thread when becoming
runnable (data arrived from disk). With preempt_thresh=80 (corresponding to
PRI=-20), only real-time threads may cause preemption; the I/O bound thread
cannot (PRI=52 / pri=152).

A value of preempt_thresh in the range of 190 (corresponding to PRI=90)
should allow the lz4 process to preempt the CPU bound process (with higher
pri/PRI).

> b. with kern.sched.preempt_thresh=11
>
>  PID PRI %CPU COMMAND
> 4920  21  0.0 lz4 DATABASE_TEST_FILE /dev/null
> 2119 101 93.5 -bash (bash)
[...]
>  PID PRI %CPU COMMAND
> 4920  85 43.0 lz4 DATABASE_TEST_FILE /dev/null
> 2119  85 45.5 -bash (bash)

Such a low preempt_thresh does not allow any user process to preempt any
other one (except when running with temporarily increased priority in the
kernel).

Only a kernel thread (soft interrupt?) might cause preemption, and if the
interrupt is due to the completion of a read issued by the I/O bound thread,
then the I/O bound process is not the one being preempted. This will make the
timeshare scheduler select the process with higher priority (lower PRI) that
did not recently run (i.e. the I/O bound process, if both have the same PRI)
when the kernel thread goes to sleep.

But (if my analysis is correct) this indicates that preempt_thresh set to an
extremely low value just helps by accident. The kernel thread interrupts the
CPU bound thread, and the I/O bound thread is selected as the next runnable
thread in the time-share run queue, either because of its lower PRI value or
because it did not run last before the preemption occurred (with equal PRI
for both).

But, in fact, the same scheduler selection should have occurred in test (a),
too, if e.g. a soft interrupt preempts the compute bound thread. Not sure why
this does not happen ...

(And this may be an indication that I do not fully understand what's going
on ;-) ...)

> From this we can see that in case b. both processes balance out nicely and
> meet at equal CPU shares.
> Whereas in case a., after about 10 Seconds (the first 3 records) they move to
> opposite ends of the scale and stay there.
>
> From this I might suppose that here is some kind of mis-calculation or
> mis-adjustment of the task priorities happening.

I'd be interested in your results with preempt_thresh set to a value of e.g.
190. The PRI=85 values in your test case (b) correspond to pri=185, and with
preempt_thresh slightly higher than that, the lz4 process should still get a
50% share of the CPU.
(If its PRI grows over that of the CPU bound process, it will not be able to
preempt it, so its PRI should match the one of the CPU bound process).

Regards, STefan
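
For reference, the arithmetic behind the test cases above can be checked with
a few lines of C. This is a rough sketch under stated assumptions: the helper
names pri_from_display and local_preempt are invented for illustration, the
offset of 100 between the displayed PRI and the kernel priority is the one
described in this mail, and the check covers only the local, non-idle case
(pri numerically lower than cpri and not above a non-zero preempt_thresh).

```c
/* Assumption: top and ps show the kernel priority minus 100. */
#define PRI_DISPLAY_OFFSET  100

/* Hypothetical helper: convert a displayed PRI to the kernel value. */
int
pri_from_display(int shown)
{
	return (shown + PRI_DISPLAY_OFFSET);
}

/*
 * Hypothetical helper: would a thread shown with PRI=shown_pri preempt
 * a non-idle thread shown with PRI=shown_cpri on the same CPU, given
 * the threshold thresh? The idle and remote cases are ignored here.
 */
int
local_preempt(int shown_pri, int shown_cpri, int thresh)
{
	int pri = pri_from_display(shown_pri);
	int cpri = pri_from_display(shown_cpri);

	return (thresh != 0 && pri < cpri && pri <= thresh);
}
```

Under these assumptions, local_preempt(52, 99, 80) is 0 (case a: pri=152 is
above the threshold), local_preempt(52, 99, 190) is 1, and
local_preempt(85, 85, 190) is 0, because equal priorities never preempt.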