From: Stefan Esser
Date: Sat, 9 Jun 2018 13:53:48 +0200
Subject: Re: Is kern.sched.preempt_thresh=0 a sensible default?
To: Andriy Gapon, FreeBSD Current
Cc: "M. Warner Losh"

On 07.06.18 at 19:14, Andriy Gapon wrote:
> On 03/05/2018 12:41, Andriy Gapon wrote:
>> I think that we need preemption policies that might not be expressible as one or
>> two numbers. A policy could be something like this:
>> - interrupt threads can preempt only threads from "lower" classes: real-time,
>>   kernel, timeshare, idle;
>> - interrupt threads cannot preempt other interrupt threads
>> - real-time threads can preempt other real-time threads and threads from "lower"
>>   classes: kernel, timeshare, idle
>> - kernel threads can preempt only threads from lower classes: timeshare, idle
>> - interactive timeshare threads can only preempt batch and idle threads
>> - batch threads can only preempt idle threads
>
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693

Hi Andriy,

I highly appreciate your effort to improve the scheduling in SCHED_ULE.

But I'm afraid that your scheme will not fix the problem. As you may know,
there are a number of problems with SCHED_ULE, which lead quite a number
of users to prefer SCHED_4BSD even on multi-core systems.
The problems I'm aware of:

1) On UP systems, I/O intensive applications may be starved by compute
   intensive processes that are allowed to consume their full quantum
   of time (limiting reads to some 10 per second worst case).

2) Similarly, on SMP systems with load higher than the number of cores
   (virtual cores in case of HT), cores busy with compute-bound work
   can slow down a cp of a large file from 100s of MB/s to 100s of
   KB/s, under certain circumstances.

3) Programs that evenly split the load on all available cores have been
   suffering from sub-optimal assignment of threads to cores. E.g. on a
   CPU with 8 (virtual) cores, this resulted in 6 cores running the load
   in nominal time, 1 core taking twice as long because 2 threads were
   scheduled to run on it, while 1 core was mostly idle. Even if the
   load was initially evenly distributed, a woken-up process that ran on
   one core destroyed the symmetry and it was not recovered. (This was a
   problem e.g. for parallel programs using MPI or the like.)

4) The real time behavior of SCHED_ULE is weak due to interactive
   processes (e.g. the X server) being put into the "time-share" class
   and then suffering from the problems described as 1) or 2) above.
   (You distinguish time-share and batch processes, which both are
   allowed to consume their full quanta even if a higher priority
   process in their class becomes runnable. I think this will not give
   the required responsiveness e.g. for an X server.)
   They should be considered I/O intensive if they often don't use
   their full quantum, without taking the significant amount of CPU
   time they may use at times into account. (I.e. the criterion for
   time-sharing should not be the CPU time consumed, but rather some
   fraction of the quanta not being fully used due to voluntarily
   giving up the CPU.) With many real-time threads it may be hard to
   identify interactive threads, since they are non-voluntarily
   disrupted too often - this must be considered in the sampling of
   voluntary vs. non-voluntary context switches.

5) The NICE parameter has hardly any effect on the scheduling. Processes
   started with nice 19 get nearly the same share of the CPU as
   processes at nice 0, while traditionally they should only run when a
   core would otherwise be idle. Nice values between 0 and 19 have even
   less effect (hardly any).

I have not had time to try the patch in that review, but I think that
the cause of scheduling problems is not localized in that function.

And a solution should be based on typical use cases or sample scenarios
being applied to a scheduling policy. There are some easy cases (e.g. a
"random" load of independent processes like a parallel make run), where
only cache effects are relevant (try to keep a thread on its CPU as long
as possible and, if interrupted, continue it on that CPU if you can
assume there is still significant cached state).

There have been extensive KTR traces that showed the scheduler behavior
under specific loads, especially MPI, and there have been attempts to
fix the uneven distribution of processes for that case (but AFAIR not
with good success).

Your patches may be part of the solution, with at least 3 other parts
remaining:

1) The classification of interactive and time-share should be separate.
   Interactive means that the process does not use its full quantum in
   a non-negligible fraction of cases. The X server or a DBMS server
   should not be considered compute intensive, or request rates will be
   as low as 10 per second (if the time-share quantum is in the order
   of 100 ms).

2) The scheduling should guarantee symmetric distribution of the load
   for scenarios such as parallel programs with MPI. Since OpenMP and
   other mechanisms have similar requirements, this will become more
   relevant over time.

3) The nice-ness of a process should be relevant, to give the user or
   admin a way to adjust priorities.
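
A minimal sketch in C of how the classification from point 1) might
look, purely as an illustration. The structure, the function names and
the percentage threshold are hypothetical and are not taken from
SCHED_ULE or from the patch under review. The counters are analogous to
the ru_nvcsw / ru_nivcsw values reported by getrusage(2), with
preemptions by higher-priority threads assumed to be tracked separately
so that they do not hide an interactive thread. The second function
only restates the arithmetic behind "10 per second" for a 100 ms
quantum.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-thread sample; not an existing SCHED_ULE structure. */
struct csw_sample {
	uint64_t voluntary;	/* gave up the CPU before the quantum expired */
	uint64_t involuntary;	/* switched out because the quantum was used up */
	uint64_t preempted;	/* displaced by a higher-priority thread */
};

/*
 * Assumed criterion: a thread is interactive when at least min_percent
 * of its counted switches were voluntary. Preemptions are deliberately
 * left out of the count, so frequent disruption by real-time threads
 * does not change the classification.
 */
static bool
is_interactive(const struct csw_sample *s, unsigned min_percent)
{
	uint64_t counted = s->voluntary + s->involuntary;

	if (counted == 0)
		return (false);	/* no evidence yet */
	return (s->voluntary * 100 >= counted * min_percent);
}

/*
 * Upper bound on the request rate of a server that has to wait one full
 * quantum of a compute-bound thread before handling each request.
 */
static unsigned
worst_case_requests_per_second(unsigned quantum_ms)
{
	return (quantum_ms == 0 ? 0 : 1000 / quantum_ms);
}
```

With such a criterion, a thread that often blocks voluntarily would
stay interactive even if it consumes a lot of CPU time at other times,
which is the distinction asked for above. Where the threshold should
lie is an open question that would need measurements.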

Each of those points will require changes in different parts of the
scheduler, but I think those changes should not be considered in
isolation.

Best regards, Stefan