Date: Sat, 9 Jun 2018 13:53:48 +0200
From: Stefan Esser <se@freebsd.org>
To: Andriy Gapon <avg@FreeBSD.org>, FreeBSD Current <freebsd-current@freebsd.org>
Cc: "M. Warner Losh" <imp@freebsd.org>
Subject: Re: Is kern.sched.preempt_thresh=0 a sensible default?
Message-ID: <b9925356-dd68-32a1-c9fb-441b694c0ccf@freebsd.org>
In-Reply-To: <bd122dbb-a708-dbc4-838b-3e1784921eff@FreeBSD.org>
References: <dc8d0285-1916-6581-2b2d-e8320ec3d894@freebsd.org> <CANCZdfoieekesqKa5RmOp=z2vycsVqnVss7ROnO87YTV-qBUzA@mail.gmail.com> <1d188cb0-ebc8-075f-ed51-57641ede1fd6@freebsd.org> <49fa8de4-e164-0642-4e01-a6188992c32e@freebsd.org> <32d6305b-3d57-4d37-ba1b-51631e994520@FreeBSD.org> <93efc3e1-7ac3-fedc-a71e-66c99f8e8c1e@freebsd.org> <9aaec961-e604-303a-52f3-ee24e3a435d0@FreeBSD.org> <bd122dbb-a708-dbc4-838b-3e1784921eff@FreeBSD.org>
On 07.06.18 at 19:14, Andriy Gapon wrote:
> On 03/05/2018 12:41, Andriy Gapon wrote:
>> I think that we need preemption policies that might not be expressible as one
>> or two numbers. A policy could be something like this:
>> - interrupt threads can preempt only threads from "lower" classes: real-time,
>>   kernel, timeshare, idle;
>> - interrupt threads cannot preempt other interrupt threads
>> - real-time threads can preempt other real-time threads and threads from
>>   "lower" classes: kernel, timeshare, idle
>> - kernel threads can preempt only threads from lower classes: timeshare, idle
>> - interactive timeshare threads can only preempt batch and idle threads
>> - batch threads can only preempt idle threads
>
> Here is a sketch of the idea: https://reviews.freebsd.org/D15693

Hi Andriy,

I highly appreciate your effort to improve the scheduling in SCHED_ULE, but I'm
afraid that your scheme will not fix the problem.

As you may know, there are a number of problems with SCHED_ULE, which make
quite a number of users prefer SCHED_4BSD even on multi-core systems. The
problems I'm aware of:

1) On UP systems, I/O intensive applications may be starved by compute
   intensive processes that are allowed to consume their full quantum of time
   (limiting reads to some 10 per second in the worst case).

2) Similarly, on SMP systems with load higher than the number of cores
   (virtual cores in the case of HT), the compute-bound cores can slow down a
   cp of a large file from hundreds of MB/s to hundreds of KB/s, under certain
   circumstances.

3) Programs that evenly split the load over all available cores have been
   suffering from sub-optimal assignment of threads to cores. E.g., on a CPU
   with 8 (virtual) cores, this resulted in 6 cores running the load in
   nominal time, 1 core taking twice as long because 2 threads were scheduled
   to run on it, while 1 core was mostly idle. Even if the load was initially
   evenly distributed, a woken-up process that ran on one core destroyed the
   symmetry, and it was not recovered.
   (This was a problem, e.g., for parallel programs using MPI or the like.)

4) The real-time behavior of SCHED_ULE is weak, because interactive processes
   (e.g. the X server) are put into the "time-share" class and then suffer
   from the problems described as 1) and 2) above. (You distinguish time-share
   and batch processes, but both are allowed to consume their full quanta even
   if a higher-priority process in their class becomes runnable. I think this
   will not give the required responsiveness, e.g. for an X server.) Such
   processes should be considered I/O intensive if they often don't use their
   full quantum, without taking into account the significant amount of CPU
   time they may use at times. (I.e., the criterion for time-sharing should
   not be the CPU time consumed, but rather some fraction of the quanta not
   being fully used due to voluntarily giving up the CPU.) With many real-time
   threads it may be hard to identify interactive threads, since they are
   non-voluntarily disrupted too often; this must be considered when sampling
   voluntary vs. non-voluntary context switches.

5) The nice parameter has hardly any effect on the scheduling. Processes
   started with nice 19 get nearly the same share of the CPU as processes at
   nice 0, while traditionally they should only have run when a core would
   otherwise be idle. Nice values between 0 and 19 have even less effect
   (hardly any).

I have not had time to try the patch in that review, but I think that the
cause of the scheduling problems is not localized in that function. A solution
should be based on typical use cases or sample scenarios being applied to a
scheduling policy. There are some easy cases (e.g. a "random" load of
independent processes, like a parallel make run) where only cache effects are
relevant (try to keep a thread on its CPU as long as possible and, if
interrupted, continue it on that CPU if you can assume there is still
significant cached state).
There have been extensive KTR traces that showed the scheduler behavior under
specific loads, especially MPI, and there have been attempts to fix the uneven
distribution of processes for that case (but AFAIR not with good success).

Your patches may be part of the solution, with at least 3 other parts
remaining:

1) The classification of interactive and time-share threads should be
   separate. Interactive means that the process does not use its full quantum
   in a non-negligible fraction of cases. The X server or a DBMS server should
   not be considered compute intensive, or request rates will be as low as 10
   per second (if the time-share quantum is on the order of 100 ms).

2) The scheduling should guarantee a symmetric distribution of the load for
   scenarios such as parallel programs using MPI. Since OpenMP and other
   mechanisms have similar requirements, this will become more relevant over
   time.

3) The nice-ness of a process should be relevant, to give the user or admin a
   way to adjust priorities.

Each of these points will require changes in different parts of the scheduler,
but I think those changes should not be considered in isolation.

Best regards,

	Stefan