Date: Tue, 10 Apr 2018 13:22:07 +0200
From: Peter <pmc@citylink.dinoex.sub.org>
To: freebsd-stable@FreeBSD.ORG
Subject: Re: more data: SCHED_ULE+PREEMPTION is the problem
Message-ID: <d9d1ccfe-4f0a-ae87-ac6b-a0ecd6509ccb@citylink.dinoex.sub.org>
In-Reply-To: <07279919-3b8f-3415-559f-6e7e66cb51c9@freebsd.org>
References: <pa17m7$82t$1@oper.dinoex.de> <paak11$vcf$1@oper.dinoex.de> <07279919-3b8f-3415-559f-6e7e66cb51c9@freebsd.org>
Hi Stefan,

I'm glad to see You're thinking along similar paths as I did. But let me
first answer Your question straight away, and sort out the remainder
afterwards.

> I'd be interested in your results with preempt_thresh set to a value
> of e.g. 190.

There is no difference. Any value above 7 shows the problem identically.
I think this value (or preemption as a whole) is not the actual cause of
the problem; it just changes some conditions that make the problem
visible. So trying to adjust preempt_thresh in order to fix the problem
seems to be a dead end.

Stefan Esser wrote:
> The critical use of preempt_thresh is marked above. If it is 0, no preemption
> will occur. On a single processor system, this should allow the CPU bound
> thread to run for as long as its quantum lasts.

I would like to disagree here. From what I understand, preemption is
*not* the basis of task switching. AFAIK preemption is an additional
feature that allows switching threads while they execute in kernel mode.
While executing in user mode, a thread can be interrupted and switched
at any time; that is how the traditional time-sharing systems did it.

Traditionally a thread would execute in kernel mode only during
interrupts and syscalls, and those last no longer than a few ms, so for
a long time this was not an issue. Only when fast interfaces (10 Gbps
etc.) arrived and big subsystems started executing in kernel space
(traffic shapers, ZFS, etc.) did that scheme become problematic, and
preemption was invented.

According to McKusick's book, the scheduler is two-fold: an outer logic
runs a few times per second and calculates priorities, and an inner
logic runs very often (at every interrupt?) and chooses the next
runnable thread simply by priority. The meaning of the quantum is then:
when it is used up, the thread is moved to the end of its queue, so that
it may take a while until it runs again. This implements round-robin
behaviour within a single queue (= a single priority). It should not
prevent task switching as such.
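To make that model concrete, here is a minimal user-space sketch of such
a two-level scheme. (This is an illustration only, not kernel code; all
names and the priority values are made up.)

    #include <stdio.h>

    #define NPRI 64                       /* one FIFO run queue per priority */

    struct td {                           /* hypothetical thread record */
        int pid;
        struct td *next;
    };

    static struct td *head[NPRI], *tail[NPRI];

    static void enqueue(int pri, struct td *t)
    {
        t->next = NULL;
        if (head[pri] == NULL)
            head[pri] = t;
        else
            tail[pri]->next = t;
        tail[pri] = t;
    }

    /* Inner logic: always pick the head of the best (lowest-numbered)
     * non-empty queue -- pure selection by priority. */
    static struct td *pick_next(void)
    {
        for (int pri = 0; pri < NPRI; pri++)
            if (head[pri] != NULL)
                return head[pri];
        return NULL;                      /* idle */
    }

    /* Quantum expiry: rotate the running thread to the tail of its OWN
     * queue.  This gives round robin among equal priorities, but never
     * keeps a better-priority thread from being picked next time. */
    static void quantum_expired(int pri)
    {
        struct td *t = head[pri];
        if (t == NULL || t->next == NULL) /* alone in its queue */
            return;
        head[pri] = t->next;
        t->next = NULL;
        tail[pri]->next = t;
        tail[pri] = t;
    }

    int main(void)
    {
        struct td io  = { .pid = 66683 }; /* I/O bound, better priority */
        struct td cpu = { .pid = 18196 }; /* CPU bound, worse priority  */
        enqueue(13, &io);
        enqueue(24, &cpu);
        printf("next: %d\n", pick_next()->pid);  /* 66683, always */
        quantum_expired(13);
        printf("next: %d\n", pick_next()->pid);  /* still 66683   */
        return 0;
    }

Under such a scheme a quantum never lets a worse-priority thread block a
better-priority one - keep that in mind when reading the logs below.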
Let's have a look. sched_choose() seems to be the low-level scheduler
function that decides which thread to run next. Let's create a log of
its decisions. [1]

With preempt_thresh >= 12 (kernel threads left out):

  PID COMMAND TIMESTAMP
18196 bash    1192.549
18196 bash    1192.554
18196 bash    1192.559
66683 lz4     1192.560
18196 bash    1192.560
18196 bash    1192.562
18196 bash    1192.563
18196 bash    1192.564
79496 ntpd    1192.569
18196 bash    1192.569
18196 bash    1192.574
18196 bash    1192.579
18196 bash    1192.584
18196 bash    1192.588
18196 bash    1192.589
18196 bash    1192.594
18196 bash    1192.599
18196 bash    1192.604
18196 bash    1192.609
18196 bash    1192.613
18196 bash    1192.614
18196 bash    1192.619
18196 bash    1192.624
18196 bash    1192.629
18196 bash    1192.634
18196 bash    1192.638
18196 bash    1192.639
18196 bash    1192.644
18196 bash    1192.649
18196 bash    1192.654
66683 lz4     1192.654
18196 bash    1192.655
18196 bash    1192.655
18196 bash    1192.659

The worker is indeed run only after 95 ms.

And with preempt_thresh < 8:

  PID COMMAND TIMESTAMP
18196 bash    1268.955
66683 lz4     1268.956
18196 bash    1268.956
66683 lz4     1268.956
18196 bash    1268.957
66683 lz4     1268.957
18196 bash    1268.957
66683 lz4     1268.958
18196 bash    1268.958
66683 lz4     1268.959
18196 bash    1268.959
66683 lz4     1268.959
18196 bash    1268.960
66683 lz4     1268.960
18196 bash    1268.961
66683 lz4     1268.961
18196 bash    1268.961
66683 lz4     1268.962
18196 bash    1268.962

Here we have 3 Csw (context switches) per millisecond.

(The fact that the decisions are overall more frequent is easily
explained: when lz4 gets to run, it does disk I/O, which quickly returns
and triggers new decisions.)

In the second record things are clear: while lz4 does disk I/O, the
scheduler MUST run bash, because nothing else is runnable. When data
arrives, it runs lz4 again. But in the first record - why does the
scheduler choose bash, although lz4 has a much higher priority (52
versus 97, usually)?

> A value of 120 (corresponding to PRI=20 in top) will allow the I/O bound
> thread to preempt any other thread with lower priority (cpri > pri). But in
> case of a high priority kernel thread being active during this test (with a
> low numeric cpri value), the I/O bound process will not preempt that higher
> priority thread (i.e. some high priority kernel thread).
>
> Whether the I/O bound thread will run (instead of the compute bound) after
> the higher priority thread has given up the CPU, will depend on the scheduler
> decision which thread to select. And for "timeshare" threads, this will often
> not be the higher priority (I/O bound) thread, but the compute bound thread,
> which then may execute until next being interrupted by the I/O bound thread
> (which will not happen, if no new I/O has been requested).

Exactly. But why does this happen?

Following the trail onwards from sched_choose() we find tdq_choose(),
which is quite straightforward in looking for something runnable. The
realtime and idle queues are simple and grab the thread with the highest
priority, but the timeshare queue (the interesting piece here) has a
special gimmick: it calls kern_switch.c:runq_choose_from() with an extra
parameter, tdq_ridx ("Current removal index"). From the design papers I
get the clue that the heads of the timeshare runqueues are arranged in
some kind of circular buffer - this is a specialty of SCHED_ULE;
SCHED_4BSD does not have this feature and does not call
runq_choose_from() - and tdq_ridx is the current index into that circle.
This mechanism (which I do not fully understand yet) is, AFAIK, the only
"element of distortion" that can cause a thread other than the
highest-priority one to be selected to run.
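Based on my reading of tdq_runq_add() and runq_choose_from(), the
mechanism can be modelled roughly like this (a simplified sketch, not
the kernel code; names and numbers are mine):

    #include <stdio.h>

    #define RQ_NQS 64             /* number of timeshare run queues */

    static int queue_len[RQ_NQS]; /* threads queued per slot (simplified) */
    static int tdq_idx;           /* insert index, advanced by clock ticks */
    static int tdq_ridx;          /* removal index: where selection starts */

    /* Insertion: a timeshare thread's priority becomes a DISTANCE ahead
     * of the insert index, so better priority means queued closer to
     * the removal point. */
    static void ts_enqueue(int prio_offset)
    {
        queue_len[(tdq_idx + prio_offset) % RQ_NQS]++;
    }

    /* Selection: scan circularly starting at tdq_ridx and take the
     * FIRST non-empty slot -- not the best priority overall.  A
     * better-priority thread queued behind the removal index must wait
     * until the index wraps around to it. */
    static int ts_choose(void)
    {
        for (int i = 0; i < RQ_NQS; i++) {
            int slot = (tdq_ridx + i) % RQ_NQS;
            if (queue_len[slot] > 0)
                return slot;
        }
        return -1;                /* timeshare range empty */
    }

    int main(void)
    {
        tdq_idx = tdq_ridx = 10;
        ts_enqueue(5);            /* good priority -> slot 15 */
        ts_enqueue(40);           /* bad priority  -> slot 50 */
        printf("chosen: slot %d\n", ts_choose()); /* 15: closer wins */
        return 0;
    }

The crucial point: within the timeshare range, priority is encoded as
distance on the circle, and the circle only turns as fast as tdq_ridx
advances.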
> For preemption to occur, pri must be numerically lower than cpri, and
> pri must be numerically lower than or equal to preempt_thresh.

Maybe. But as I wrote above, preemption only concerns threads running in
kernel mode. What I want is simply normal task switching between the two
processes running in user mode.

> But, in fact, the same scheduler selection should have occured in test (a),
> too, if e.g. a soft interrupt preempts the compute bound thread. Not sure,
> why this does not happen ... (And this may be an indication, that I do not
> fully understand what's going on ;-) ...)

I think You do understand it - or we both don't understand it ;) - and
there is still something else going on here which we do not yet see.

> The PRI=85 values in your test case (b) correspond to pri=185, and with
> preempt_thresh slightly higher than that, the lz4 process should still get a
> 50% share of the CPU. (If its PRI grows over that of the CPU bound process,
> it will not be able to preempt it, so its PRI should match the one of the CPU
> bound process.)

As mentioned above, this does not work in practice. What I did instead
was take a look at this circular index, tdq_ridx.

And what I found there is very strange: that index indeed moves from 0
to 63 (we have only one runq for every 4 priorities) and wraps around,
and it takes 1/2 second to get around - except when the problem appears.
Then it suddenly takes more than six seconds to get around!

tdq_ridx is defined within sched_ule.c, and it is modified only here:

> 	/*
> 	 * Advance the insert index once for each tick to ensure that all
> 	 * threads get a chance to run.
> 	 */
> 	if (tdq->tdq_idx == tdq->tdq_ridx) {
> 		tdq->tdq_idx = (tdq->tdq_idx + 1) % RQ_NQS;
> 		if (TAILQ_EMPTY(&tdq->tdq_timeshare.rq_queues[tdq->tdq_ridx]))
> 			tdq->tdq_ridx = tdq->tdq_idx;
> 	}

The comment suggests that tdq_ridx should increment with timer ticks,
i.e. at constant speed. And if the clock in question is stathz, the
numbers do figure:

    64 / stathz = 0.5 seconds

Here are the logs [2], showing that the speed indeed changes greatly,
but only in the situation where the problem appears:

preempt_thresh = 0:

tdq_ridx  timestamp
 0        32.948718044
 0        33.453705474
 0        33.998990897
 0        34.503690368
 0        35.003946192
 0        35.508674863
 0        36.016583301
 0        36.518663899

preempt_thresh = 80:

 0        69.056754055
 0        69.561710522
 0        70.066654974
 0        70.571848226
 0        71.076660214
 0        71.576621653
 0        72.081640256
 0        72.586606139

piglet running (adding the half-round 32 for proof):

 0        32.100767726
32        32.369727157
 0        32.619581572
32        32.965831183
 0        33.220777200
32        33.480751369
 0        33.730745086
32        33.980736953
 0        34.235749995
32        34.494315832
 0        34.744162342
32        35.050767367
 0        35.315405397
32        35.560715408
 0        35.820902945
32        36.068976384

piglet and worker running - now the clock is slow!

32        67.164163127
 0        70.368819114
32        73.581467022
 0        76.817670952
32        77.195794807
 0        77.465528087
32        77.778554940
 0        78.644335854
32        82.045835390
 0        85.258399007
32        88.478827844
 0        91.746418960
32        95.226730336

preempt_thresh back to 0:

32        47.988086885
 0        48.255774908
32        48.507861276
 0        48.765327585
32        49.011779231
 0        49.268056274
32        49.515563800
 0        49.815179616
32        50.067504569
 0        50.322943146
32        50.572696557
 0        50.823302445
32        51.074755232
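The expected rotation period, and what the slow logs imply, is simple
arithmetic. (A sketch: 128 is merely an assumed typical stathz - the
real value is reported by "sysctl kern.clockrate" - and 6.4 s is read
off the "piglet and worker" log above.)

    #include <stdio.h>

    #define RQ_NQS 64   /* slots in the circular timeshare queue */

    int main(void)
    {
        /* Assumed stats clock rate; often 127 or 128 on FreeBSD. */
        double stathz = 128.0;

        /* tdq_ridx can advance at most one slot per stats clock tick,
         * so a full revolution should take RQ_NQS ticks: */
        printf("expected: %.2f s per revolution\n", RQ_NQS / stathz);

        /* The "piglet and worker" log shows ~6.4 s between successive
         * visits of the same index, as if the clock ran at ~10 Hz: */
        double observed_period = 6.4;
        printf("apparent rate: %.1f Hz instead of %.0f Hz\n",
            RQ_NQS / observed_period, stathz);
        return 0;
    }

That is a slowdown by a factor of roughly 13, matching the stretched
intervals in the log.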
I currently do not see how (or whether) this would lead to the
suboptimal selection of a runnable thread - but it might indeed do so.
And if there is really a problem with the clock, that may have further
effects, too. But then I have no clue what to make of this, or where to
track it further. It might be that there is no problem with the
scheduler at all, and instead the preemption that happens does something
bad to the clock. Or whatever...

P.