From owner-freebsd-arch@FreeBSD.ORG Tue Sep 28 15:43:21 2004 Return-Path: Delivered-To: freebsd-arch@freebsd.org Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125]) by hub.freebsd.org (Postfix) with ESMTP id 692CA16A4CE for ; Tue, 28 Sep 2004 15:43:21 +0000 (GMT) Received: from mail4.speakeasy.net (mail4.speakeasy.net [216.254.0.204]) by mx1.FreeBSD.org (Postfix) with ESMTP id 2B10B43D54 for ; Tue, 28 Sep 2004 15:43:21 +0000 (GMT) (envelope-from jhb@FreeBSD.org) Received: (qmail 27103 invoked from network); 28 Sep 2004 15:43:20 -0000 Received: from dsl027-160-063.atl1.dsl.speakeasy.net (HELO server.baldwin.cx) ([216.27.160.63]) (envelope-sender ) encrypted SMTP for ; 28 Sep 2004 15:43:19 -0000 Received: from [10.50.40.210] (gw1.twc.weather.com [216.133.140.1]) (authenticated bits=0) by server.baldwin.cx (8.12.11/8.12.11) with ESMTP id i8SFhF2M002165; Tue, 28 Sep 2004 11:43:15 -0400 (EDT) (envelope-from jhb@FreeBSD.org) From: John Baldwin To: freebsd-arch@FreeBSD.org Date: Tue, 28 Sep 2004 10:56:00 -0400 User-Agent: KMail/1.6.2 References: <1096133353.53798.17613.camel@palm.tree.com> <200409271443.22667.jhb@FreeBSD.org> <1096320486.3733.58.camel@palm.tree.com> In-Reply-To: <1096320486.3733.58.camel@palm.tree.com> MIME-Version: 1.0 Content-Disposition: inline Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit Message-Id: <200409281056.00870.jhb@FreeBSD.org> X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on server.baldwin.cx cc: "freebsd-arch@freebsd.org" cc: Julian Elischer cc: Stephan Uphoff Subject: Re: sched_userret priority adjustment patch for sched_4bsd X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.1 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 28 Sep 2004 15:43:21 -0000 On Monday 27 September 2004 05:28 pm, Stephan Uphoff wrote: > On Mon, 2004-09-27 at 14:43, John Baldwin wrote: > > On Monday 27 September 2004 01:20 pm, Stephan Uphoff wrote: > > > On Mon, 2004-09-27 at 10:16, John Baldwin wrote: > > > > On Saturday 25 September 2004 01:29 pm, Stephan Uphoff wrote: > > > > > When a thread is about to return to user space it resets its > > > > > priority to the user level priority. > > > > > However after lowering the permission its priority it needs to > > > > > check if its priority is still better than all other runable > > > > > threads. This is currently not implemented. > > > > > Without the check the thread can block kernel or user threads with > > > > > better priority until a switch is forced by by an interrupt. > > > > > > > > > > The attached patch checks the relevant runqueues and threads > > > > > without slots in the same ksegrp and forces a thread switch if the > > > > > currently running thread is no longer the best thread to run after > > > > > it changed its priority. > > > > > > > > > > The patch should improve interactive response under heavy load > > > > > somewhat. It needs a lot of testing. > > > > > > > > Perhaps the better fix is to teach the schedulers to set > > > > TDF_NEEDRESCHED based on on a comparison against user_pri rather than > > > > td_priority inside of sched_add()? Having the flag set by > > > > sched_add() is supposed to make this sort of check unnecessary. Even > > > > 4.x has the same bug I think as a process can make another process > > > > runnable after it's priority has been boosted by a tsleep() and > > > > need_resched() is only called based on a comparison of p_pri. Ah, 4.x > > > > doesn't have the bug because it caches the priority of curproc when > > > > it enters the kernel and compares against that. Thus, I think the > > > > correct fix is more like this: > > > > > > > > Index: sched_4bsd.c > > > > =================================================================== > > > > RCS file: /usr/cvs/src/sys/kern/sched_4bsd.c,v > > > > retrieving revision 1.63 > > > > diff -u -r1.63 sched_4bsd.c > > > > --- sched_4bsd.c 11 Sep 2004 10:07:22 -0000 1.63 > > > > +++ sched_4bsd.c 27 Sep 2004 14:12:03 -0000 > > > > @@ -272,7 +272,7 @@ > > > > { > > > > > > > > mtx_assert(&sched_lock, MA_OWNED); > > > > - if (td->td_priority < curthread->td_priority) > > > > + if (td->td_priority < curthread->td_ksegrp->kg_user_pri) > > > > curthread->td_flags |= TDF_NEEDRESCHED; > > > > } > > > > > > > > Index: sched_ule.c > > > > =================================================================== > > > > RCS file: /usr/cvs/src/sys/kern/sched_ule.c,v > > > > retrieving revision 1.129 > > > > diff -u -r1.129 sched_ule.c > > > > --- sched_ule.c 11 Sep 2004 10:07:22 -0000 1.129 > > > > +++ sched_ule.c 27 Sep 2004 14:13:01 -0000 > > > > @@ -723,7 +723,7 @@ > > > > */ > > > > pcpu = pcpu_find(cpu); > > > > td = pcpu->pc_curthread; > > > > - if (ke->ke_thread->td_priority < td->td_priority || > > > > + if (ke->ke_thread->td_priority < td->td_ksegrp->kg_user_pri > > > > || td == pcpu->pc_idlethread) { > > > > td->td_flags |= TDF_NEEDRESCHED; > > > > ipi_selected(1 << cpu, IPI_AST); > > > > > > > > An even better fix might be to fix td_base_pri by having it be set on > > > > kernel entry similar to how 4.x sets curpriority. The above fix > > > > should be sufficient for now, however. > > > > > > I don't think that this is enough since TDF_NEEDRESCHED is thread > > > specific and not cpu specific. > > > > Hmm, it is CPU specific in 4.x. It could be changed back to being a > > per-cpu flag easily. > > But this might not help :-(. > Example: > > Thread A is running in the kernel and is preempted by an interrupt > Thread I. Thread I wakes up thread B. > If I->td_ksegrp->kg_user_pri <= B->td_priority TDF_NEEDRESCHED will not > be set. > If A->td_priority < B->td_priority thread A will run once I is finished > serving interrupts. > Thread A can now leave the kernel also A->td_ksegrp->kg_user_pri > > B->td_priority may be true. If A has a priority boost from tsleep() this is intentional, however. The priroity boosts from tsleep() are _supposed_ to do this so as to favor interactive tasks. Note that if you add the code to always raise td_priority while in the kernel as below you may end up defeating this well-known feature of the 4BSD scheduler. > > > However the thread marked with TDF_NEEDRESCHED might not be the next > > > thread leaving the kernel. > > > ( Can't really talk about ULE since I am trying to avoid looking at > > > another shiny irresistible time sink this week ;-) > > > > > > I think we agree that that td_priority should be set to td_base_pri on > > > kernel entry. Since td_base_pri is changed by sleep and condvar > > > functions it should also be reset on kernel entry. (Probably from a new > > > ksegrp field). Condvar waits should currently non cause the base > > > priority to change to the current priority of the thread - otherwise > > > td_base_pri could get stuck at a really bad user priority. > > > ( td->td_base_pri might end up being worse than > > > td->td_ksegrp->kg_user_pri when the ksegrp priority improves) > > > > Well, I think instead that td_base_pri should be set to td_priority on > > kernel entry (rather than the other way around). td_priority should be > > unchanged just because it enters the kernel. > > I guess we disagree here. > There are just to many resource dependencies in the kernel that can lead > to priority inversion (vnode locks, disk buffer ownership, etc). > It would be nice to delay the priority boost until a thread acquires > such a resource (or even trace resource dependencies and implement > priority inheritance) ... but this would be a huge task. > Boosting the priority on kernel entry is easy and less error prone. I > guess we had this discussion last week and we just disagree on the > issue. Well, I think you don't understand exactly what td_base_pri is supposed to do. :) If you want to boost td_priority on kernel entry that is fine, but is completely orthogonal to this discussion. If you wanted to that then you might do something like this: kernel_entry() { if (td->td_priority > PRI_KERN_MAX) sched_prio(td, PRI_KERN_MAX); td_base_pri = td->td_priority; } i.e., just add the boost to the kernel_entry function. > > I think the sleep functions could then > > leave td_base_pri alone. (I think setting it there is wrong because > > td_base_pri is not quite the same as curpriority in 4.x.) What > > td_base_pri is really supposed to provide, btw, is the priority that the > > thread should go back to once it has unlocked a mutex and had its priorty > > boosted while it held the mutex. Arguably it should just be using > > kg_user_pri for this, but then you loose priority "boosts" from tsleep(), > > which is why td_base_pri is set in msleep(). I guess what should happen > > is something more like this: > > > > kernel_entry() > > { > > KASSERT(td->td_priority == td->td_ksegrp->kg_user_pri); > > td->td_base_pri = td->td_priority; > > } > > > > msleep() > > { > > sched_prio(...); > > td_base_pri = td->td_priority; > > } > > > > The TDF_NEEDRESCHED checks should be using kg_user_pri as in my previous > > e-mail. Also, in sched_prio(), if our priority is ever raised > > (numerically, logically less important), we should set TDF_NEEDRESCHED > > since we may need to switch (4.x does this in maybe_needresched()). > > Then, TDF_NEEDRESCHED could become a per-cpu flag and have it not be > > cleared in mi_switch() but be cleared only in userret(). Hmm, I think > > all of the TDF_NEEDRESCHED handling actually beings in sched_userret() > > btw. > > Wouldn't this lead to unnecessary round-robin switches between threads > with the same priority on sched_4bsd? It can on 4.x then. The idea is that the occasional spurious context switch is cheaper than doing a lot of work on each kernel exit. -- John Baldwin <>< http://www.FreeBSD.org/~jhb/ "Power Users Use the Power to Serve" = http://www.FreeBSD.org