From owner-freebsd-arch@FreeBSD.ORG  Tue Sep 28 15:43:21 2004
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 692CA16A4CE
	for <freebsd-arch@FreeBSD.org>; Tue, 28 Sep 2004 15:43:21 +0000 (GMT)
Received: from mail4.speakeasy.net (mail4.speakeasy.net [216.254.0.204])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 2B10B43D54
	for <freebsd-arch@FreeBSD.org>; Tue, 28 Sep 2004 15:43:21 +0000 (GMT)
	(envelope-from jhb@FreeBSD.org)
Received: (qmail 27103 invoked from network); 28 Sep 2004 15:43:20 -0000
Received: from dsl027-160-063.atl1.dsl.speakeasy.net (HELO server.baldwin.cx)
	([216.27.160.63])          (envelope-sender <jhb@FreeBSD.org>)
	encrypted SMTP
	for <julian@elischer.org>; 28 Sep 2004 15:43:19 -0000
Received: from [10.50.40.210] (gw1.twc.weather.com [216.133.140.1])
	(authenticated bits=0)
	by server.baldwin.cx (8.12.11/8.12.11) with ESMTP id i8SFhF2M002165;
	Tue, 28 Sep 2004 11:43:15 -0400 (EDT)
	(envelope-from jhb@FreeBSD.org)
From: John Baldwin <jhb@FreeBSD.org>
To: freebsd-arch@FreeBSD.org
Date: Tue, 28 Sep 2004 10:56:00 -0400
User-Agent: KMail/1.6.2
References: <1096133353.53798.17613.camel@palm.tree.com>
	<200409271443.22667.jhb@FreeBSD.org> <1096320486.3733.58.camel@palm.tree.com>
In-Reply-To: <1096320486.3733.58.camel@palm.tree.com>
MIME-Version: 1.0
Content-Disposition: inline
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <200409281056.00870.jhb@FreeBSD.org>
X-Spam-Checker-Version: SpamAssassin 2.63 (2004-01-11) on server.baldwin.cx
cc: "freebsd-arch@freebsd.org" <freebsd-arch@FreeBSD.org>
cc: Julian Elischer <julian@elischer.org>
cc: Stephan Uphoff <ups@tree.com>
Subject: Re: sched_userret priority adjustment patch for sched_4bsd
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.1
Precedence: list
List-Id: Discussion related to FreeBSD architecture
	<freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Tue, 28 Sep 2004 15:43:21 -0000

On Monday 27 September 2004 05:28 pm, Stephan Uphoff wrote:
> On Mon, 2004-09-27 at 14:43, John Baldwin wrote:
> > On Monday 27 September 2004 01:20 pm, Stephan Uphoff wrote:
> > > On Mon, 2004-09-27 at 10:16, John Baldwin wrote:
> > > > On Saturday 25 September 2004 01:29 pm, Stephan Uphoff wrote:
> > > > > When a thread is about to return to user space it resets its
> > > > > priority to the user level priority.
> > > > > However after lowering the permission its priority it needs to
> > > > > check if its priority is still better than all other runable
> > > > > threads. This is currently not implemented.
> > > > > Without the check the thread can block kernel or user threads with
> > > > > better priority until a switch is forced by by an interrupt.
> > > > >
> > > > > The attached patch checks the relevant runqueues and threads
> > > > > without slots in the same ksegrp and forces a thread switch if the
> > > > > currently running thread is no longer the best thread to run after
> > > > > it changed its priority.
> > > > >
> > > > > The patch should improve interactive response under heavy load
> > > > > somewhat. It needs a lot of testing.
> > > >
> > > > Perhaps the better fix is to teach the schedulers to set
> > > > TDF_NEEDRESCHED based on on a comparison against user_pri rather than
> > > > td_priority inside of sched_add()?  Having the flag set by
> > > > sched_add() is supposed to make this sort of check unnecessary.  Even
> > > > 4.x has the same bug I think as a process can make another process
> > > > runnable after it's priority has been boosted by a tsleep() and
> > > > need_resched() is only called based on a comparison of p_pri. Ah, 4.x
> > > > doesn't have the bug because it caches the priority of curproc when
> > > > it enters the kernel and compares against that. Thus, I think the
> > > > correct fix is more like this:
> > > >
> > > > Index: sched_4bsd.c
> > > > ===================================================================
> > > > RCS file: /usr/cvs/src/sys/kern/sched_4bsd.c,v
> > > > retrieving revision 1.63
> > > > diff -u -r1.63 sched_4bsd.c
> > > > --- sched_4bsd.c        11 Sep 2004 10:07:22 -0000      1.63
> > > > +++ sched_4bsd.c        27 Sep 2004 14:12:03 -0000
> > > > @@ -272,7 +272,7 @@
> > > >  {
> > > >
> > > >         mtx_assert(&sched_lock, MA_OWNED);
> > > > -       if (td->td_priority < curthread->td_priority)
> > > > +       if (td->td_priority < curthread->td_ksegrp->kg_user_pri)
> > > >                 curthread->td_flags |= TDF_NEEDRESCHED;
> > > >  }
> > > >
> > > > Index: sched_ule.c
> > > > ===================================================================
> > > > RCS file: /usr/cvs/src/sys/kern/sched_ule.c,v
> > > > retrieving revision 1.129
> > > > diff -u -r1.129 sched_ule.c
> > > > --- sched_ule.c 11 Sep 2004 10:07:22 -0000      1.129
> > > > +++ sched_ule.c 27 Sep 2004 14:13:01 -0000
> > > > @@ -723,7 +723,7 @@
> > > >          */
> > > >         pcpu = pcpu_find(cpu);
> > > >         td = pcpu->pc_curthread;
> > > > -       if (ke->ke_thread->td_priority < td->td_priority ||
> > > > +       if (ke->ke_thread->td_priority < td->td_ksegrp->kg_user_pri
> > > > || td == pcpu->pc_idlethread) {
> > > >                 td->td_flags |= TDF_NEEDRESCHED;
> > > >                 ipi_selected(1 << cpu, IPI_AST);
> > > >
> > > > An even better fix might be to fix td_base_pri by having it be set on
> > > > kernel entry similar to how 4.x sets curpriority.  The above fix
> > > > should be sufficient for now, however.
> > >
> > > I don't think that this is enough since TDF_NEEDRESCHED is thread
> > > specific and not cpu specific.
> >
> > Hmm, it is CPU specific in 4.x.  It could be changed back to being a
> > per-cpu flag easily.
>
> But this might not help :-(.
> Example:
>
> Thread A is running in the kernel and is preempted by an interrupt
> Thread I. Thread I wakes up thread B.
> If I->td_ksegrp->kg_user_pri <= B->td_priority TDF_NEEDRESCHED will not
> be set.
> If A->td_priority < B->td_priority thread A will run once I is finished
> serving interrupts.
> Thread A can now leave the kernel also A->td_ksegrp->kg_user_pri >
> B->td_priority may be true.

If A has a priority boost from tsleep() this is intentional, however.  The 
priroity boosts from tsleep() are _supposed_ to do this so as to favor 
interactive tasks.  Note that if you add the code to always raise td_priority 
while in the kernel as below you may end up defeating this well-known feature 
of the 4BSD scheduler.

> > > However the thread marked with TDF_NEEDRESCHED might not be the next
> > > thread leaving the kernel.
> > > ( Can't really talk about ULE since I am trying to avoid looking at
> > > another shiny irresistible time sink this week ;-)
> > >
> > > I think we agree that that td_priority should be set to td_base_pri on
> > > kernel entry. Since td_base_pri is changed by sleep and condvar
> > > functions it should also be reset on kernel entry. (Probably from a new
> > > ksegrp field). Condvar waits should currently non cause the base
> > > priority to change to the current priority of the thread - otherwise
> > > td_base_pri could get stuck at a really bad user priority.
> > > ( td->td_base_pri might end up being worse than
> > > td->td_ksegrp->kg_user_pri when the ksegrp priority improves)
> >
> > Well, I think instead that td_base_pri should be set to td_priority on
> > kernel entry (rather than the other way around).  td_priority should be
> > unchanged just because it enters the kernel.
>
> I guess we disagree here.
> There are just to many resource dependencies in the kernel that can lead
> to priority inversion (vnode locks, disk buffer ownership, etc).
> It would be nice to delay the priority boost until a thread acquires
> such a resource (or even trace resource dependencies and implement
> priority inheritance) ... but this would be a huge task.
> Boosting the priority on kernel entry is easy and less error prone. I
> guess we had this discussion last week and we just disagree on the
> issue.

Well, I think you don't understand exactly what td_base_pri is supposed to 
do. :)  If you want to boost td_priority on kernel entry that is fine, but is 
completely orthogonal to this discussion.  If you wanted to that then you 
might do something like this:

kernel_entry()
{
	if (td->td_priority > PRI_KERN_MAX)
		sched_prio(td, PRI_KERN_MAX);
	td_base_pri = td->td_priority;
}

i.e., just add the boost to the kernel_entry function.

> > I think the sleep functions could then
> > leave td_base_pri alone.  (I think setting it there is wrong because
> > td_base_pri is not quite the same as curpriority in 4.x.)  What
> > td_base_pri is really supposed to provide, btw, is the priority that the
> > thread should go back to once it has unlocked a mutex and had its priorty
> > boosted while it held the mutex.  Arguably it should just be using
> > kg_user_pri for this, but then you loose priority "boosts" from tsleep(),
> > which is why td_base_pri is set in msleep().  I guess what should happen
> > is something more like this:
> >
> > kernel_entry()
> > {
> > 	KASSERT(td->td_priority == td->td_ksegrp->kg_user_pri);
> > 	td->td_base_pri = td->td_priority;
> > }
> >
> > msleep()
> > {
> > 	sched_prio(...);
> > 	td_base_pri = td->td_priority;
> > }
> >
> > The TDF_NEEDRESCHED checks should be using kg_user_pri as in my previous
> > e-mail.  Also, in sched_prio(), if our priority is ever raised
> > (numerically, logically less important), we should set TDF_NEEDRESCHED
> > since we may need to switch (4.x does this in maybe_needresched()). 
> > Then, TDF_NEEDRESCHED could become a per-cpu flag and have it not be
> > cleared in mi_switch() but be cleared only in userret().  Hmm, I think
> > all of the TDF_NEEDRESCHED handling actually beings in sched_userret()
> > btw.
>
> Wouldn't this lead to unnecessary round-robin switches between threads
> with the same priority on sched_4bsd?

It can on 4.x then.  The idea is that the occasional spurious context switch 
is cheaper than doing a lot of work on each kernel exit.

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org