From owner-freebsd-arch@FreeBSD.ORG Fri Dec 10 21:41:54 2010 Return-Path: Delivered-To: arch@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 43D2B1065674 for ; Fri, 10 Dec 2010 21:41:54 +0000 (UTC) (envelope-from jhb@freebsd.org) Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42]) by mx1.freebsd.org (Postfix) with ESMTP id 0524D8FC08 for ; Fri, 10 Dec 2010 21:41:54 +0000 (UTC) Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net [66.111.2.69]) by cyrus.watson.org (Postfix) with ESMTPSA id 7E49E46B29; Fri, 10 Dec 2010 16:41:53 -0500 (EST) Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9]) by bigwig.baldwin.cx (Postfix) with ESMTPSA id 3340D8A01D; Fri, 10 Dec 2010 16:41:52 -0500 (EST) From: John Baldwin To: Kostik Belousov Date: Fri, 10 Dec 2010 16:41:51 -0500 User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20101102; KDE/4.4.5; amd64; ; ) References: <201012101050.45214.jhb@freebsd.org> <201012101133.55389.jhb@freebsd.org> <20101210195716.GE33073@deviant.kiev.zoral.com.ua> In-Reply-To: <20101210195716.GE33073@deviant.kiev.zoral.com.ua> MIME-Version: 1.0 Content-Type: Text/Plain; charset="iso-8859-15" Content-Transfer-Encoding: 7bit Message-Id: <201012101641.51652.jhb@freebsd.org> X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.6 (bigwig.baldwin.cx); Fri, 10 Dec 2010 16:41:52 -0500 (EST) X-Virus-Scanned: clamav-milter 0.96.3 at bigwig.baldwin.cx X-Virus-Status: Clean X-Spam-Status: No, score=-1.9 required=4.2 tests=BAYES_00 autolearn=ham version=3.3.1 X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on bigwig.baldwin.cx Cc: arch@freebsd.org Subject: Re: Realtime thread priorities X-BeenThere: freebsd-arch@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussion related to FreeBSD architecture List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 10 Dec 2010 21:41:54 -0000 On Friday, December 10, 2010 2:57:16 pm Kostik Belousov wrote: > On Fri, Dec 10, 2010 at 11:33:55AM -0500, John Baldwin wrote: > > On Friday, December 10, 2010 11:26:31 am Kostik Belousov wrote: > > > On Fri, Dec 10, 2010 at 10:50:45AM -0500, John Baldwin wrote: > > > > So I finally had a case today where I wanted to use rtprio but it doesn't seem > > > > very useful in its current state. Specifically, I want to be able to tag > > > > certain user processes as being more important than any other user processes > > > > even to the point that if one of my important processes blocks on a mutex, the > > > > owner of that mutex should be more important than sshd being woken up from > > > > sbwait by new data (for example). This doesn't work currently with rtprio due > > > > to the way the priorities are laid out (and I believe I probably argued for > > > > the current layout back when it was proposed). > > > > > > > > The current layout breaks up the global thread priority space (0 - 255) into a > > > > couple of bands: > > > > > > > > 0 - 63 : interrupt threads > > > > 64 - 127 : kernel sleep priorities (PSOCK, etc.) > > > > 128 - 159 : real-time user threads (rtprio) > > > > 160 - 223 : time-sharing user threads > > > > 224 - 255 : idle threads (idprio and kernel idle procs) > > > > > > > > The problem I am running into is that when a time-sharing thread goes to sleep > > > > in the kernel (waiting on select, socket data, tty, etc.) it actually ends up > > > > in the kernel priorities range (64 - 127). This means when it wakes up it > > > > will trump (and preempt) a real-time user thread even though these processes > > > > nominally have a priority down in the 160 - 223 range. We do drop the kernel > > > > sleep priority during userret(), but we don't recheck the scheduler queues to > > > > see if we should preempt the thread during userret(), so it effectively runs > > > > with the kernel sleep priority for the rest of the quantum while it is in > > > > userland. > > > > > > > > My first question is if this behavior is the desired behavior? Originally I > > > > think I preferred the current layout because I thought a thread in the kernel > > > > should always have priority so it can release locks, etc. However, priority > > > > propagation should actually handle the case of some very important thread > > > > needing a lock. In my use case today where I actually want to use rtprio I > > > > think I want different behavior where the rtprio thread is more important than > > > > the thread waking up with PSOCK, etc. > > > > > > > > If we decide to change the behavior I see two possible fixes: > > > > > > > > 1) (easy) just move the real-time priority range above the kernel sleep > > > > priority range > > > > > > > > 2) (harder) make sched_userret() check the run queue to see if it should > > > > preempt when dropping the kernel sleep priority. I think bde@ has suggested > > > > that we should do this for correctness previously (and I've had some old, > > > > unfinished patches to do this in a branch in p4 for several years). > > > > > > Would not doing #2 allow e.g. two threads that perform ping-pong with > > > a single byte read/write into a socket to usurp the CPU ? The threads > > > could try to also do some CPU-intensive calculations for some time > > > during the quantum too. > > > > > > Such threads are arguably "interactive", but I think that the gain is > > > priority is too unfair. > > > > Err, I think that what you describe is the current case and is what #2 would > > seek to change. > > Sorry, might be my language was not clear, but I said "Would not doing > #2 allow ...", i.e. I specifically mean that we shall do #2 to avoid the > situation I described. Ah, yes, it does allow that. As bde@ said though, the overhead of extra context switches in the common case might not be worth it. I have a possible patch for 1), but it involves fixing a few places and is only compile tested yet (will run test it soon). I also think that in my case I almost always want 1) anyway (my realtime processes are always more important than sshd, even while sshd is in the kernel): Index: kern/kern_synch.c =================================================================== --- kern/kern_synch.c (revision 215592) +++ kern/kern_synch.c (working copy) @@ -214,7 +214,8 @@ * Adjust this thread's priority, if necessary. */ pri = priority & PRIMASK; - if (pri != 0 && pri != td->td_priority) { + if (pri != 0 && pri != td->td_priority && + td->td_pri_class == PRI_TIMESHARE) { thread_lock(td); sched_prio(td, pri); thread_unlock(td); @@ -552,7 +553,8 @@ { thread_lock(td); - sched_prio(td, PRI_MAX_TIMESHARE); + if (td->td_pri_class == PRI_TIMESHARE) + sched_prio(td, PRI_MAX_TIMESHARE); mi_switch(SW_VOL, NULL); thread_unlock(td); td->td_retval[0] = 0; Index: kern/subr_sleepqueue.c =================================================================== --- kern/subr_sleepqueue.c (revision 215592) +++ kern/subr_sleepqueue.c (working copy) @@ -693,7 +720,8 @@ /* Adjust priority if requested. */ MPASS(pri == -1 || (pri >= PRI_MIN && pri <= PRI_MAX)); - if (pri != -1 && td->td_priority > pri) + if (pri != -1 && td->td_priority > pri && + td->td_pri_class == PRI_TIMESHARE) sched_prio(td, pri); return (setrunnable(td)); } Index: sys/priority.h =================================================================== --- sys/priority.h (revision 215592) +++ sys/priority.h (working copy) @@ -68,8 +68,8 @@ * are insignificant. Ranges are as follows: * * Interrupt threads: 0 - 63 - * Top half kernel threads: 64 - 127 - * Realtime user threads: 128 - 159 + * Realtime user threads: 64 - 95 + * Top half kernel threads: 96 - 159 * Time sharing user threads: 160 - 223 * Idle user threads: 224 - 255 * @@ -81,7 +81,7 @@ #define PRI_MAX (255) /* Lowest priority. */ #define PRI_MIN_ITHD (PRI_MIN) -#define PRI_MAX_ITHD (PRI_MIN_KERN - 1) +#define PRI_MAX_ITHD (PRI_MIN_REALTIME - 1) #define PI_REALTIME (PRI_MIN_ITHD + 0) #define PI_AV (PRI_MIN_ITHD + 4) @@ -94,9 +94,12 @@ #define PI_DULL (PRI_MIN_ITHD + 32) #define PI_SOFT (PRI_MIN_ITHD + 36) -#define PRI_MIN_KERN (64) -#define PRI_MAX_KERN (PRI_MIN_REALTIME - 1) +#define PRI_MIN_REALTIME (64) +#define PRI_MAX_REALTIME (PRI_MIN_KERN - 1) +#define PRI_MIN_KERN (96) +#define PRI_MAX_KERN (PRI_MIN_TIMESHARE - 1) + #define PSWP (PRI_MIN_KERN + 0) #define PVM (PRI_MIN_KERN + 4) #define PINOD (PRI_MIN_KERN + 8) @@ -109,9 +112,6 @@ #define PLOCK (PRI_MIN_KERN + 36) #define PPAUSE (PRI_MIN_KERN + 40) -#define PRI_MIN_REALTIME (128) -#define PRI_MAX_REALTIME (PRI_MIN_TIMESHARE - 1) - #define PRI_MIN_TIMESHARE (160) #define PRI_MAX_TIMESHARE (PRI_MIN_IDLE - 1) -- John Baldwin