From owner-freebsd-arch@FreeBSD.ORG  Sat Dec 11 06:15:03 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 4740A1065670
	for <arch@freebsd.org>; Sat, 11 Dec 2010 06:15:03 +0000 (UTC)
	(envelope-from julian@freebsd.org)
Received: from out-0.mx.aerioconnect.net (out-0-31.mx.aerioconnect.net
	[216.240.47.91])
	by mx1.freebsd.org (Postfix) with ESMTP id 1FA768FC1B
	for <arch@freebsd.org>; Sat, 11 Dec 2010 06:15:02 +0000 (UTC)
Received: from idiom.com (postfix@mx0.idiom.com [216.240.32.160])
	by out-0.mx.aerioconnect.net (8.13.8/8.13.8) with ESMTP id
	oBB5v1h7015195; Fri, 10 Dec 2010 21:57:02 -0800
X-Client-Authorized: MaGic Cook1e
Received: from julian-mac.elischer.org
	(h-67-100-89-137.snfccasy.static.covad.net [67.100.89.137])
	by idiom.com (Postfix) with ESMTP id D77652D6019;
	Fri, 10 Dec 2010 21:57:00 -0800 (PST)
Message-ID: <4D0312AA.7010009@freebsd.org>
Date: Fri, 10 Dec 2010 21:56:58 -0800
From: Julian Elischer <julian@freebsd.org>
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10.4; en-US;
	rv:1.9.2.12) Gecko/20101027 Thunderbird/3.1.6
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <201012101050.45214.jhb@freebsd.org>	<201012101133.55389.jhb@freebsd.org>	<20101210195716.GE33073@deviant.kiev.zoral.com.ua>
	<201012101641.51652.jhb@freebsd.org>
In-Reply-To: <201012101641.51652.jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
X-Scanned-By: MIMEDefang 2.67 on 216.240.47.51
Cc: Kostik Belousov <kostikbel@gmail.com>, arch@freebsd.org
Subject: Re: Realtime thread priorities
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Sat, 11 Dec 2010 06:15:03 -0000

On 12/10/10 1:41 PM, John Baldwin wrote:
> On Friday, December 10, 2010 2:57:16 pm Kostik Belousov wrote:
>> On Fri, Dec 10, 2010 at 11:33:55AM -0500, John Baldwin wrote:
>>> On Friday, December 10, 2010 11:26:31 am Kostik Belousov wrote:
>>>> On Fri, Dec 10, 2010 at 10:50:45AM -0500, John Baldwin wrote:
>>>>> So I finally had a case today where I wanted to use rtprio but it doesn't seem
>>>>> very useful in its current state.  Specifically, I want to be able to tag
>>>>> certain user processes as being more important than any other user processes
>>>>> even to the point that if one of my important processes blocks on a mutex, the
>>>>> owner of that mutex should be more important than sshd being woken up from
>>>>> sbwait by new data (for example).  This doesn't work currently with rtprio due
>>>>> to the way the priorities are laid out (and I believe I probably argued for
>>>>> the current layout back when it was proposed).
>>>>>
>>>>> The current layout breaks up the global thread priority space (0 - 255) into a
>>>>> couple of bands:
>>>>>
>>>>>    0 -  63 : interrupt threads
>>>>>   64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>> 160 - 223 : time-sharing user threads
>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>
>>>>> The problem I am running into is that when a time-sharing thread goes to sleep
>>>>> in the kernel (waiting on select, socket data, tty, etc.) it actually ends up
>>>>> in the kernel priorities range (64 - 127).  This means when it wakes up it
>>>>> will trump (and preempt) a real-time user thread even though these processes
>>>>> nominally have a priority down in the 160 - 223 range.  We do drop the kernel
>>>>> sleep priority during userret(), but we don't recheck the scheduler queues to
>>>>> see if we should preempt the thread during userret(), so it effectively runs
>>>>> with the kernel sleep priority for the rest of the quantum while it is in
>>>>> userland.
>>>>>
>>>>> My first question is if this behavior is the desired behavior?  Originally I
>>>>> think I preferred the current layout because I thought a thread in the kernel
>>>>> should always have priority so it can release locks, etc.  However, priority
>>>>> propagation should actually handle the case of some very important thread
>>>>> needing a lock.  In my use case today where I actually want to use rtprio I
>>>>> think I want different behavior where the rtprio thread is more important than
>>>>> the thread waking up with PSOCK, etc.
>>>>>
>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>
>>>>> 1) (easy) just move the real-time priority range above the kernel sleep
>>>>> priority range
>>>>>
>>>>> 2) (harder) make sched_userret() check the run queue to see if it should
>>>>> preempt when dropping the kernel sleep priority.  I think bde@ has suggested
>>>>> that we should do this for correctness previously (and I've had some old,
>>>>> unfinished patches to do this in a branch in p4 for several years).

If you think about how RT scheduling works when an RT shim is stuck
under an OS, it becomes obvious that all RT threads trump all TS
threads, kernel or not.  Basically, the shim has a separate RT scheduler
that gets to schedule all RT threads, and it only bothers to run the
non-RT (TS) scheduler when there is spare time.  TS threads are only
ever scheduled by the RT scheduler when they own some resource needed
by an RT thread.
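
A rough sketch of that arrangement in C, to make the ordering concrete.
Everything here (struct thr, runs_for, pick_class, pick_next) is made up
for illustration and is not FreeBSD's scheduler code.  It assumes lower
numbers mean higher priority, as in the kernel, and that blocking chains
have no cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical thread record; NOT the FreeBSD struct thread. */
enum cls { CLS_RT, CLS_TS };

struct thr {
	const char *name;
	enum cls    cls;
	int         prio;       /* lower value = more important */
	int         runnable;
	struct thr *blocked_on; /* owner of the resource we wait for, or NULL */
};

/*
 * Return the thread that has to run for 't' to make progress: 't' itself
 * if runnable, otherwise the runnable owner at the end of its blocking
 * chain.  NULL means nothing can be run on its behalf (e.g. waiting on I/O).
 */
static struct thr *
runs_for(struct thr *t)
{
	while (!t->runnable && t->blocked_on != NULL)
		t = t->blocked_on;
	return (t->runnable ? t : NULL);
}

/* Best candidate for one class, ranked by the waiting thread's priority. */
static struct thr *
pick_class(struct thr *v, size_t n, enum cls c)
{
	struct thr *best = NULL, *cand;
	int best_prio = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (v[i].cls != c)
			continue;
		cand = runs_for(&v[i]);
		if (cand == NULL)
			continue;
		if (best == NULL || v[i].prio < best_prio) {
			best = cand;
			best_prio = v[i].prio;
		}
	}
	return (best);
}

/* RT work always wins; the TS class is consulted only in spare time. */
struct thr *
pick_next(struct thr *v, size_t n)
{
	struct thr *t;

	assert(v != NULL || n == 0);
	t = pick_class(v, n, CLS_RT);
	if (t == NULL)
		t = pick_class(v, n, CLS_TS);
	return (t);
}
```

Note that a TS thread can come out of the RT pass, but only because an
RT thread is blocked on something it owns.  That is the one case where
the RT side schedules a TS thread.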

>>>> Would not doing #2 allow e.g. two threads that perform ping-pong with
>>>> a single byte read/write into a socket to usurp the CPU ? The threads
>>>> could try to also do some CPU-intensive calculations for some time
>>>> during the quantum too.
>>>>
>>>> Such threads are arguably "interactive", but I think that the gain in
>>>> priority is too unfair.

The aim of RT is to be unfair (to TS threads).

>>> Err, I think that what you describe is the current case and is what #2 would
>>> seek to change.
>> Sorry, might be my language was not clear, but I said "Would not doing
>> #2 allow ...", i.e. I specifically mean that we shall do #2 to avoid the
>> situation I described.
> Ah, yes, it does allow that.  As bde@ said though, the overhead of extra
> context switches in the common case might not be worth it.
>
> I have a possible patch for 1), but it involves fixing a few places and is
> only compile-tested so far (I will run-test it soon).  I also think that in my
> case I almost always want 1) anyway (my realtime processes are always more
> important than sshd, even while sshd is in the kernel):
>
> Index: kern/kern_synch.c
> ===================================================================
> --- kern/kern_synch.c	(revision 215592)
> +++ kern/kern_synch.c	(working copy)
> @@ -214,7 +214,8 @@
>   	 * Adjust this thread's priority, if necessary.
>   	 */
>   	pri = priority & PRIMASK;
> -	if (pri != 0 && pri != td->td_priority) {
> +	if (pri != 0 && pri != td->td_priority &&
> +	    td->td_pri_class == PRI_TIMESHARE) {
>   		thread_lock(td);
>   		sched_prio(td, pri);
>   		thread_unlock(td);
> @@ -552,7 +553,8 @@
>   {
>
>   	thread_lock(td);
> -	sched_prio(td, PRI_MAX_TIMESHARE);
> +	if (td->td_pri_class == PRI_TIMESHARE)
> +		sched_prio(td, PRI_MAX_TIMESHARE);
>   	mi_switch(SW_VOL, NULL);
>   	thread_unlock(td);
>   	td->td_retval[0] = 0;
> Index: kern/subr_sleepqueue.c
> ===================================================================
> --- kern/subr_sleepqueue.c	(revision 215592)
> +++ kern/subr_sleepqueue.c	(working copy)
> @@ -693,7 +720,8 @@
>
>   	/* Adjust priority if requested. */
>   	MPASS(pri == -1 || (pri >= PRI_MIN && pri <= PRI_MAX));
> -	if (pri != -1 && td->td_priority > pri)
> +	if (pri != -1 && td->td_priority > pri &&
> +	    td->td_pri_class == PRI_TIMESHARE)
>   		sched_prio(td, pri);
>   	return (setrunnable(td));
>   }
> Index: sys/priority.h
> ===================================================================
> --- sys/priority.h	(revision 215592)
> +++ sys/priority.h	(working copy)
> @@ -68,8 +68,8 @@
>    * are insignificant.  Ranges are as follows:
>    *
>    * Interrupt threads:		0 - 63
> - * Top half kernel threads:	64 - 127
> - * Realtime user threads:	128 - 159
> + * Realtime user threads:	64 - 95
> + * Top half kernel threads:	96 - 159
>    * Time sharing user threads:	160 - 223
>    * Idle user threads:		224 - 255
>    *
> @@ -81,7 +81,7 @@
>   #define	PRI_MAX			(255)		/* Lowest priority. */
>
>   #define	PRI_MIN_ITHD		(PRI_MIN)
> -#define	PRI_MAX_ITHD		(PRI_MIN_KERN - 1)
> +#define	PRI_MAX_ITHD		(PRI_MIN_REALTIME - 1)
>
>   #define	PI_REALTIME		(PRI_MIN_ITHD + 0)
>   #define	PI_AV			(PRI_MIN_ITHD + 4)
> @@ -94,9 +94,12 @@
>   #define	PI_DULL			(PRI_MIN_ITHD + 32)
>   #define	PI_SOFT			(PRI_MIN_ITHD + 36)
>
> -#define	PRI_MIN_KERN		(64)
> -#define	PRI_MAX_KERN		(PRI_MIN_REALTIME - 1)
> +#define	PRI_MIN_REALTIME	(64)
> +#define	PRI_MAX_REALTIME	(PRI_MIN_KERN - 1)
>
> +#define	PRI_MIN_KERN		(96)
> +#define	PRI_MAX_KERN		(PRI_MIN_TIMESHARE - 1)
> +
>   #define	PSWP			(PRI_MIN_KERN + 0)
>   #define	PVM			(PRI_MIN_KERN + 4)
>   #define	PINOD			(PRI_MIN_KERN + 8)
> @@ -109,9 +112,6 @@
>   #define	PLOCK			(PRI_MIN_KERN + 36)
>   #define	PPAUSE			(PRI_MIN_KERN + 40)
>
> -#define	PRI_MIN_REALTIME	(128)
> -#define	PRI_MAX_REALTIME	(PRI_MIN_TIMESHARE - 1)
> -
>   #define	PRI_MIN_TIMESHARE	(160)
>   #define	PRI_MAX_TIMESHARE	(PRI_MIN_IDLE - 1)
>
>
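
For reference, the band layout in the patch above can be written out as
a small standalone check.  This is only a sketch: the macro values are
copied from the diff, except PRI_MIN (0) and PRI_MIN_IDLE (224), which I
am taking from the comment block since their #defines are not visible in
the hunks.  enum band and pri_to_band() are hypothetical helpers, not
something in sys/priority.h.

```c
#include <assert.h>

/* Proposed layout from the patch; lower value = higher priority. */
#define PRI_MIN            0    /* assumed from the comment block */
#define PRI_MAX            255

#define PRI_MIN_ITHD       PRI_MIN
#define PRI_MAX_ITHD       (PRI_MIN_REALTIME - 1)

#define PRI_MIN_REALTIME   64
#define PRI_MAX_REALTIME   (PRI_MIN_KERN - 1)

#define PRI_MIN_KERN       96
#define PRI_MAX_KERN       (PRI_MIN_TIMESHARE - 1)
#define PPAUSE             (PRI_MIN_KERN + 40)

#define PRI_MIN_TIMESHARE  160
#define PRI_MAX_TIMESHARE  (PRI_MIN_IDLE - 1)

#define PRI_MIN_IDLE       224  /* assumed from the comment block */
#define PRI_MAX_IDLE       PRI_MAX

/* The weakest realtime priority must still beat every kernel sleep priority. */
_Static_assert(PRI_MAX_REALTIME < PRI_MIN_KERN, "realtime must outrank kernel sleep");
/* The kernel sleep priorities visible in the diff must still fit in their band. */
_Static_assert(PPAUSE <= PRI_MAX_KERN, "kernel sleep priorities overflow their band");

/* Hypothetical helper: name the band a global priority falls into. */
enum band { BAND_ITHD, BAND_REALTIME, BAND_KERN, BAND_TIMESHARE, BAND_IDLE };

enum band
pri_to_band(int pri)
{
	assert(pri >= PRI_MIN && pri <= PRI_MAX);
	if (pri <= PRI_MAX_ITHD)
		return (BAND_ITHD);
	if (pri <= PRI_MAX_REALTIME)
		return (BAND_REALTIME);
	if (pri <= PRI_MAX_KERN)
		return (BAND_KERN);
	if (pri <= PRI_MAX_TIMESHARE)
		return (BAND_TIMESHARE);
	return (BAND_IDLE);
}
```

With this layout a thread sleeping at PPAUSE (136) lands in the kernel
band, below every realtime priority (64 - 95), so waking it up no longer
preempts an rtprio thread.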