From owner-freebsd-arch@FreeBSD.ORG  Fri Dec 17 06:20:46 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 535C81065672;
	Fri, 17 Dec 2010 06:20:46 +0000 (UTC)
	(envelope-from davidxu@freebsd.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id 23B2B8FC14;
	Fri, 17 Dec 2010 06:20:46 +0000 (UTC)
Received: from xyf.my.dom (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id oBH6Kilh067082;
	Fri, 17 Dec 2010 06:20:44 GMT (envelope-from davidxu@freebsd.org)
Message-ID: <4D0B013F.3060203@freebsd.org>
Date: Fri, 17 Dec 2010 14:20:47 +0800
From: David Xu <davidxu@freebsd.org>
User-Agent: Thunderbird 2.0.0.24 (X11/20100630)
MIME-Version: 1.0
To: Julian Elischer <julian@freebsd.org>
References: <201012101050.45214.jhb@freebsd.org>	<201012150938.44217.jhb@freebsd.org>	<4D0992B5.7060005@freebsd.org>
	<201012160940.58116.jhb@freebsd.org> <4D0A54A8.90901@freebsd.org>
In-Reply-To: <4D0A54A8.90901@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org, Sergey Babkin <babkin@verizon.net>
Subject: Re: Realtime thread priorities
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 17 Dec 2010 06:20:46 -0000

Julian Elischer wrote:
> On 12/16/10 6:40 AM, John Baldwin wrote:
>> On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
>>> John Baldwin wrote:
>>>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
>>>>> John Baldwin wrote:
>>>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
>>>>>>> John Baldwin wrote:
>>>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
>>>>>>>>> John Baldwin wrote:
>>>>>>>>>> The current layout breaks up the global thread priority space
>>>>>>>>>> (0 - 255) into a couple of bands:
>>>>>>>>>>
>>>>>>>>>>    0 -  63 : interrupt threads
>>>>>>>>>>   64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>>>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>>>>>>> 160 - 223 : time-sharing user threads
>>>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>>>>>>
>>>>>>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>>>>>>
>>>>>>>>>> 1) (easy) just move the real-time priority range above the 
>>>>>>>>>> kernel sleep
>>>>>>>>>> priority range
>>>>>>>>> Would not this cause a priority inversion when an RT process
>>>>>>>>> enters the kernel mode?
>>>>>>>> How so?  Note that timesharing threads are not "bumped" to a 
>>>>>>>> kernel sleep
>>>>>>>> priority when they enter the kernel either.  The kernel sleep 
>>>>>>>> priorities are
>>>>>>>> purely a way for certain sleep channels to cause a thread to be 
>>>>>>>> treated as
>>>>>>>> interactive and give it a priority boost to favor interactive 
>>>>>>>> threads.
>>>>>>>> Threads in the kernel do not automatically have higher priority 
>>>>>>>> than threads
>>>>>>>> not in the kernel.  Keep in mind that all stopped threads 
>>>>>>>> (threads not
>>>>>>>> executing) are always in the kernel when they stop.
>>>>>>> I have requirement to make a thread running in kernel has more 
>>>>>>> higher
>>>>>>> priority over a thread running userland code, because our kernel
>>>>>>> mutex is not sleepable which does not like Solaris did, I have to 
>>>>>>> use
>>>>>>> semaphore like code in kern_umtx.c to lock a chain, which allows me
>>>>>>> to read and write user address space, this is how umtxq_busy() did,
>>>>>>> but it does not prevent a userland thread from preempting a thread
>>>>>>> which locked the chain, if a realtime thread preempts a thread
>>>>>>> locked the chain, it may lock up whole processes using pthread.
>>>>>>> I think our realtime scheduling is not very useful, it is too easy
>>>>>>> to lock up system.
>>>>>> Users are not forced to use rtprio.  They choose to do so, and 
>>>>>> they have to
>>>>>> be root to enable it (either directly or by extending root 
>>>>>> privileges via
>>>>>> sudo or some such).  Just because you don't have a use case for it 
>>>>>> doesn't
>>>>>> mean that other people do not.  Right now there is no way possible 
>>>>>> to say
>>>>>> that a given userland process is more important than 'sshd' (or 
>>>>>> any other
>>>>>> daemon) blocked in poll/select/kevent waiting for a packet.  
>>>>>> However, there
>>>>>> are use cases where other long-running userland processes are in 
>>>>>> fact far
>>>>>> more important than sshd (or similar processes such as getty, etc.).
>>>>>>
>>>>> You still don't answer me about how to avoid a time-sharing thread
>>>>> holding a critical kernel resource which preempted by a user RT 
>>>>> thread,
>>>>> and later the RT thread requires the resource, but the time-sharing
>>>>> thread has no chance to run because another RT thread is dominating
>>>>> the CPU because it is doing CPU bound work, result is deadlock, 
>>>>> even if
>>>>> you know you trust your RT process, there are many code which were
>>>>> written by you, i.e the libc and any other libraries using threading
>>>>> are completely not ready for RT use.
>>>>> How ever let a thread in kernel have higher priority over a thread
>>>>> running userland code will fix such a deadlock in kernel.
>>>> Put another way, the time-sharing thread that I don't care about 
>>>> (sshd, or
>>>> some other monitoring daemon, etc.) is stealing a resource I care about
>>>> (time, in the form of CPU cycles) from my RT process that is 
>>>> critical to
>>>> getting my work done.
>>>>
>>>> Beyond that a few more points:
>>>>
>>>> - You are ignoring "tools, not policy".  You don't know what is in 
>>>> my binary
>>>>    (and I can't really tell you).  Assume for a minute that I'm not 
>>>> completely
>>>>    dumb and can write userland code that is safe to run at this high 
>>>> of a
>>>>    priority level.  You already trust me to write code in the kernel 
>>>> that runs
>>>>    at even higher priority now. :)
>>>> - You repeatedly keep missing (ignoring?) the fact that this is 
>>>> _optional_.
>>>>    Users have to intentionally decide to enable this, and there are 
>>>> users who
>>>>    do _need_ this functionality.
>>>> - You have also missed that this has always been true for idprio 
>>>> processes
>>>>    (and is in fact why we restrict idprio to root), so this is not 
>>>> "new".
>>>> - Finally, you also are missing that this can already happen _now_ 
>>>> for plain
>>>>    old time sharing processes if the thread holding the resource 
>>>> doesn't ever
>>>>    do a sleep that raises the priority.
>>>>
>>>> For example, if a time-sharing thread with some typical priority >=
>>>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode 
>>>> lock for
>>>> that file (if it is unlocked) and hold that lock while its priority is >=
>>>> PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet 
>>>> that wakes
>>>> up sshd for a new SSH connection, the interrupt thread will preempt the
>>>> thread holding the vnode lock, and sshd will be executed instead of the
>>>> thread holding the vnode lock when the ithread finishes.  If sshd 
>>>> needs the
>>>> vnode lock that the original thread holds, then sshd will block 
>>>> until the
>>>> original thread is rescheduled due to the random fates of time and 
>>>> releases
>>>> the vnode lock.
>>>>
>>>> In summary, the kernel sleep priorities do _not_ serve to prevent all
>>>> priority inversions, what they do accomplish is giving preferential 
>>>> treatment
>>>> to idle, "interactive" threads.
>>>>
>>>> A bit more information on my use case btw:
>>>>
>>>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we 
>>>> remove the
>>>> CPU from the global cpuset and ensure no interrupts are routed to 
>>>> that CPU).
>>>> The problem I have is that if my RT process blocks on a lock (e.g. a 
>>>> lock on a
>>>> VM object during a page fault), then I want the RT thread to lend 
>>>> its RT
>>>> priority to the thread that holds the lock over on another CPU so 
>>>> that the lock
>>>> can be released as quickly as possible.  This use case is perfectly 
>>>> safe (the
>>>> RT thread is not preempting other threads, instead other threads are 
>>>> partitioned
>>>> off into a separate set of available CPUs).  What I need is to 
>>>> ensure that the
>>>> syncer or pagedaemon or whoever holds the lock I need gets a chance 
>>>> to run right
>>>> away when it holds a lock that I need.
>>>>
>>> What I meant is that whenever thread is in kernel mode, it always has
>>> higher priority over thread running user code, and all threads in kernel
>>> mode may have same priority except those interrupt threads which
>>> has higher priority, but this should be carefully designed to use
>>> mutex and spinlock between interrupt threads and other threads,
>>> mutex uses turnstile to propagate priority, spin lock disables
>>> interrupt, otherwise there still is priority inversion in kernel, i.e
>>> rwlock, sx lock.
>> Except that this isn't really true.  Really, if a thread is asleep in
>> select() or poll() or kevent(), what critical resource is it holding?  
>> I had
>> the same view originally when the current set of priorities was set up.
>> However, I've had to change it since I now have a real-world use case for
>> rtprio.
>>
>> First, I think this is the easy part of the argument:  Can you agree 
>> that if
>> a RT process is in the kernel, it should have priority over a TS 
>> process in
>> the kernel?  Thus, if a RT process blocks in the kernel, it would need to
>> lend enough of a priority to the lock holder to preempt any TS process 
>> in the
>> kernel, yes?  If so, that argues for RT processes in the kernel having a
>> higher priority than all the other kernel sleep priorities.
>>
>> The second part is harder, and that is what happens when a RT process 
>> is in
>> userland.  First, some food for thought.  Do you realize that 
>> currently, the
>> syncer and pagedaemon threads run at PVM?  This is intentional so that 
>> these
>> processes run in the "background" even though they are in the kernel.
>> Specifically, when sshd does wakeup from a sleep at PSOCK or the like, 
>> the
>> kernel doesn't just let it run in the kernel, it effectively lets it keep
>> that PSOCK priority in userland until the next context switch due to an
>> interrupt or the quantum expiring.  This means that when you ssh into 
>> a box,
>> your interactive typing ends up preempting syncer and pagedaemon.
>> And
>> this is a good thing, because syncer and pagedaemon are _background_
>> processes.  Preempting them only for the kernel portion of sshd (as the
>> change to userret in both your proposal and my original #2 would do) 
>> would
>> not really favor interactive processes because the user relies on the
>> userland portion of an interactive process to run, too (userland is 
>> the part
>> that echos back the characters as they are typed).  So even now, with TS
>> threads, we have TS userland code that is _more important_ than code 
>> in the
>> kernel.  Another example is the idlezero kernel process.  This is kernel
>> code, but is easily far less important than pretty much all userland 
>> code.
>> Kernel code is _not_ always more important than userland code.  It 
>> often is,
>> but it sometimes isn't.  If you can accept that, then it is no longer 
>> strange
>> to consider that even the userland code in a RT process is more important
>> than kernel code in a TS process.
>>
>> In our case we do chew up a lot of CPU in userland for our RT 
>> processes, but
>> we handle this case by using dedicated CPUs.  Our RT processes really 
>> are the
>> most important processes on the box.
>>
> 
> I have to agree with John on this one..
> The real-time property for threads is a dangerous tool with which we allow a
> system "Administrator" (i.e. someone with root) to do some things.
> It is perfectly understood that doing the WRONG thing will negatively
> impact the system (maybe even make it unworkable). However the decision to
> set a process to realtime mode means that the Administrator has decided 
> that
> that process/thread is more important than everything else in the system.
> One could argue about whether this applies to interrupts, but in the 
> modern day
> of even cell phones having multiple processors, it gets harder and harder
> to make the case that userland code should not be able to pre-empt
> or block kernel code.
> 
> I think this philosophy has always been true..  As Terry Lambert used to 
> say
> at the beginning of the project: Unix's job is to deliver the bullet to
> wherever the
> user wants to put it, including the user's foot.  When you are the 
> administrator
> you get to have  a pretty big foot.
> 
> In addition, many of FreeBSD's 'Users' are in fact producers of 'product'
> boxes.
> They know EXACTLY what is running on the system, and where, and want the 
> ability
> to label a process in the way that John shows.  For them it is the 
> primary purpose
> of the box to do task X and doing task X comes before all other tasks, 
> possibly even
> non related interrupts.
> 
> Julian
> 

The main problem is correctness, not whether root may use it.
I know it is his machine and he can do whatever he wants with it. :-)
I have to repeat myself:
The question is whether the kernel can schedule RT threads correctly.
It cannot.
The fact is that many of our lock primitives are not RT safe: lockmgr,
sx locks, rwlocks and other locks built on msleep/wakeup either do not
propagate priority or do not protect the owner's priority, so they are
all subject to priority inversion.
Also, PPQ = 4 (four priority levels sharing one run queue) is incorrect
for RT scheduling; it is another kind of priority inversion.
So what can we do here? Where mutexes and spin locks cannot be used,
the kernel should either raise the thread's priority to a high enough
level, or give all threads equal priority while they are in the kernel.
If future changes cannot fix the above problems, those changes
are nonsense.