From owner-freebsd-arch@FreeBSD.ORG Fri Dec 17 06:20:46 2010
Message-ID: <4D0B013F.3060203@freebsd.org>
Date: Fri, 17 Dec 2010 14:20:47 +0800
From: David Xu
To: Julian Elischer
Cc: arch@freebsd.org, Sergey Babkin
Subject: Re: Realtime thread priorities
In-Reply-To: <4D0A54A8.90901@freebsd.org>
References: <201012101050.45214.jhb@freebsd.org> <201012150938.44217.jhb@freebsd.org> <4D0992B5.7060005@freebsd.org> <201012160940.58116.jhb@freebsd.org> <4D0A54A8.90901@freebsd.org>
List-Id: Discussion related to FreeBSD architecture

Julian Elischer wrote:
> On 12/16/10 6:40 AM, John Baldwin wrote:
>> On Wednesday, December 15, 2010 11:16:53 pm David Xu wrote:
>>> John Baldwin wrote:
>>>> On Tuesday, December 14, 2010 8:40:12 pm David Xu wrote:
>>>>> John Baldwin wrote:
>>>>>> On Monday, December 13, 2010 8:30:24 pm David Xu wrote:
>>>>>>> John Baldwin wrote:
>>>>>>>> On Sunday, December 12, 2010 3:06:20 pm Sergey Babkin wrote:
>>>>>>>>> John Baldwin wrote:
>>>>>>>>>> The current layout breaks up the global thread
>>>>>>>>>> priority space (0 - 255) into a couple of bands:
>>>>>>>>>>
>>>>>>>>>>   0 -  63 : interrupt threads
>>>>>>>>>>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
>>>>>>>>>> 128 - 159 : real-time user threads (rtprio)
>>>>>>>>>> 160 - 223 : time-sharing user threads
>>>>>>>>>> 224 - 255 : idle threads (idprio and kernel idle procs)
>>>>>>>>>>
>>>>>>>>>> If we decide to change the behavior I see two possible fixes:
>>>>>>>>>>
>>>>>>>>>> 1) (easy) just move the real-time priority range above the
>>>>>>>>>>    kernel sleep priority range
>>>>>>>>> Would not this cause a priority inversion when an RT process
>>>>>>>>> enters the kernel mode?
>>>>>>>> How so? Note that timesharing threads are not "bumped" to a kernel
>>>>>>>> sleep priority when they enter the kernel either. The kernel sleep
>>>>>>>> priorities are purely a way for certain sleep channels to cause a
>>>>>>>> thread to be treated as interactive and give it a priority boost to
>>>>>>>> favor interactive threads. Threads in the kernel do not
>>>>>>>> automatically have higher priority than threads not in the kernel.
>>>>>>>> Keep in mind that all stopped threads (threads not executing) are
>>>>>>>> always in the kernel when they stop.
>>>>>>> I have requirement to make a thread running in kernel has more
>>>>>>> higher priority over a thread running userland code, because our
>>>>>>> kernel mutex is not sleepable which does not like Solaris did, I
>>>>>>> have to use semaphore like code in kern_umtx.c to lock a chain,
>>>>>>> which allows me to read and write user address space, this is how
>>>>>>> umtxq_busy() did, but it does not prevent a userland thread from
>>>>>>> preempting a thread which locked the chain, if a realtime thread
>>>>>>> preempts a thread locked the chain, it may lock up whole processes
>>>>>>> using pthread.
>>>>>>> I think our realtime scheduling is not very useful, it is too easy
>>>>>>> to lock up system.
>>>>>> Users are not forced to use rtprio. They choose to do so, and they
>>>>>> have to be root to enable it (either directly or by extending root
>>>>>> privileges via sudo or some such). Just because you don't have a use
>>>>>> case for it doesn't mean that other people do not. Right now there
>>>>>> is no way possible to say that a given userland process is more
>>>>>> important than 'sshd' (or any other daemon) blocked in
>>>>>> poll/select/kevent waiting for a packet. However, there are use
>>>>>> cases where other long-running userland processes are in fact far
>>>>>> more important than sshd (or similar processes such as getty, etc.).
>>>>>>
>>>>> You still don't answer me about how to avoid a time-sharing thread
>>>>> holding a critical kernel resource which preempted by a user RT
>>>>> thread, and later the RT thread requires the resource, but the
>>>>> time-sharing thread has no chance to run because another RT thread
>>>>> is dominating the CPU because it is doing CPU bound work, result is
>>>>> deadlock, even if you know you trust your RT process, there are many
>>>>> code which were written by you, i.e the libc and any other libraries
>>>>> using threading are completely not ready for RT use.
>>>>> How ever let a thread in kernel have higher priority over a thread
>>>>> running userland code will fix such a deadlock in kernel.
>>>> Put another way, the time-sharing thread that I don't care about
>>>> (sshd, or some other monitoring daemon, etc.) is stealing a resource
>>>> I care about (time, in the form of CPU cycles) from my RT process
>>>> that is critical to getting my work done.
>>>>
>>>> Beyond that a few more points:
>>>>
>>>> - You are ignoring "tools, not policy". You don't know what is in my
>>>>   binary (and I can't really tell you).
>>>>   Assume for a minute that I'm not completely dumb and can write
>>>>   userland code that is safe to run at this high of a priority level.
>>>>   You already trust me to write code in the kernel that runs at even
>>>>   higher priority now. :)
>>>> - You repeatedly keep missing (ignoring?) the fact that this is
>>>>   _optional_. Users have to intentionally decide to enable this, and
>>>>   there are users who do _need_ this functionality.
>>>> - You have also missed that this has always been true for idprio
>>>>   processes (and is in fact why we restrict idprio to root), so this
>>>>   is not "new".
>>>> - Finally, you also are missing that this can already happen _now_
>>>>   for plain old time sharing processes if the thread holding the
>>>>   resource doesn't ever do a sleep that raises the priority.
>>>>
>>>> For example, if a time-sharing thread with some typical priority >=
>>>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock
>>>> for that file (if it is unlocked) and hold that lock while it's
>>>> priority is >= PRI_MIN_TIMESHARE. If an interrupt arrives for a
>>>> network packet that wakes up sshd for a new SSH connection, the
>>>> interrupt thread will preempt the thread holding the vnode lock, and
>>>> sshd will be executed instead of the thread holding the vnode lock
>>>> when the ithread finishes. If sshd needs the vnode lock that the
>>>> original thread holds, then sshd will block until the original thread
>>>> is rescheduled due to the random fates of time and releases the vnode
>>>> lock.
>>>>
>>>> In summary, the kernel sleep priorities do _not_ serve to prevent all
>>>> priority inversions, what they do accomplish is giving preferential
>>>> treatment to idle, "interactive" threads.
>>>>
>>>> A bit more information on my use case btw:
>>>>
>>>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we
>>>> remove the CPU from the global cpuset and ensure no interrupts are
>>>> routed to that CPU). The problem I have is that if my RT process
>>>> blocks on a lock (e.g. a lock on a VM object during a page fault),
>>>> then I want the RT thread to lend its RT priority to the thread that
>>>> holds the lock over on another CPU so that the lock can be released
>>>> as quickly as possible. This use case is perfectly safe (the RT
>>>> thread is not preempting other threads, instead other threads are
>>>> partitioned off into a separate set of available CPUs). What I need
>>>> is to ensure that the syncer or pagedaemon or whoever holds the lock
>>>> I need gets a chance to run right away when it holds a lock that I
>>>> need.
>>>>
>>> What I meant is that whenever thread is in kernel mode, it always has
>>> higher priority over thread running user code, and all threads in
>>> kernel mode may have same priority except those interrupt threads
>>> which has higher priority, but this should be carefully designed to
>>> use mutex and spinlock between interrupt threads and other threads,
>>> mutex uses turnstile to propagate priority, spin lock disables
>>> interrupt, otherwise there still is priority inversion in kernel, i.e
>>> rwlock, sx lock.
>> Except that this isn't really true. Really, if a thread is asleep in
>> select() or poll() or kevent(), what critical resource is it holding?
>> I had the same view originally when the current set of priorites were
>> setup. However, I've had to change it since I now have a real-world
>> use case for rtprio.
>>
>> First, I think this is the easy part of the argument: Can you agree
>> that if a RT process is in the kernel, it should have priority over a
>> TS process in the kernel?
>> Thus, if a RT process blocks in the kernel, it would need to lend
>> enough of a priority to the lock holder to preempt any TS process in
>> the kernel, yes? If so, that argues for RT processes in the kernel
>> having a higher priority than all the other kernel sleep priorities.
>>
>> The second part is harder, and that is what happens when a RT process
>> is in userland. First, some food for thought. Do you realize that
>> currently, the syncer and pagedaemon threads run at PVM? This is
>> intentional so that these processes run in the "background" even
>> though they are in the kernel. Specifically, when sshd does wakeup
>> from a sleep at PSOCK or the like, the kernel doesn't just let it run
>> in the kernel, it effectively lets it keep that PSOCK priority in
>> userland until the next context switch due to an interrupt or the
>> quantum expiring. This means that when you ssh into a box, the your
>> interactive typing ends up preempting syncer and pagedaemon. And this
>> is a good thing, because syncer and pagedaemon are _background_
>> processes. Preempting them only for the kernel portion of sshd (as the
>> change to userret in both your proposal and my original #2 would do)
>> would not really favor interactive processes because the user relies
>> on the userland portion of an interactive process to run, too
>> (userland is the part that echos back the characters as they are
>> typed). So even now, with TS threads, we have TS userland code that is
>> _more important_ than code in the kernel. Another example is the
>> idlezero kernel process. This is kernel code, but is easily far less
>> important than pretty much all userland code. Kernel code is _not_
>> always more important than userland code. It often is, but it
>> sometimes isn't. If you can accept that, then it is no longer strange
>> to consider that even the userland code in a RT process is more
>> important than kernel code in a TS process.
>>
>> In our case we do chew up a lot of CPU in userland for our RT
>> processes, but we handle this case by using dedicated CPUs. Our RT
>> processes really are the most important processes on the box.
>>
>
> I have to agree with John on this one..
> The real-time property for threads is a dangerous tool which we allow a
> system "Administrator" (i.e. someone with root,) to do some things.
> It is perfectly understood that doing the WRONG thing will negatively
> impact the system (maybe even make it unworkable). However the decision
> to set a process to realtime mode means that the Administrator has
> decided that that process/thread is more important than everything else
> in the system. One could argue about whether this applies to
> interrupts, but in the modern day of even cell phones having multiple
> processors, it gets harder and harder to make the case that userland
> code should not be able to pre-empt or block kernel code.
>
> I think this philosophy has always been true.. As Terry Lambert used to
> say at the beginning of the project: Unix's job is to deliver the
> bullet to wherever the user wants to put it, including the user's foot.
> When you are the administrator you get to have a pretty big foot.
>
> In addition many of FreeBSD's 'Users' are in fact producers of
> 'product' boxes. They know EXACTLY what is running on the system, and
> where, and want the ability to label a process in the way that John
> shows. For them it is the primary purpose of the box to do task X and
> doing task X comes before all other tasks, possibly even non related
> interrupts.
>
> Julian
>

The main problem is correctness, not whether root may use it. I know it
is his machine and he can do whatever he wants with it. :-)

I have to repeat: the question is whether the kernel can schedule RT
threads correctly. The answer is no.
Many of our lock primitives are not RT safe: lockmgr, sx locks, rwlocks
and other locks built on msleep/wakeup either do not propagate priority
or do not protect it, so they are all subject to priority inversion.
PPQ = 4 is also incorrect for RT scheduling; it is another kind of
priority inversion.

So what can we do here? Where mutexes and spin locks cannot be used, we
should either raise the thread's priority to a high enough level, or
give all threads equal priority while they are in the kernel. If future
changes cannot fix the above problems, those changes are nonsense.
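
To make the PPQ point concrete, here is a minimal sketch. It assumes the
run queue index is simply priority / PPQ and that threads sharing one
queue are picked in FIFO order; the helper names are made up for
illustration and are not the kernel's own.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed for illustration: 256 global priorities, lower value = more urgent. */
#define PRI_COUNT 256
/* Priorities per run queue, the "PPQ = 4" discussed above. */
#define PPQ 4

/* Hypothetical helper: which run queue a given priority lands in. */
static int
runq_index(int pri)
{
	assert(pri >= 0 && pri < PRI_COUNT);
	return (pri / PPQ);
}

/*
 * Hypothetical helper: two different priorities that share a queue are
 * ordered by arrival (assumed FIFO), not by priority, so the more
 * urgent thread can end up waiting behind the less urgent one.
 */
static bool
order_can_invert(int pri_a, int pri_b)
{
	return (pri_a != pri_b && runq_index(pri_a) == runq_index(pri_b));
}
```

Under these assumptions RT priorities 128 and 131 fall into the same
queue, so the scheduler cannot tell them apart, while 128 and 132 land
in different queues and are ordered correctly.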