From owner-freebsd-arch@FreeBSD.ORG  Thu Dec 16 04:55:23 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 04A60106566C;
	Thu, 16 Dec 2010 04:55:23 +0000 (UTC)
	(envelope-from davidxu@freebsd.org)
Received: from freefall.freebsd.org (freefall.freebsd.org
	[IPv6:2001:4f8:fff6::28])
	by mx1.freebsd.org (Postfix) with ESMTP id DCE588FC16;
	Thu, 16 Dec 2010 04:55:22 +0000 (UTC)
Received: from xyf.my.dom (localhost [127.0.0.1])
	by freefall.freebsd.org (8.14.4/8.14.4) with ESMTP id oBG4tLrx086579;
	Thu, 16 Dec 2010 04:55:21 GMT (envelope-from davidxu@freebsd.org)
Message-ID: <4D099BBC.7050200@freebsd.org>
Date: Thu, 16 Dec 2010 12:55:24 +0800
From: David Xu <davidxu@freebsd.org>
User-Agent: Thunderbird 2.0.0.24 (X11/20100630)
MIME-Version: 1.0
To: Daniel Eischen <deischen@freebsd.org>
References: <201012101050.45214.jhb@freebsd.org>	<201012140756.52926.jhb@freebsd.org>	<4D081C7C.5040407@freebsd.org>
	<201012150938.44217.jhb@freebsd.org>
	<Pine.GSO.4.64.1012151115350.27084@sea.ntplx.net>
In-Reply-To: <Pine.GSO.4.64.1012151115350.27084@sea.ntplx.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: arch@freebsd.org
Subject: Re: Realtime thread priorities
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
	<mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 16 Dec 2010 04:55:23 -0000

Daniel Eischen wrote:
> On Wed, 15 Dec 2010, John Baldwin wrote:
>>
>> Put another way, the time-sharing thread that I don't care about (sshd, or
>> some other monitoring daemon, etc.) is stealing a resource I care about
>> (time, in the form of CPU cycles) from my RT process that is critical to
>> getting my work done.
>>
>> Beyond that a few more points:
>>
>> - You are ignoring "tools, not policy".  You don't know what is in my binary
>>  (and I can't really tell you).  Assume for a minute that I'm not completely
>>  dumb and can write userland code that is safe to run at this high of a
>>  priority level.  You already trust me to write code in the kernel that runs
>>  at even higher priority now. :)
>> - You repeatedly keep missing (ignoring?) the fact that this is _optional_.
>>  Users have to intentionally decide to enable this, and there are users who
>>  do _need_ this functionality.
>> - You have also missed that this has always been true for idprio processes
>>  (and is in fact why we restrict idprio to root), so this is not "new".
>> - Finally, you also are missing that this can already happen _now_ for plain
>>  old time sharing processes if the thread holding the resource doesn't ever
>>  do a sleep that raises the priority.
>>
>> For example, if a time-sharing thread with some typical priority >=
>> PRI_MIN_TIMESHARE calls write(2) on a file, it can lock the vnode lock for
>> that file (if it is unlocked) and hold that lock while its priority is >=
>> PRI_MIN_TIMESHARE.  If an interrupt arrives for a network packet that wakes
>> up sshd for a new SSH connection, the interrupt thread will preempt the
>> thread holding the vnode lock, and sshd will be executed instead of the
>> thread holding the vnode lock when the ithread finishes.  If sshd needs the
>> vnode lock that the original thread holds, then sshd will block until the
>> original thread is rescheduled due to the random fates of time and releases
>> the vnode lock.
>>
>> In summary, the kernel sleep priorities do _not_ serve to prevent all
>> priority inversions, what they do accomplish is giving preferential treatment
>> to idle, "interactive" threads.
>>
>> A bit more information on my use case btw:
>>
>> My RT processes are each assigned a _dedicated_ CPU via cpuset (we remove the
>> CPU from the global cpuset and ensure no interrupts are routed to that CPU).
>> The problem I have is that if my RT process blocks on a lock (e.g. a lock on a
>> VM object during a page fault), then I want the RT thread to lend its RT
>> priority to the thread that holds the lock over on another CPU so that the lock
>> can be released as quickly as possible.  This use case is perfectly safe (the
>> RT thread is not preempting other threads, instead other threads are partitioned
>> off into a separate set of available CPUs).  What I need is to ensure that the
>> syncer or pagedaemon or whoever holds the lock I need gets a chance to run right
>> away when it holds a lock that I need.
> 
> And speaking as a developer that writes applications that require
> real-time priorities, all of the above is a good summary.  As such
> a developer, I don't use real-time priorities to make applications
> run faster, have more throughput, get more work done, or anything
> like that.  It is to attempt to meet real world deadlines.  Our
> applications do not busy the CPU, they block mostly, waking up for
> and handling events - both periodic and aperiodic.  We know our
> applications run real-time, so we try to be as efficient as possible.
> If there is something more CPU intensive, then perhaps we'll have
> another lower priority thread/process to handle that task.  The
> important thing is that we need to meet or respond to a time-
> critical event.
> 
> We do expect that our real-time threads will run over time
> sharing or other lower priority threads, and that priority
> will be propagated for any contested OS locks.  In our situation,
> it is acceptable to starve low priority tasks, though we do
> design the applications to avoid that.
> 

I am not objecting to RT scheduling. I only said that the kernel is
not ready for RT use because it has priority inversions. As an example
of my support, I even wrote code to implement a priority-inheritance
pthread mutex for libthr, which exists for RT programming.
But the kernel has priority inversions, and because of them it will
not meet time-critical requirements even if you configure the machine
properly. This cannot be fixed by the proposed priority range
adjustment.
What I have said and done is an attempt to find a way to fix the
priority inversion problem in the kernel. I know that having msleep
raise the priority is a hack. If all user threads in kernel mode had
the same priority, higher than that of threads in user mode, the
priority raising done by msleep could be eliminated, and realtime
scheduling for a user thread would still work once it returned to
user mode, as I said in another reply.