From owner-freebsd-arch@FreeBSD.ORG  Tue Dec 28 19:58:23 2010
Return-Path: <owner-freebsd-arch@FreeBSD.ORG>
Delivered-To: freebsd-arch@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 99C0C1065670
	for <freebsd-arch@freebsd.org>; Tue, 28 Dec 2010 19:58:23 +0000 (UTC)
	(envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 5C95D8FC08
	for <freebsd-arch@freebsd.org>; Tue, 28 Dec 2010 19:58:23 +0000 (UTC)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id EF11F46B06
	for <freebsd-arch@freebsd.org>; Tue, 28 Dec 2010 14:58:22 -0500 (EST)
Received: from jhbbsd.localnet (smtp.hudson-trading.com [209.249.190.9])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id E1D208A009
	for <freebsd-arch@freebsd.org>; Tue, 28 Dec 2010 14:58:21 -0500 (EST)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-arch@freebsd.org
Date: Tue, 28 Dec 2010 14:58:21 -0500
User-Agent: KMail/1.13.5 (FreeBSD/7.3-CBSD-20101102; KDE/4.4.5; amd64; ; )
References: <201012101050.45214.jhb@freebsd.org>
In-Reply-To: <201012101050.45214.jhb@freebsd.org>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201012281458.21413.jhb@freebsd.org>
Subject: Re: Realtime thread priorities
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>

On Friday, December 10, 2010 10:50:45 am John Baldwin wrote:
> So I finally had a case today where I wanted to use rtprio but it doesn't seem 
> very useful in its current state.  Specifically, I want to be able to tag 
> certain user processes as being more important than any other user processes 
> even to the point that if one of my important processes blocks on a mutex, the 
> owner of that mutex should be more important than sshd being woken up from 
> sbwait by new data (for example).  This doesn't work currently with rtprio due 
> to the way the priorities are laid out (and I believe I probably argued for 
> the current layout back when it was proposed).
> 
> The current layout breaks up the global thread priority space (0 - 255) into a 
> couple of bands:
> 
>   0 -  63 : interrupt threads
>  64 - 127 : kernel sleep priorities (PSOCK, etc.)
> 128 - 159 : real-time user threads (rtprio)
> 160 - 223 : time-sharing user threads
> 224 - 255 : idle threads (idprio and kernel idle procs)
> 
> The problem I am running into is that when a time-sharing thread goes to sleep 
> in the kernel (waiting on select, socket data, tty, etc.) it actually ends up 
> in the kernel priorities range (64 - 127).  This means when it wakes up it 
> will trump (and preempt) a real-time user thread even though these processes 
> nominally have a priority down in the 160 - 223 range.  We do drop the kernel 
> sleep priority during userret(), but we don't recheck the scheduler queues to 
> see if we should preempt the thread during userret(), so it effectively runs 
> with the kernel sleep priority for the rest of the quantum while it is in 
> userland.
> 
> My first question is if this behavior is the desired behavior?  Originally I 
> think I preferred the current layout because I thought a thread in the kernel 
> should always have priority so it can release locks, etc.  However, priority 
> propagation should actually handle the case of some very important thread 
> needing a lock.  In my use case today where I actually want to use rtprio I 
> think I want different behavior where the rtprio thread is more important than 
> the thread waking up with PSOCK, etc.
> 
> If we decide to change the behavior I see two possible fixes:
> 
> 1) (easy) just move the real-time priority range above the kernel sleep 
> priority range

I have forward-ported my original patch for 7 to 9 and fixed several other
nits I ran into along the way.  The updated patch is at
http://www.freebsd.org/~jhb/patches/rtpri.patch

I think it can probably be broken up into several pieces, at least some of
which should be non-controversial. :)

This patch makes the following changes:

- Give the USB kthreads a lower priority, in the range of software interrupt
  threads rather than hardware interrupt threads.
- Retire some unused ithread priorities: PI_TTYHIGH, PI_TAPE, and
  PI_DISKLOW.  While here, rename PI_TTYLOW to PI_TTY.  Also, add a macro
  PI_SWI() that takes a SWI_* constant as an argument and returns the
  suitable thread priority.
- In sched_yield(), only drop the priority of timeshare threads to
  PRI_MAX_TIMESHARE.  Non-timeshare threads retain whatever priority they
  currently have.
- Only apply a kernel sleep priority from tsleep() to timeshare threads.
  This only matters once realtime threads move to a new priority range,
  where it avoids penalizing realtime threads for sleeping.
- Explicitly set a sane initial priority (of PVM) for kthreads.  Right now
  new kthreads inherit whatever priority thread0 happens to have when they
  are created.  Since kthreads can be created from threads other than thread0
  this priority can be fairly random.  In practice, I've seen many kthreads
  created with an initial priority that is a hardware interrupt thread
  priority due to thread0 being lent an ithread priority.
- Add some helper macros to ULE to define the ranges used for interactive
  and non-interactive timeshare threads and fix some places that hardcoded
  assumptions about the location of the realtime priority range.
- Add a new option, ULE_INTERACTIVE_TIMESHARE (that should perhaps be on by
  default), for use in conjunction with moving realtime priorities.
  When this new option is in effect, ULE does not abuse realtime priorities
  for interactive timeshare threads.  Instead, the timeshare range is split
  into two ranges, one for interactive threads and one for non-interactive
  threads.  The non-interactive range is further divided into three ranges
  to add bands at the top and bottom for nice levels.  Combined with the
  other changes, the net effect is that interactive threads will have the
  same priority they have now (i.e. a band of 32 priorities in between
  kernel sleep priorities and non-interactive timeshare priorities) and
  that non-interactive threads now have a slightly larger band of
  priorities (32 priorities in the "middle" instead of 24 with additional
  bands of 20 above and below for nice values).
- Never boost the priority of a thread via tsleep() if the passed-in
  priority is zero.  Zero means "don't change the priority", but ULE was
  still giving a boost in certain cases.  In practice I suspect this rarely,
  if ever, triggered.
- Always apply the requested sleep priority to kthreads.  Certain kernel
  processes such as pagedaemon, etc. rely on tsleep() to lower the priority
  of the kproc so that it is treated as a background task when it is idle.
  The static_boost code in ULE would never lower the priority due to a sleep,
  so once a kproc gained a higher priority via sleeping it would never be
  treated as a background task again.  This is especially problematic in the
  case that a kthread starts off with an ithread priority as noted above.
- Retire the PCONFIG kernel sleep priority.  We do not need a new priority
  level for boot time config hooks.  When PCONFIG was added, tsleep() did
  not support leaving the priority alone via 0, but now it does, so use
  that instead.
- Restore dropping the syncer kthread down to PPAUSE when it is idle.
- Drop the flowtable cleaner kthread down to PPAUSE when it is idle.
- Move the realtime priority range in between the interrupt thread and
  kernel sleep priority range.  Currently there is a small bit of overlap
  between 'rtprio 0' and both SWI_TQ and SWI_TQ_GIANT.  I hope to eliminate
  this by retiring SWI_TQ_FAST once interrupt filters are in place, as then
  SWI_TQ and SWI_TQ_GIANT can move up a slot.
  

-- 
John Baldwin