Date: Wed, 30 May 2007 06:11:21 +1000 (EST)
From: Bruce Evans
To: Jeff Roberson
Cc: arch@FreeBSD.org
Subject: Re: rusage breakdown and cpu limits.
In-Reply-To: <20070529105856.L661@10.0.0.1>
Message-ID: <20070530044158.H93160@delplex.bde.org>
References: <20070529105856.L661@10.0.0.1>

On Tue, 29 May 2007, Jeff Roberson wrote:

> I'm working with Attilio to break down rusage further to be per-thread
> in places where it is protected by the global scheduler lock.  To
> support this, I am interested in moving the rlimit cpulimit check into
> userret(), or perhaps ast().  Is there any reason why we need to check
> this on every context switch?  Any objections to moving it?  Eventually
> it will require a different lock from the one we obtain to call
> mi_switch().

Checking it on every userret() would be a good pessimization, and
scheduling ast()'s to check it occasionally wouldn't be far behind.
Syscalls occur many times more often than context switches.

Locking is needed for updating the runtime in mi_switch(), so it seems
best to check for the runtime becoming too large there while holding the
same lock.  This lock is currently sched_lock, and it covers the
following other fields:

Fields in the proc struct (or attached to it, not counting the thread
struct):
	p_stats->p_ru.ru_nvcsw
	p_stats->p_ru.ru_nivcsw
	p_stats->p_ru.rux_runtime (this)
	p_stats->p_ru.rux_uticks
	p_stats->p_ru.rux_iticks
	p_stats->p_ru.rux_sticks
	p_cpulimit (less volatile than the others)
	p_sflag (needs sched_lock more than the others?)
	p_flag (KSE only, invalid? -- needs PROC lock)

Related fields in the pcpu:
	switchtime
	switchticks

Related fields in the thread struct:
	td_uticks
	td_sticks
	td_iticks
	td_flags

Unrelated fields in the thread struct:
	td_critnest (in a KASSERT() only)
	td_owepreempt (in a KASSERT() only)
	td_generation
	td_sched (not yet properly locked)
	td_priority (this and following only in CTR*())
	td_inhibitors
	td_lockname
	td_wmesg

I don't see why you emphasized the cpulimit check, since it is the least
delicate part of the rusage maintenance done by mi_switch().  I don't see
how the locking requirements can be reduced significantly by keeping the
rusage in the thread struct more of the time (than the tick counts
already are) and only accumulating it into the proc struct when needed.
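For concreteness, here is a minimal sketch of that per-thread
accumulation pattern.  The field names follow the lists above; the helper
itself is hypothetical (it is not code from the tree), and it assumes the
caller holds whatever lock protects the proc-level counters (sched_lock
today).

	#include <sys/param.h>
	#include <sys/proc.h>
	#include <sys/resourcevar.h>

	/*
	 * Hypothetical sketch: drain one thread's tick counters into the
	 * per-process totals.  The caller is assumed to hold the lock that
	 * protects the proc-level counters (currently sched_lock).
	 */
	static void
	thread_drain_ticks(struct thread *td)
	{
		struct proc *p = td->td_proc;

		p->p_stats->p_ru.rux_uticks += td->td_uticks;
		p->p_stats->p_ru.rux_sticks += td->td_sticks;
		p->p_stats->p_ru.rux_iticks += td->td_iticks;
		td->td_uticks = 0;
		td->td_sticks = 0;
		td->td_iticks = 0;
	}

Wherever this drain happens, it still needs the same lock as the rest of
the accounting, which is the point of the preceding paragraph.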
There must be a context switch of `switchtime' here, and this seems to
require sched locking.  It doesn't help to accumulate the runtime into
the thread struct instead of into the rusage -- sched locking still seems
to be required to switch the runtime.  The core of this switch is:

	new_switchtime = cpu_ticks();
	delta_runtime = new_switchtime - PCPU_GET(switchtime);
	PCPU_SET(switchtime, new_switchtime);

and this must be done without letting any other thread run on the CPU.
This seems to require sched_lock.  After determining delta_runtime, we
can add it either to thread data for accumulation later or to more global
data (currently rusage).  This part doesn't need such strong locking.  We
currently add it to rusage since we already hold the lock for this, and
then checking the limit here also has low cost.

Note that in the above code, keeping `switchtime' in the pcpu is an
optimization (for space and perhaps for clarity).  It could be kept in
the thread that owns the CPU.  Then its locking requirements might be
clearer -- it must be switched at the same time as td_oncpu.  The latter
clearly requires sched_lock.  The code for the latter has moved from
mi_switch() to sched_switch().  Maybe you emphasized the cpulimit check
because this point wasn't clear -- things that strictly required
sched_lock were supposed to have all been moved to sched_switch()?

p_sflag and td_flags are only needed if we detect that the cpulimit has
been exceeded (to schedule an AST).  Moving the limit check to an AST
would still require sched locking for setting td_flags.  The cpulimit has
a low resolution, so it doesn't need to be checked very often.  The 4BSD
scheduler could easily check it every second or so in the main loop of
the scheduler (see the sketch at the end of this message).

Accesses in the CTR*()s and KASSERT()s complicate the analysis and the
removal of locks -- you don't want to have to acquire a lock just to
debug, and the debugging code also gets in the way of understanding the
locking, not to mention the function.

Bruce
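A rough sketch of the kind of low-frequency limit check mentioned above,
e.g. run every second or so from a scheduler loop rather than on every
mi_switch().  Field names follow the lists earlier in the message;
runtime_to_seconds() is a placeholder for whatever converts rux_runtime
(cpu ticks) to seconds, and the flag/AST handling is illustrative rather
than the actual kernel code.

	#include <sys/param.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>
	#include <sys/proc.h>
	#include <sys/resource.h>
	#include <sys/resourcevar.h>

	/* Placeholder: convert rux_runtime (cpu ticks) to whole seconds. */
	static rlim_t	runtime_to_seconds(uint64_t runtime);

	/*
	 * Hypothetical sketch: check the CPU limit at low frequency and, if
	 * it has been exceeded, ask the running thread to handle it in an
	 * AST.  sched_lock is held only around the counter read and the
	 * flag updates.
	 */
	static void
	check_cpulimit(struct proc *p, struct thread *td)
	{

		mtx_lock_spin(&sched_lock);
		if (p->p_cpulimit != RLIM_INFINITY &&
		    runtime_to_seconds(p->p_stats->p_ru.rux_runtime) >=
		    p->p_cpulimit) {
			p->p_sflag |= PS_XCPU;		/* record the overrun */
			td->td_flags |= TDF_ASTPENDING;	/* request an AST */
		}
		mtx_unlock_spin(&sched_lock);
	}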