From owner-freebsd-performance@FreeBSD.ORG  Wed Jun 14 06:25:47 2006
Date: Wed, 14 Jun 2006 16:22:46 +1000 (EST)
From: Bruce Evans <bde@zeta.org.au>
X-X-Sender: bde@epsplex.bde.org
To: kmacy@fsmware.com
In-Reply-To: <b1fa29170606132015p654e2877s1ec1da6184ce672e@mail.gmail.com>
Message-ID: <20060614133024.E1753@epsplex.bde.org>
References: <20060612195754.72452.qmail@web33306.mail.mud.yahoo.com>
	<20060612210723.K26068@fledge.watson.org>
	<20060612203248.GA72885@xor.obsecurity.org>
	<200606130715.52425.davidxu@freebsd.org>
	<20060613105930.N34121@fledge.watson.org>
	<b1fa29170606132015p654e2877s1ec1da6184ce672e@mail.gmail.com>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: Scott Long <scottl@samsco.org>, kmacy@freebsd.org, Paul Saab <ps@mu.org>,
	Robert Watson <rwatson@freebsd.org>, David Xu <davidxu@freebsd.org>,
	Kris Kennaway <kris@obsecurity.org>,
	freebsd-performance@freebsd.org, danial_thom@yahoo.com
Subject: Re: Initial 6.1 questions

On Tue, 13 Jun 2006, Kip Macy wrote:

> ...
> Why do I say "non-interrupt blocking?". Currently we have roughly a
> half dozen locking primitives. The two that I am familiar with are
> blocking and spinning mutexes. The general policy is to use blocking
> locks except where a lock is used in interrupts or the scheduler. It
> seems to me that in the scheduler interrupts only actually need to be
> blocked across cpu_switch. Spin locks obviously have to be used
> because a thread cannot very well context switch while it's in the
> middle of context switching - however, provided td_critnest > 0, there
> is no reason that interrupts need to be blocked. Currently sched_lock
> is acquired in cpu_hardclock and statclock - so it does need to block
> interrupts. There is no reason that these two functions couldn't be
> run in ast().

These functions are called from "fast" interrupt handlers, so they
cannot use sleep locks.  They also cannot be run in ast(), since ast()
is only run on return to user mode and uses sleep locks a lot.  Gathering
of some user-mode statistics could be deferred until return to user
mode, but this wouldn't work for kernel-mode statistics, since return
to user mode never happens for threads that never leave the kernel.
Large changes would also be required for the user-mode statistics:
- algorithmic changes: various, mainly to keep kernel-mode separate.
- locking: ast() uses sched_lock, so without large changes you would
   just move the problem (there would be up to hz + stathz extra calls
   to ast() per second).  The statistics fields are all locked by
   sched_lock, and although this would not be needed for access in
   ast(), some locking would still be needed for the many fields which
   are accessed from elsewhere.

What they (and all fast interrupt handlers or even "fast" interrupt
handlers) can do better is use spin locks != sched_lock (and for fast
interrupt handlers, != mtx_lock_spin(any)).  This is not easy to do
in general, and is especially difficult for clock interrupt handlers,
because all accesses to data accessed by a fast interrupt handler must
be locked by a common lock (especially outside of the handlers) and
clock interrupt handlers access a lot of data.  Currently, clock
interrupt handlers use sched_lock and depend on sched_lock being used
too much so that most of the data accessed by clock interrupt handlers
is locked automatically.  Even then, there are large gaps in the locking.
E.g., hardclock() starts by calling tc_ticktock() which mostly uses
very delicate time-domain locking but sometimes races with syscalls
that use sleep locking, most frequently by calling ntp_update_second().
Most of kern_ntptime.c is documented (in comments) as being required
to run at splclock() or higher, but it is actually all locked only by
Giant, so sched_lock'ing and other spinlocking for it is neither
necessary nor sufficient, and calling it correctly from a "fast" interrupt
handler is impossible.
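
A rough userland sketch may help show why lock-free sharing between a
clock interrupt handler and syscall code is delicate.  It uses a
generation count: the writer marks the data invalid, updates it, then
publishes a new generation, and the reader retries if the generation
changed underneath it.  This is only an illustration of the general
pattern, not the code in kern_tc.c.  The names time_snap, snap_init(),
snap_update() and snap_read() are hypothetical, and a single writer is
assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical time snapshot; NOT the kernel's timecounter structures. */
struct time_snap {
	atomic_uint gen;		/* 0 means "update in progress" */
	_Atomic uint64_t sec;
	_Atomic uint64_t frac;
};

static void
snap_init(struct time_snap *ts)
{
	atomic_init(&ts->sec, 0);
	atomic_init(&ts->frac, 0);
	atomic_init(&ts->gen, 1);	/* nonzero: snapshot is valid */
}

/* Writer: stands in for the clock interrupt handler (single writer). */
static void
snap_update(struct time_snap *ts, uint64_t sec, uint64_t frac)
{
	unsigned int ogen = atomic_load(&ts->gen);

	assert(ogen != 0);		/* no update may already be in progress */
	atomic_store(&ts->gen, 0);	/* invalidate before touching the data */
	atomic_store(&ts->sec, sec);
	atomic_store(&ts->frac, frac);
	if (++ogen == 0)		/* skip the reserved value on wraparound */
		ogen = 1;
	atomic_store(&ts->gen, ogen);	/* publish the new generation */
}

/* Reader: stands in for a syscall; takes no lock, retries if it raced. */
static void
snap_read(struct time_snap *ts, uint64_t *sec, uint64_t *frac)
{
	unsigned int gen;

	do {
		gen = atomic_load(&ts->gen);
		*sec = atomic_load(&ts->sec);
		*frac = atomic_load(&ts->frac);
	} while (gen == 0 || gen != atomic_load(&ts->gen));
}
```

The sketch uses sequentially consistent atomics throughout to keep the
reasoning simple.  It only works because every access to the shared
fields follows the protocol.  A second writer that takes some unrelated
sleep lock instead would break it, which is the kind of gap described
above.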

In my kernel, fast interrupt handlers (and associated non-handler code
that shares data) are actually fast (== low-latency &&
!(very-large-footprint || takes-very-long)).  This requires:
- mtx_lock_spin() to not mask interrupts, since masking interrupts gives
   !low-latency at least in the UP case.
- fast interrupt handlers to not use sched_lock, since sched_lock gives
   very-large-footprint.
- fast interrupt handlers to not rely on mtx_lock_spin() alone, since that
   no longer masks them.  My implementation actually uses simple_locks plus
   explicit per-cpu interrupt disabling (as in RELENG_4).  This also avoids
   having to turn off features like WITNESS and KTR which don't honor the
   rules for fast interrupt handlers.
- fast interrupt handlers to not use normal scheduling (things like
   swi_sched()), since that uses sched_lock and is generally very
   inefficient.  My implementation uses a combination of timeouts
   and a hack to metamorphose into a SWI handler.  The latter is a
   very expensive operation and should be avoided.  swi_sched() encourages
   this inefficiency except in the SWI_DELAY case.  The SWI_DELAY case
   only takes 50-100 times as many instructions as corresponding
   scheduling in RELENG_4.  SWI_DELAY seems to be unused except in
   my drivers.  My implementation enforces non-use of normal scheduling
   and of some other invalid data accesses (e.g., to curthread) by
   unmapping PCPU data in fast interrupt handlers.
- clock interrupt handlers to not be fast interrupt handlers.  They
   have far too large a footprint to be fast interrupt handlers.  Locking
   them is hard enough when they are only "fast" interrupt handlers.
   I made them normal interrupt handlers and don't support "fast" interrupt
   handlers.
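
The "simple lock plus explicit per-cpu interrupt disabling" combination
in the list above can be sketched in userland roughly as follows.  This
is a simulation under stated assumptions, not kernel code: the
thread-local flag intr_enabled stands in for the CPU's interrupt-enable
state, and simple_lock_sim, fast_lock(), fast_unlock(), stat_ticks and
stat_add() are hypothetical names.  The point is the ordering: disable
interrupts on the local CPU first (to keep the handler out), then spin
on the lock (to keep other CPUs out).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Simulated per-CPU interrupt-enable state (hypothetical stand-in). */
static _Thread_local bool intr_enabled = true;

/* Hypothetical minimal spin lock; it does not touch interrupts itself. */
struct simple_lock_sim {
	atomic_flag held;
};

/* Disable "interrupts" locally, then spin for the lock. */
static bool
fast_lock(struct simple_lock_sim *l)
{
	bool saved = intr_enabled;

	intr_enabled = false;		/* keep the local handler out first */
	while (atomic_flag_test_and_set_explicit(&l->held,
	    memory_order_acquire))
		;			/* spin: another CPU holds the lock */
	return (saved);
}

/* Drop the lock, then restore the previous "interrupt" state. */
static void
fast_unlock(struct simple_lock_sim *l, bool saved)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
	intr_enabled = saved;
}

/* Data shared between a fast handler and ordinary code. */
static struct simple_lock_sim stat_lock = { ATOMIC_FLAG_INIT };
static unsigned long stat_ticks;

/* Usable from either side; in a handler, saved is simply false. */
static void
stat_add(unsigned long n)
{
	bool saved = fast_lock(&stat_lock);

	assert(!intr_enabled);		/* handler cannot run here */
	stat_ticks += n;
	fast_unlock(&stat_lock, saved);
}
```

Saving and restoring the previous state, rather than unconditionally
re-enabling, lets the same helper be called with interrupts already
disabled, as they would be inside a handler.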

I get very few benefits from this.  Normal interrupt handlers for
clocks are inefficient.  They don't take very long, but switching to
them is inefficient.  I get lower interrupt latency, but this is
not very important now that CPUs are very fast compared with i/o
for all devices that I have.  I get the possibility of simpler
locking in clock interrupt handlers, but haven't simplified or fixed
their locking.  I get enforced smallness and complexity for fast
interrupt handlers since large ones would be too complicated and
normal scheduling and locking cannot be used.

Bruce