Date: Sun, 2 Dec 2007 10:47:55 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Poul-Henning Kamp
Cc: Attilio Rao, arch@freebsd.org
Subject: Re: New "timeout" api, to replace callout

On Sun, 2 Dec 2007, Poul-Henning Kamp wrote:

> In message <3bbf2fe10712012231p2945111cma2faed2299167d3a@mail.gmail.com>,
> "Attilio Rao" writes:
>> 2007/12/1, Poul-Henning Kamp:
>>>
>>> Here is my proposed new timeout API for 8.x.
>>>
>>> The primary objective is to make it possible to have multiple timeout
>>> "providers" of possibly different kinds, so that we can have per-cpu or
>>> per-net-stack timeout handling.
>>
>> I have a question, then.
>
> I have no idea what the answer to your question is; I'm focusing on
> providing the ability, and how we subsequently decide to use it is up to
> others.

Well, I think there is an important question to be discussed here regarding
combinatorics, context switching, and the ability to provide multiple
callout threads.  People have found the facility to provide their own
worker threads and work pools surprisingly useful in taskqueue(9), so I
find the concept of providing separate callout wheels for different sorts
of work appealing -- we could, for example, group high-priority callouts in
a separate thread from low-priority callouts, avoiding priority inversion
scenarios in which high-priority callouts in effect wait for low-priority
callouts due to the serial processing that occurs in callout(9).  However,
this leads to a few concerns:

- If we have several wheels in several threads, we risk significantly
  increasing the level of context switching when callouts in multiple
  wheels fire at the same intervals and offsets.  Today, those "context
  switches" occur within a single thread and don't require interacting with
  the system scheduler, saving a full stack, etc; in effect they make
  callout handlers into co-routines.

- There has been quite a bit of discussion about effectively slapping
  [MAXCPUS] onto the current callout wheel and lock, and starting up one
  callout thread per CPU so that workloads can be load-balanced.  If no CPU
  preference is specified, a callout lands on CPU 0 (or the like);
  otherwise a consumer can request that the callout run on a specific CPU.
  Good reasons to do this include avoiding lock contention by introducing
  affinities for workloads, and load balancing for heavy callout users.  I
  specifically have TCP in mind, needless to say, as it is one of our
  largest callout consumers.  How would this strategy play out in the new
  infrastructure -- are you proposing that TCP establish a thread and a
  group for each CPU, or is affinity/CPU binding a facility the timeout
  code will provide for it, allowing TCP simply to express a CPU preference
  for a timeout when registering or rescheduling it?  (See the sketch after
  this list for the sort of consumer I have in mind.)

- For more naive users of the timeout facility, do you have any thoughts on
  how we might load-balance timeouts as part of the facility you are
  designing?  On busy systems, the callout thread can become quite a CPU
  hog, and transparent load balancing may offer a benefit to consumers that
  don't know how to do their own load balancing.
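To make the second point concrete, here is roughly the shape of TCP
consumer I have in mind.  To be clear, this is purely a sketch: none of
these names -- timeout_group_create(), timeout_init(), timeout_reset_on(),
TIMEOUT_GROUP_PERCPU, or the t_affinity_cpu field -- exist today, and I'm
not assuming your proposal spells any of them this way.

	/*
	 * Hypothetical sketch only: the timeout_group_* and
	 * timeout_reset_on() names are invented for illustration, and are
	 * not part of the current callout(9) API or the proposed patch.
	 */
	static struct timeout_group *tcp_tg;

	/* At subsystem startup: one wheel and worker thread per CPU. */
	tcp_tg = timeout_group_create("tcp", TIMEOUT_GROUP_PERCPU);

	/* Per connection: associate the retransmit timer with the group. */
	timeout_init(&tp->tt_rexmt, tcp_tg);

	/*
	 * On (re)arming, express the connection's CPU preference so the
	 * handler runs on the CPU the connection has an affinity for.
	 */
	timeout_reset_on(&tp->tt_rexmt, tp->t_rxtcur, tcp_timer_rexmt, tp,
	    tp->t_affinity_cpu);

The key property is the last call: the consumer merely states a per-timeout
preference, and the provider owns the threads, so naive consumers can
ignore CPU placement entirely.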
FWIW, I believe that where we have a non-naive consumer, there are
significant benefits to allowing it to manage its own balancing, as it can
take into account data affinities, the potential for lock contention, and
so on.  I have plans early in the 8.x development cycle to break down the
pcbinfo locks and start balancing TCP work across CPUs via a weak affinity
model (processing can happen on other CPUs, but we prefer that it not, for
reasons of lock contention, cache cleanliness, etc).  In practice this
should also mean assigning the callouts for a TCP connection to run on the
CPU the connection has an affinity for, for exactly the same reasons.  One
way or another, then, I need the ability to do this in the next three
months, and I want to make sure that these plans are compatible with, and
ideally facilitated by, any reworking of the callout facility.

Robert N M Watson
Computer Laboratory
University of Cambridge