Date: Sun, 2 Dec 2007 10:47:55 +0000 (GMT)
From: Robert Watson <rwatson@FreeBSD.org>
To: Poul-Henning Kamp
Cc: Attilio Rao, arch@freebsd.org
Subject: Re: New "timeout" api, to replace callout

On Sun, 2 Dec 2007, Poul-Henning Kamp wrote:

> In message <3bbf2fe10712012231p2945111cma2faed2299167d3a@mail.gmail.com>,
> "Attilio Rao" writes:
>> 2007/12/1, Poul-Henning Kamp:
>>>
>>> Here is my proposed new timeout API for 8.x.
>>>
>>> The primary objective is to make it possible to have multiple timeout
>>> "providers" of possibly different kinds, so that we can have per-cpu or
>>> per-net-stack timeout handling.
>>
>> I have a question, then.
>
> I have no idea what the answer to your question is; I'm focusing on
> providing the ability, and how we subsequently decide to use it is up to
> others.

Well, I think there is an important question to be discussed here regarding
combinatorics, context switching, and the ability to provide multiple
callout threads.  People have found the facility to provide their own
worker threads and work pools surprisingly useful in taskqueue(9), so I
find the concept of providing separate callout wheels for different sorts
of work appealing -- we could, for example, group high-priority callouts in
a separate thread from low-priority callouts, avoiding priority inversion
scenarios in which high-priority callouts in effect wait for low-priority
callouts due to the serial processing that occurs in callout(9).  However,
this leads to a few concerns:

- If we have several wheels in several threads, we risk significantly
  increasing the level of context switching when callouts in multiple
  wheels fire at the same intervals and offsets.  Today, those "context
  switches" occur within a single thread and don't require interacting with
  the system scheduler, saving a full stack, etc; in effect they make
  callout handlers into co-routines.

- There has been quite a bit of discussion about effectively slapping
  [MAXCPUS] onto the current callout wheel and lock, and starting up one
  callout thread per CPU so that workloads can be load-balanced.  If no CPU
  preference is specified, a callout lands on CPU 0 (or the like);
  otherwise a consumer can request that the callout run on a specific CPU.
  Good reasons to do this include avoiding lock contention by introducing
  affinities for workloads, and load balancing for heavy callout users.  I
  specifically have TCP in mind, needless to say, as it is one of our
  largest callout consumers.  How would this strategy play out in the new
  infrastructure -- are you proposing that TCP establish a thread and a
  group for each CPU, or is affinity/CPU binding a facility the timeout
  code will provide for it, allowing TCP simply to express a CPU preference
  for a timeout when registering or rescheduling it?  (See the sketch after
  this list for the sort of consumer I have in mind.)

- For more naive users of the timeout facility, do you have any thoughts on
  how we might load-balance timeouts as part of the facility you are
  designing?  On busy systems, the callout thread can become quite a CPU
  hog, and transparent load balancing may offer a benefit to consumers that
  don't know how to do their own load balancing.
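To make the second point concrete, here is roughly the shape of TCP
consumer I have in mind.  To be clear, this is purely a sketch: none of
these names -- timeout_group_create(), timeout_init(), timeout_reset_on(),
TIMEOUT_GROUP_PERCPU, or the t_affinity_cpu field -- exist today, and I'm
not assuming your proposal spells any of them this way.

	/*
	 * Hypothetical sketch only: the timeout_group_* and
	 * timeout_reset_on() names are invented for illustration, and are
	 * not part of the current callout(9) API or the proposed patch.
	 */
	static struct timeout_group *tcp_tg;

	/* At subsystem startup: one wheel and worker thread per CPU. */
	tcp_tg = timeout_group_create("tcp", TIMEOUT_GROUP_PERCPU);

	/* Per connection: associate the retransmit timer with the group. */
	timeout_init(&tp->tt_rexmt, tcp_tg);

	/*
	 * On (re)arming, express the connection's CPU preference so the
	 * handler runs on the CPU the connection has an affinity for.
	 */
	timeout_reset_on(&tp->tt_rexmt, tp->t_rxtcur, tcp_timer_rexmt, tp,
	    tp->t_affinity_cpu);

The key property is the last call: the consumer merely states a per-timeout
preference, and the provider owns the threads, so naive consumers can
ignore CPU placement entirely.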
FWIW, I believe that where we have a non-naive consumer, there are
significant benefits to allowing it to manage its own balancing, as it can
take into account data affinities, the potential for lock contention, and
so on.  I have plans early in the 8.x development cycle to break down the
pcbinfo locks and start balancing TCP work across CPUs via a weak affinity
model (processing can happen on other CPUs, but we prefer that it not, for
reasons of lock contention, cache cleanliness, etc).  In practice this
should also mean assigning the callouts for a TCP connection to run on the
CPU the connection has an affinity for, for exactly the same reasons.  One
way or another, then, I need the ability to do this in the next three
months, and I want to make sure that these plans are compatible with, and
ideally facilitated by, any reworking of the callout facility.

Robert N M Watson
Computer Laboratory
University of Cambridge