Date: Sat, 15 Oct 2005 08:28:04 +1000 (EST)
From: Bruce Evans
To: Poul-Henning Kamp
Cc: Garrett Wollman, Andrew Gallatin, net@freebsd.org
Subject: Re: Call for performance evaluation: net.isr.direct (fwd)
Message-ID: <20051015074316.T1260@epsplex.bde.org>
In-Reply-To: <12907.1129286370@critter.freebsd.dk>

On Fri, 14 Oct 2005, Poul-Henning Kamp wrote:

> In message <20051014192509.F80520@delplex.bde.org>, Bruce Evans writes:
>> The timestamps in mi_switch() are taken on the same CPU and only their
>> differences are used, so they don't even need to be synced.
>> If they use the TSC, then the TSCs just need to have the same
>> almost-constant frequency (or different almost-constant frequencies
>> if timecounters were per-CPU).
>
> Actually, I think we need to go back a step further.
>
> The task of the scheduler is to hand out a finite resource according
> to a set policy.
>
> The finite resource is "instructions executed by a CPU".
>
> It used to be that CPUs ran at constant clock rates, and therefore
> implementors made the simplifying assumption that
>
>     instructions = a * time
>
> for some random but constant a and made their scheduling decisions
> based on time.

This is currently moot for p_runtime.  p_runtime is not used, at least
not for kernel scheduling.  It is only used by userland (mostly for
users to look at?).  Schedulers use only ticks set periodically by
sched_clock().  They should use p_runtime, given that we already pay the
enormous cost of setting it on every normal interrupt.

> Today CPUs do not run on constant rates but they have counters which
> count the number of instruction cycles.  Therefore talking about
> computer effort in terms of "CPU second" is like selling rubber
> band by the inch.
>
> The scheduler has a side job of accounting for CPU usage and the
> API for accessing this info has unfortunately been specified in
> terms of time rather than instructions.

I disagree.  Time is the only useful metric for users, and scheduling is
fuzzy so it doesn't really care.  Scheduling needs an approximation of
resource usage that can be obtained very efficiently.  Its tick counts
are very efficient and are precise enough even at 100Hz, but they
aren't accurate enough since applications can hide from statistics clock
ticks either accidentally or intentionally.
statclock was supposed to fix this but never really did, especially with
too-large values of HZ like 1000 -- with hz > stathz it is easy to use a
periodic itimer to arrange to run about (hz - stathz) / hz of the time
without ever seeing a statclock tick.

Bruce
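
A toy model of the (hz - stathz) / hz estimate above, sketched in Python.
It is not kernel code.  It assumes, purely for illustration, that statclock
ticks are evenly spaced with a fixed phase, that the process wakes on a
periodic itimer once per hardclock slot of length 1/hz, and that it can
tell in advance which slots contain a statclock tick and sleeps through
exactly those.  The function name and the phase parameter are hypothetical.

```python
from fractions import Fraction


def hidden_run_fraction(hz, stathz, phase=Fraction(1, 2)):
    """Return (fraction of time spent running, statclock ticks charged)."""
    # Statclock tick j fires at (j + phase) / stathz seconds.  Record which
    # hardclock slot [k/hz, (k+1)/hz) each tick lands in.
    tick_slots = [int((j + phase) * hz / stathz) for j in range(stathz)]
    blocked = set(tick_slots)

    # The process wakes on every itimer expiry (once per hardclock slot) and
    # runs for the whole slot unless a statclock tick is due inside it.
    running = {slot for slot in range(hz) if slot not in blocked}

    # A statclock tick is charged only if it fires while the process runs.
    charged = sum(1 for slot in tick_slots if slot in running)
    return Fraction(len(running), hz), charged


frac, charged = hidden_run_fraction(hz=1000, stathz=128)
print(float(frac), charged)
```

With hz = 1000 and stathz = 128 the model runs in 872 of the 1000 slots
and is charged no statclock ticks.  With hz <= stathz every slot contains a
tick, so under the same assumptions there is nowhere to hide.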