Date: Thu, 13 Mar 2008 13:47:56 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Peter Wemm
Cc: arch@freebsd.org, David Xu
Subject: Re: amd64 cpu_switch in C.
Message-ID: <20080313124213.J31200@delplex.bde.org>
References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org>

On Wed, 12 Mar 2008, Peter Wemm wrote:

> On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote:
>> Jeff Roberson wrote:
>> > http://people.freebsd.org/~jeff/amd64.diff
>>
>> This is a good idea.

I wouldn't have expected it to make much difference.  On i386 UP,
cpu_switch() normally executes only 48 instructions for in-kernel
context switches in my version of 5.2, and only 61 instructions in
-current.  ~5.2 differs from 5.2 here only in not having to switch
%eflags.  That saves 4 instructions but considerably more in cycles,
especially on the P4, where accesses to %eflags are very slow.  So
plain 5.2 would take 52 instructions, and -current has bloated by 9
instructions relative to 5.2.

In-kernel switches are not a very typical case, since they don't load
%cr3.  The 50-60 instructions might take as few as 20 cycles when
pipelined through 3 ALUs, but they are only moderately parallelizable,
so they would take more like 50-60 cycles on an Athlon.  The only very
slow instructions in the usual in-kernel case are the loads of %eflags
and %gs.  At least the latter is easy to optimize away; the former is
associated with spin locks hard-disabling interrupts.  For userland
context switches, there is also an ltr in the usual path of execution.

But the 100 or so cycles for the simple instructions are noise
compared with the cost of the TLB flush and the other cache misses
caused by loading %cr3 on userland context switches.  Userland code
that does useful work will do more than sched_yield(), so it will
suffer even more from cache misses.
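For concreteness, the kind of microbenchmark behind numbers like the
ones traded below is a yield ping-pong.  This is only my minimal
reconstruction, not Jeff's or Peter's actual harness, and it assumes a
single-CPU configuration (like the kern.smp.disabled=1 setup mentioned
below) so that the two processes ping-pong on one runqueue:

/*
 * Yield ping-pong: parent and child each call sched_yield() in a
 * loop.  On a single CPU, every yield forces a full context switch
 * through the scheduler and cpu_switch(), so the per-iteration time
 * approximates the total cost discussed above (upper layers, the
 * simple instructions, and the %cr3 reload with its cache misses).
 */
#include <sys/types.h>
#include <sys/wait.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define	NITER	1000000

int
main(void)
{
	struct timespec t0, t1;
	double ns;
	pid_t child;
	int i;

	if ((child = fork()) == 0) {
		for (i = 0; i < NITER; i++)
			sched_yield();
		_exit(0);
	}
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NITER; i++)
		sched_yield();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	waitpid(child, NULL, 0);
	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.0f ns per yield\n", ns / NITER);
	return (0);
}

Compile with "cc -O2" and run it once per kernel; the before/after
ratio is the kind of percentage quoted below.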
Layers above cpu_switch() have become very bloated, and they make a
full context switch take several hundred cycles for the simple
instructions on machines where the simple instructions in cpu_switch()
take only 100.  Their overhead may almost be significant relative to
the cache misses.  However, this is another reason why the speed of
the simple instructions in cpu_switch() doesn't matter.

>> In fact, according to the calling convention, some registers do not
>> need to be saved across a function call, e.g., on i386: eax, edx,
>> and ecx. :-)  But gdb may need them to dig out stack variables'
>> values.

The asm code already saves only the call-saved registers, on both i386
and amd64 (%ebx, %esi, %edi and %ebp on i386; %rbx, %rbp and %r12-%r15
on amd64).  It saves call-saved registers even when it apparently
doesn't use them (there are lots more of these on amd64, while on i386
it uses more call-saved registers than it needs to, apparently because
this is free after saving all the call-saved registers).  I think
saving more than is needed is the result of confusion about what needs
to be saved and/or what is needed for debugging.

> Jeff and I have been having a friendly "competition" today.
>
> With a UP kernel and INVARIANTS, my initial counter-patch response
> had nearly double the gain on my machine (Jeff: 7%, mine: 13.5%).
> I changed to compiling kernels the same way he did (no INVARIANTS,
> an SMP kernel, but kern.smp.disabled=1).  After that, our patch sets
> were the same again - both at about a 10% gain over baseline.
>
> I've made a few more changes and am now at a 23% improvement over
> baseline.
>
> I'm not confident of the testing methodology.  More tests are in
> progress.
>
> The good news is that this tuning is finally being done.  It should
> have been done in 2003, though...

How is this possible, with (according to my theory) most of the
context-switch cost being in %cr3 and the upper layers?  Unchanged
amd64 has only a few more costs than i386: mainly 3 unconditional
wrmsr's and 2 unconditional rdmsr's for managing the gsbase and
fsbase.  I thought that these were hard to avoid, and anyway not
nearly as expensive as %cr3 loads.

Bruce
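To make those 5 MSR accesses concrete, here is a minimal sketch of the
save/restore they perform.  This is a reconstruction, not the actual
code (which lives in sys/amd64/amd64/cpu_switch.S): the MSR numbers
are the standard architectural ones, the pcb field names are made up
for the sketch, and the reason guessed for the third wrmsr is that a
%gs selector reload clobbers the kernel's gsbase.

/*
 * Per-switch fsbase/gsbase traffic as counted above: 2 rdmsr's to
 * save the outgoing thread's user bases, 3 wrmsr's to install the
 * incoming thread's.  Kernel-only code: rdmsr/wrmsr are privileged.
 */
#include <stdint.h>

#define	MSR_FSBASE	0xc0000100	/* user %fs base */
#define	MSR_GSBASE	0xc0000101	/* current %gs base (kernel pcpu) */
#define	MSR_KGSBASE	0xc0000102	/* user %gs base, pending swapgs */

static inline uint64_t
rdmsr(uint32_t msr)
{
	uint32_t lo, hi;

	__asm __volatile("rdmsr" : "=a" (lo), "=d" (hi) : "c" (msr));
	return ((uint64_t)hi << 32 | lo);
}

static inline void
wrmsr(uint32_t msr, uint64_t v)
{
	__asm __volatile("wrmsr" : : "c" (msr), "a" ((uint32_t)v),
	    "d" ((uint32_t)(v >> 32)));
}

struct pcb_bases {		/* hypothetical pcb fragment */
	uint64_t pcb_fsbase;
	uint64_t pcb_gsbase;
};

static void
switch_user_bases(struct pcb_bases *oldpcb,
    const struct pcb_bases *newpcb, uint64_t pcpu)
{
	/* Save the outgoing thread's user bases (2 rdmsr's). */
	oldpcb->pcb_fsbase = rdmsr(MSR_FSBASE);
	oldpcb->pcb_gsbase = rdmsr(MSR_KGSBASE);

	/*
	 * Install the incoming thread's bases, and rewrite the
	 * kernel's own gsbase (the per-CPU data pointer), which a
	 * %gs selector reload clobbers (3 wrmsr's).
	 */
	wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
	wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);
	wrmsr(MSR_GSBASE, pcpu);
}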