From: Bruce Evans <brde@optusnet.com.au>
Date: Fri, 14 Mar 2008 00:34:27 +1100 (EST)
To: Jeff Roberson
Cc: arch@freebsd.org, Peter Wemm, David Xu
Subject: Re: amd64 cpu_switch in C.

On Wed, 12 Mar 2008, Jeff Roberson wrote:

> On Thu, 13 Mar 2008, Bruce Evans wrote:
>
>> On Wed, 12 Mar 2008, Peter Wemm wrote:
>>
>>> On Tue, Mar 11, 2008 at 9:14 PM, David Xu wrote:
>>>> Jeff Roberson wrote:
>>>> > http://people.freebsd.org/~jeff/amd64.diff
>>>>
>>>> This is a good idea.
>>
>> I wouldn't have expected it to make much difference.  On i386 UP,
>> cpu_switch() normally executes only 48 instructions for in-kernel
>> context switches in my version of 5.2 and only 61 instructions in
>> -current.  ~5.2 differs from 5.2 here only in not having to switch
>> %eflags.  This saves 4 instructions but much more in cycles,
>> especially on P4 where accesses to %eflags are very slow.  5.2 would
>> take 52 instructions, and -current has bloated by 9 instructions
>> relative to 5.2.
>
> More expensive than the raw instruction count is:
>
> 1) The mispredicted branches to deal with all of the optional state
> and features that are not always saved.

This is unlikely to matter, and apparently doesn't, at least in simple
benchmarks, since the C version has even more branches.  Features that
are rarely used cause branches that are usually perfectly predicted.

> 2) The cost of extra icache for getting over all of those unused
> instructions, unaligned jumps, etc.

Again, if this were the cause of slowness then it would affect the C
version more, since the C version is larger.  In fact, the benchmark
is probably too simple to show the cost of branches.

Just doing sched_yield() in a loop (sketched below) gives the
following atypical behaviour, which may be atypical enough for the
larger branch and cache costs of the C version to not have much
effect:
- it doesn't go near most of the special cases, so branches are
  predictable (always non-special) and are thus predicted, provided
  (a) the CPU actually does reasonably good branch prediction, and
  (b) the branch predictions fit in the branch prediction cache
  (reasonably good branch prediction probably requires such a cache).
- it doesn't touch much icache or dcache or branch-cache, so
  everything probably stays cached.  If just the branch-cache were
  thrashed, then reasonably good dynamic branch prediction would be
  impossible and things would be slow.
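The ./yield source isn't shown in this thread; a minimal sketch of
what such a benchmark presumably looks like (the argument meanings --
iterations, then repeats -- are assumed from the invocations quoted
later):

%%%
/*
 * Hypothetical reconstruction of the ./yield microbenchmark:
 * call sched_yield() in a tight loop and report the mean cost
 * per call.  Usage: ./yield <iterations> <repeats>.
 */
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int
main(int argc, char **argv)
{
	struct timespec t0, t1;
	double dt;
	int i, n, r, reps;

	n = argc > 1 ? atoi(argv[1]) : 1000000;
	reps = argc > 2 ? atoi(argv[2]) : 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (r = 0; r < reps; r++)
		for (i = 0; i < n; i++)
			sched_yield();
	clock_gettime(CLOCK_MONOTONIC, &t1);
	dt = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
	printf("%.0f ns per sched_yield()\n",
	    dt * 1e9 / ((double)n * reps));
	return (0);
}
%%%

With a single runnable thread per CPU, nearly all of what this
measures is the syscall path plus the in-kernel context switch, which
is why the perfmon counters later are normalized per call.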
In the C version, you use predict_true() and predict_false() a lot.
This might improve static branch prediction, but it makes little
difference if the branch cache is working.

The C version uses lots of non-inline function calls.  Just the
branches for these would have a significant overhead if the branches
were mispredicted.  I think you are depending on gcc's auto-inlining
of static functions which are only called once to avoid the full cost
of the function calls.

> I haven't looked at i386 very closely lately but on amd64 the wrmsrs
> for fs/gsbase are very expensive.  On my 2ghz dual core opteron the
> optimized switch seems to take about 100ns.  The total switch from
> userspace to userspace is about 4x that.

Probably avoiding these is the only significant difference between
all the versions.  You use predict_false() for executing them.  Are
fsbase and gsbase really usually constant across processes?
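Schematically, the msr avoidance is just a compare before the
expensive write, marked unlikely with the same branch hints (a sketch
only; the pcb field names and msr constants follow the usual amd64
conventions but are assumptions here, not code from the diff):

%%%
#include <machine/cpufunc.h>	/* wrmsr() */
#include <machine/pcb.h>	/* struct pcb */
#include <machine/specialreg.h>	/* MSR_FSBASE, MSR_KGSBASE */

static __inline void
switch_base_msrs(struct pcb *oldpcb, struct pcb *newpcb)
{
	/*
	 * Skip the wrmsr's when the bases are unchanged across the
	 * switch.  The hardware doesn't short-circuit a write of an
	 * unchanged value, so the comparison must be done in
	 * software.  The user gsbase lives in the KGSBASE msr while
	 * the kernel is running (swapgs).
	 */
	if (__predict_false(newpcb->pcb_fsbase != oldpcb->pcb_fsbase))
		wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
	if (__predict_false(newpcb->pcb_gsbase != oldpcb->pcb_gsbase))
		wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);
}
%%%

The predict_false() hints encode the expectation that the bases are
usually unchanged, which is exactly the question raised above.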
400 ns is about what I get for i386 on a 2.2GHz A64 UP too (6.17 s for
./yield 1000000 10).  getpid() on this machine takes 180 ns, so it is
unreasonable to expect sched_yield() to take much less than a few
hundred ns.

Some perfmon output for ./yield 100000 10:

% # s/kx-ls-microarchitectural-resync-by-self-mod-code
% 0
% # s/kx-ls-buffer2-full
% 909905
% # s/kx-ls-retired-cflush-instructions
% 0
% # s/kx-ls-retired-cpuid-instructions
% 0
% # s/kx-dc-accesses
% 496436422
% # s/kx-dc-misses
% 11102024

11 dcache misses per yield.  Probably the main cause of slowness (main
memory latency on this machine is 42 ns, so 11 cache misses take 462
of the 617 ns per call?).

% # s/kx-dc-refills-from-l2
% 0
% # s/kx-dc-refills-from-system
% 0
% # s/kx-dc-writebacks
% 0
% # s/kx-dc-l1-dtlb-miss-and-l2-dtlb-hits
% 3459100
% # s/kx-dc-l1-and-l2-dtlb-misses
% 2138231
% # s/kx-dc-misaligned-references
% 87
% # s/kx-dc-microarchitectural-late-cancel-of-an-access
% 73146415
% # s/kx-dc-microarchitectural-early-cancel-of-an-access
% 236927303
% # s/kx-bu-cpu-clk-unhalted
% 1303921314
% # s/kx-ic-fetches
% 236207869
% # s/kx-ic-misses
% 22988

Insignificant icache misses.

% # s/kx-ic-refill-from-l2
% 18979
% # s/kx-ic-refill-from-system
% 4191
% # s/kx-ic-l1-itlb-misses
% 0
% # s/kx-ic-l1-l2-itlb-misses
% 1619297
% # s/kx-ic-instruction-fetch-stall
% 1034570822
% # s/kx-ic-return-stack-hit
% 20822416
% # s/kx-ic-return-stack-overflow
% 5870
% # s/kx-fr-retired-instructions
% 701240247
% # s/kx-fr-retired-ops
% 1163464391
% # s/kx-fr-retired-branches
% 121636370
% # s/kx-fr-retired-branches-mispredicted
% 2761910
% # s/kx-fr-retired-taken-branches
% 93488548
% # s/kx-fr-retired-taken-branches-mispredicted
% 2848315

2.8 branches mispredicted per call.

% # s/kx-fr-retired-far-control-transfers
% 2000934

1 int0x80 and 1 iret per sched_yield(), and apparently not much else.

% # s/kx-fr-retired-resync-branches
% 936968
% # s/kx-fr-retired-near-returns
% 19008374
% # s/kx-fr-retired-near-returns-mispredicted
% 784103

0.8 returns mispredicted per call.

% # s/kx-fr-retired-taken-branches-mispred-by-addr-miscompare
% 721241
% # s/kx-fr-interrupts-masked-cycles
% 658462615

Ugh, this is from spinlocks bogusly masking interrupts.  More than
half the cycles have interrupts masked.  This at least shows that lots
of time is being spent near cpu_switch() with a spinlock held.

% # s/kx-fr-interrupts-masked-while-pending-cycles
% 9365

Since the CPU is reasonably fast, interrupts aren't masked for very
long each time.  This maximum is still 4.5 us.

% # s/kx-fr-hardware-interrupts
% 63
% # s/kx-fr-decoder-empty
% 247898696
% # s/kx-fr-dispatch-stalls
% 589228741
% # s/kx-fr-dispatch-stall-from-branch-abort-to-retire
% 39894120
% # s/kx-fr-dispatch-stall-for-serialization
% 44037193
% # s/kx-fr-dispatch-stall-for-segment-load
% 134520281

134 cycles per call.  This may be more for ones in syscall()
generally.  I think each segreg load still costs ~20 cycles.  Since
this is on i386, there are 6 per call (%ds, %es and %fs save and
restore), plus %ss save and restore, which might not be counted here.
134 is a lot -- about 60 ns of the 180 ns for getpid().

% # s/kx-fr-dispatch-stall-when-reorder-buffer-is-full
% 18648001
% # s/kx-fr-dispatch-stall-when-reservation-stations-are-full
% 121485247
% # s/kx-fr-dispatch-stall-when-fpu-is-full
% 19
% # s/kx-fr-dispatch-stall-when-ls-is-full
% 203578275
% # s/kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet
% 63136307
% # s/kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending
% 6994131

>> In-kernel switches are not a very typical case since they don't
>> load %cr3...
>
> We've been working on amd64 so I can't comment specifically about
> i386 costs.  However, I definitely agree that cpu_switch() is not the
> greatest overhead in the path.  Also, you have to load cr3 even for
> kernel threads because the page directory page or page directory
> pointer table at %cr3 can go away once you've switched out the old
> thread.

I don't see this.  The switch is avoided if %cr3 wouldn't change,
which I think usually or always happens for switches between kernel
threads.
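Schematically, that avoidance is just a compare of the old and new
page-table roots before the reload (a sketch; rcr3() and load_cr3()
are the usual cpufunc.h accessors, but pcb_cr3 as the location of the
saved root is an assumption here):

%%%
#include <machine/cpufunc.h>	/* rcr3(), load_cr3() */
#include <machine/pcb.h>	/* struct pcb */

static __inline void
switch_cr3(struct pcb *newpcb)
{
	/*
	 * Reload %cr3 (flushing non-global TLB entries) only when the
	 * incoming thread's page-table root differs from the current
	 * one.  Kernel threads share the kernel's page tables, so the
	 * reload is normally skipped when switching between them.
	 */
	if (newpcb->pcb_cr3 != rcr3())
		load_cr3(newpcb->pcb_cr3);
}
%%%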
>> The asm code already saves only call-saved registers for both i386
>> and amd64.  It saves call-saved registers even when it apparently
>> doesn't use them (lots more of these on amd64, while on i386 it
>> uses more call-saved registers than it needs to, apparently since
>> this is free after saving all call-saved registers).  I think
>> saving more than is needed is the result of confusion about what
>> needs to be saved and/or what is needed for debugging.
>
> It has to save all of the callee saved registers in the PCB because
> they will likely differ from thread to thread.  Failing to save and
> restore them could leave you returning with the registers having
> different values and corrupt the calling function.

Yes, I had forgotten the detail of how the non-local flow of control
can change the registers (the next call to the function in the context
of the switched-to process may have different values in the registers
due to changes to the registers in callers).  All that can be done
differently here is saving all the registers on the stack (except
%esp) in the usual way.  This would probably be faster on old i386's
using pushal or pushl, but on amd64 pushal is not available, and on
Athlons generally (before Barcelona?) it is faster not to use pushl,
so on amd64 the registers should be saved using movq, and then it is
just as easy to put them in the pcb as on the stack.

>>> The good news is that this tuning is finally being done.  It
>>> should have been done in 2003 though...
>>
>> How is this possible with (according to my theory) most of the
>> context switch cost being for %cr3 and upper layers?  Unchanged
>> amd64 has only a few more costs than i386.  Mainly 3 unconditional
>> wrmsr's and 2 unconditional rdmsr's for managing gsbase and fsbase.
>> I thought that these were hard to avoid and anyway not nearly as
>> expensive as %cr3 loads.
>
> %cr3 is actually a lot less expensive these days with page table
> flush filters and the PG_G bit.  We were able to optimize away
> setting the msrs in the case that the previous values match the new
> values.  Apparently the hardware doesn't optimize this case so we
> have to do comparisons ourselves.
>
> That was a big chunk of the optimization.  Static branch hints,
> reordering code, possibly reordering for better pipeline scheduling
> in peter's asm, etc. provide the rest.

All the old i386 asm, and probably the clones of it on amd64, is
certainly not optimized globally for anything newer than an i386
(barely even an i486).  This rarely matters, however.  It lost more on
Pentium-1's, but now out-of-order execution and better branch
prediction hide most inefficiencies.

Bruce