Date: Fri, 14 Mar 2008 13:59:46 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Jeff Roberson <jroberson@chesapeake.net>
Cc: arch@freebsd.org, Peter Wemm <peter@wemm.org>, David Xu <davidxu@freebsd.org>
Subject: Re: amd64 cpu_switch in C.
Message-ID: <20080314132033.I34431@delplex.bde.org>
In-Reply-To: <20080313132152.Y1091@desktop>
References: <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <e7db6d980803120125y41926333hb2724ecd07c0ac92@mail.gmail.com> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org> <20080313132152.Y1091@desktop>
On Thu, 13 Mar 2008, Jeff Roberson wrote:

Please trim quotes more.

> On Fri, 14 Mar 2008, Bruce Evans wrote:
>
>> On Wed, 12 Mar 2008, Jeff Roberson wrote:
>>> More expensive than the raw instruction count is:
>>>
>>> 1) The mispredicted branches to deal with all of the optional state
>>> and features that are not always saved.
>>
>> This is unlikely to matter, and apparently doesn't, at least in simple
>> benchmarks, since the C version has even more branches.  Features that
>> are rarely used cause branches that are usually perfectly predicted.
>
> The C version has two fewer branches because it tests for two unlikely
> features together.  It has a few more branches than the in-cvs asm
> version and the same number of extra branches as Peter's asm version
> to support conditional gs/fsbase setting.  The other extra branches
> have to do with supporting cpu_switch() and cpu_throw() together.

Testing features together is probably best here, but it might not
always be.  Executing more branches might be faster because each
individual branch is easier to predict.
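For concreteness, a sketch of what such a combined test for two
rarely-used features looks like, using FreeBSD's __predict_false()
annotation (the flag and handler names here are made up for
illustration; they are not the ones in Jeff's patch):

    /* __predict_false() as defined in FreeBSD's <sys/cdefs.h>. */
    #define __predict_false(exp)    __builtin_expect((exp), 0)

    #define TDF_RARE_A      0x01    /* hypothetical rare-feature flags */
    #define TDF_RARE_B      0x02

    static void handle_rare_a(void) {}
    static void handle_rare_b(void) {}

    static void
    switch_optional_state(int td_flags)
    {
            /*
             * One almost-always-untaken branch guards both rare
             * features; the two inner tests only execute in the
             * rare case that one of the features is active.
             */
            if (__predict_false(td_flags & (TDF_RARE_A | TDF_RARE_B))) {
                    if (td_flags & TDF_RARE_A)
                            handle_rare_a();
                    if (td_flags & TDF_RARE_B)
                            handle_rare_b();
            }
    }

    int
    main(void)
    {
            switch_optional_state(0);  /* common case: one predicted branch */
            return (0);
    }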
>>> 2) The cost of extra icache for getting over all of those unused
>>> instructions, unaligned jumps, etc.
>>
>> Again, if this were the cause of slowness then it would affect the C
>> version more, since the C version is larger.
>
> The C version is not larger than the asm version at high optimization
> levels when you consider the total instruction count that is brought
> into the icache.  It's worth noting that my C version is slower in
> some cases other than the microbenchmark due to extra instructions for
> optimizations that don't matter.  Peter's asm version is tight enough
> that the extra compares don't cost more than the compacted code wins.
> The C version touches more distinct icache lines but makes up for it
> with other optimizations in the common case.

Are calls to rarely-called functions getting auto-inlined in your C
version?  The asm version doesn't worry about this.  Even with
auto-inlining of static functions that are only called once (a new
bugfeature in gcc-4.1 which breaks profiling and debugging), at some
optimization levels gcc will place code for the unusual case far away
so as not to pollute the i-cache in the usual case, although this may
cost an extra branch in the unusual case.  For rarely-called functions,
it must be better not to inline them either.

>> In fact, the benchmark is probably too simple to show the cost of
>> branches.  Just doing sched_yield() in a loop gives the following
>> atypical behaviour which may be atypical enough for the larger branch
>> and cache costs for the C version to not have much effect:
>> - it doesn't go near most of the special cases, so branches are
>>   predictable (always non-special) and are thus predicted provided
>>   (a) the CPU actually does reasonably good branch prediction, and
>>   (b) the branch predictions fit in the branch prediction cache
>>   (reasonably good branch prediction probably requires such a cache).
>
> This cache is surely virtual as it happens in the first few stages of
> the pipeline.  That means it's flushed on every switch.  We're
> probably coming in cold every time.

Which cache?  My perfmon results show that the branch cache is far from
cold.

>> The C version uses lots of non-inline function calls.  Just the
>> branches for this would have a significant overhead if the branches
>> are mispredicted.  I think you are depending on gcc's auto-inlining
>> of static functions which are only called once to avoid the full
>> cost of the function calls.
>
> I depend on it not inlining them to avoid polluting the icache with
> unused instructions.  I broke that with my most recent patch by moving
> the calls back into C. :-)

Maybe I only looked at the most recent patch.  It seemed to have lots
of calls.  To prevent inlining you probably need to use the noinline
attribute for some functions (a sketch is below, after the signature).

I don't see how the C version can be both simpler and as optimal as, or
more optimal than, the asm version.  It already has magic somewhat
self-documenting macros for branch prediction and magic undocumented
layout for the function calls etc. to improve branch prediction and
icache use.  For even-more-micro optimizations in libm, I try to do
everything in C, but the only way I can get near the efficiency that I
want is to look at the asm output and then figure out how to trick the
compiler into not being so stupid.  I could optimize it in asm with
less work (starting with the asm output, especially at first to learn
what works for SSE scheduling), but only for a single CPU type.

>> Some perfmon output for ./yield 100000 10:
>> ...
>> % # s/kx-fr-dispatch-stall-for-segment-load
>> % 134520281
>>
>> 134 cycles per call.  This may be more for ones in syscall()
>> generally.  I think each segreg load still costs ~20 cycles.  Since
>> this is on i386, there are 6 per call (%ds, %es and %fs save and
>> restore), plus %ss save and restore which might not be counted here.
>> 134 is a lot -- about 60nS of the 180nS for getpid().

I forgot about parallelism.  With 3-way execution on an Athlon, there
is at least a chance that all 3 segment registers are loaded in
parallel, taking only ~20 cycles for all 3, but no chance of proceeding
with other instructions if so.  OTOH, if only 1 or 2 ALUs can do segreg
loads, then the other ALUs may be able to proceed with independent
instructions.  We have some nearby instructions that depend on %ds
(these might benefit from using %ss), but few or no nearby dependencies
on %es and %fs.  Kernel code mostly doesn't worry about dependencies at
all.  Dependencies don't matter as much in integer code as in SSE/FPU
code.

>>> We've been working on amd64 so I can't comment specifically about
>>> i386 costs.  However, I definitely agree that cpu_switch() is not
>>> the greatest overhead in the path.  Also, you have to load cr3 even
>>> for kernel threads because the page directory page or page directory
>>> pointer table at %cr3 can go away once you've switched out the old
>>> thread.
>>
>> I don't see this.  The switch is avoided if %cr3 wouldn't change,
>> which I think usually or always happens for switches between kernel
>> threads.
>
> I see, you're saying 'between kernel threads'.  There was some
> discussion of allowing kernel threads to use the page tables of
> whichever thread was last switched in to avoid cr3 in all cases for
> them.  This requires other changes to be safe however.

Probably a good idea.  (The %cr3 test is also sketched below.)

Bruce
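To make the noinline point concrete: a standalone sketch of keeping a
rarely-called static function out of line despite gcc's auto-inlining.
The attribute is standard gcc syntax (newer FreeBSD <sys/cdefs.h>
spells it __noinline); the function and variable names are hypothetical:

    #include <stdio.h>

    /*
     * Without the attribute, gcc-4.1's auto-inlining may merge this
     * static, called-once function into its sole caller, dragging
     * rarely-executed instructions into the caller's icache.
     */
    static __attribute__((noinline)) void
    restore_rare_state(void)
    {
            puts("rare state restored out of line");
    }

    static volatile int rare_flags;  /* stands in for rarely-set flag bits */

    int
    main(void)
    {
            if (__builtin_expect(rare_flags != 0, 0))
                    restore_rare_state();
            return (0);
    }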
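And for the %cr3 point: a standalone model of the avoidance test.  In
the kernel this lives in cpu_switch() and the load is a privileged mov
to %cr3; here the load is a stub so the sketch runs in userland, and
the layout is illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t cur_cr3;                /* models the %cr3 register */

    static void
    load_cr3(uint64_t new_cr3)
    {
            /* A real mov to %cr3 also flushes the (non-global) TLB. */
            cur_cr3 = new_cr3;
            printf("TLB flushed, cr3 = %#jx\n", (uintmax_t)new_cr3);
    }

    static void
    switch_address_space(uint64_t new_cr3)
    {
            /*
             * Skip the load (and the TLB flush it implies) when the
             * incoming thread uses the same page tables -- the usual
             * case for switches between kernel threads.
             */
            if (new_cr3 != cur_cr3)
                    load_cr3(new_cr3);
    }

    int
    main(void)
    {
            switch_address_space(0x1000);   /* changed: load and flush */
            switch_address_space(0x1000);   /* unchanged: load avoided */
            return (0);
    }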