Date:      Thu, 13 Mar 2008 13:36:47 -1000 (HST)
From:      Jeff Roberson <jroberson@chesapeake.net>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        arch@freebsd.org, David Xu <davidxu@freebsd.org>, Peter Wemm <peter@wemm.org>
Subject:   Re: amd64 cpu_switch in C.
Message-ID:  <20080313132152.Y1091@desktop>
In-Reply-To: <20080313230809.W32527@delplex.bde.org>
References:  <20080310161115.X1091@desktop> <47D758AC.2020605@freebsd.org> <e7db6d980803120125y41926333hb2724ecd07c0ac92@mail.gmail.com> <20080313124213.J31200@delplex.bde.org> <20080312211834.T1091@desktop> <20080313230809.W32527@delplex.bde.org>


On Fri, 14 Mar 2008, Bruce Evans wrote:

> On Wed, 12 Mar 2008, Jeff Roberson wrote:
>
>> On Thu, 13 Mar 2008, Bruce Evans wrote:
>> 
>>> On Wed, 12 Mar 2008, Peter Wemm wrote:
>>> 
>>>> On Tue, Mar 11, 2008 at 9:14 PM, David Xu <davidxu@freebsd.org> wrote:
>>>>> Jeff Roberson wrote:
>>>>> > http://people.freebsd.org/~jeff/amd64.diff
>>>>>
>>>>>  This is a good idea.
>>> 
>>> I wouldn't have expected it to make much difference.  On i386 UP,
>>> cpu_switch() normally executes only 48 instructions for in-kernel
>>> context switches in my version of 5.2 and only 61 instructions in
>>> -current.  ~5.2 differs from 5.2 here only in not having to
>>> switch %eflags.  This saves 4 instructions but much more in cycles,
>>> especially in P4 where accesses to %eflags are very slow.  5.2 would
>>> take 52 instructions, and -current has bloated by 9 instructions
>>> relative to 5.2.
>> 
>> More expensive than the raw instruction count is:
>> 
>> 1)  The mispredicted branches to deal with all of the optional state and 
>> features that are not always saved.
>
> This is unlikely to matter, and apparently doesn't, at least in simple
> benchmarks, since the C version has even more branches.  Features that
> are rarely used cause branches that are usually perfectly predicted.

The C version has two fewer branches because it tests for two unlikely
features together.  It has a few more branches than the asm version in cvs,
and the same number of extra branches as Peter's asm version, to support
conditional gs/fsbase setting.  The other extra branches have to do with
supporting cpu_switch() and cpu_throw() together.
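
To illustrate the idea (the flag and helper names here are made up, not the
ones in the diff): two rarely-set features can share a single
predicted-not-taken branch on the common path, roughly:

	/*
	 * Sketch only: fold two rarely-set flags into one test so the
	 * common path pays for one branch instead of two.
	 */
	if (__predict_false(newpcb->pcb_flags &
	    (PCB_FEATURE_A | PCB_FEATURE_B))) {
		if (newpcb->pcb_flags & PCB_FEATURE_A)
			switch_feature_a(newpcb);	/* hypothetical */
		if (newpcb->pcb_flags & PCB_FEATURE_B)
			switch_feature_b(newpcb);	/* hypothetical */
	}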

>
>> 2)  The cost of extra icache for getting over all of those unused 
>> instructions, unaligned jumps, etc.
>
> Again, if this were the cause of slowness then it would affect the C
> version more, since the C version is larger.

The C version is not larger than the asm version at high optimization
levels when you consider the total instruction count that is brought into
the icache.  It's worth noting that my C version is slower in some cases
outside the microbenchmark due to extra instructions for optimizations
that don't matter there.  Peter's asm version is tight enough that the
extra compares don't cost more than the compacted code wins.  The C
version touches more distinct icache lines but makes up for it with other
optimizations in the common case.

>
> In fact, the benchmark is probably too simple to show the cost of
> branches.  Just doing sched_yield() in a loop gives the following
> atypical behaviour which may be atypical enough for the larger branch
> and cache costs for the C version to not have much effect:
> - it doesn't go near most of the special cases, so branches are
>  predictable (always non-special) and are thus predicted provided
>  (a) the CPU actually does reasonably good branch prediction, and
>  (b) the branch predictions fit in the branch prediction cache
>      (reasonably good branch prediction probably requires such a
>      cache).

This cache is surely virtually indexed, since it is consulted in the first 
few stages of the pipeline.  That means it's flushed on every switch, so 
we're probably coming in cold every time.

> - it doesn't touch much icache or dcache or branch-cache, so
>  everything probably stays cached.
>
> If just the branch-cache were thrashed, then reasonably good dynamic
> branch prediction is impossible and things would be slow.  In the C
> version, you use predict_true() and predict_false() a lot.  This
> might improve static branch prediction but makes little difference
> if the branch cache is working.

I doubt there are any cases where the branch cache is effective here.  I 
don't know that for certain, but it seems unlikely that it would be 
preserved across switches given the complexity of validating the addresses.
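
For what it's worth, the predict macros are only compile-time hints; in
FreeBSD they are thin wrappers around gcc's __builtin_expect (see
sys/cdefs.h), along the lines of:

	#define	__predict_true(exp)	__builtin_expect((exp), 1)
	#define	__predict_false(exp)	__builtin_expect((exp), 0)

so they affect static prediction and block layout only; they do nothing for
a dynamic predictor that comes in cold.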

>
> The C version uses lots of non-inline function calls.  Just the
> branches for this would have a significant overhead if the branches
> are mispredicted.  I think you are depending on gcc's auto-inlining
> of static functions which are only called once to avoid the full
> cost of the function calls.

I depend on it not inlining them to avoid polluting the icache with unused 
instructions.  I broke that with my most recent patch by moving the calls 
back into C.
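
One way to make that explicit rather than relying on gcc's heuristics is to
mark the cold helpers noinline; a sketch (the helper is invented for
illustration):

	/*
	 * Keep rarely-executed work out of line so it doesn't share
	 * icache lines with the hot switch path; the caller tests
	 * __predict_false(...) before calling it.
	 */
	static void __noinline
	switch_unlikely_state(struct thread *td)
	{
		/* debug registers, TSS changes, and similar rare state */
	}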

>
>> I haven't looked at i386 very closely lately but on amd64 the wrmsrs for 
>> fs/gsbase are very expensive.  On my 2ghz dual core opteron the optimized 
>> switch seems to take about 100ns.  The total switch from userspace to 
>> userspace is about 4x that.
>
> Probably avoiding these is the only significant difference between all
> the versions.  You use predict_false() for executing them.  Are fsbase
> and gsbase really usually constant across processes?

If they are not threaded, yes.
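
The comparison itself is cheap next to a serializing wrmsr, so the switch
only touches the MSRs when a base actually changes, roughly as below (which
MSR the user gsbase sits in at this point depends on the swapgs state, so
take the MSR names as illustrative):

	/*
	 * The hardware pays the full wrmsr cost even when the value
	 * written is identical, so compare first.
	 */
	if (__predict_false(newpcb->pcb_fsbase != oldpcb->pcb_fsbase))
		wrmsr(MSR_FSBASE, newpcb->pcb_fsbase);
	if (__predict_false(newpcb->pcb_gsbase != oldpcb->pcb_gsbase))
		wrmsr(MSR_KGSBASE, newpcb->pcb_gsbase);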

>
> 400nS is about what I get for i386 on 2.2GHz A64 UP too (6.17 S for
> ./yield 1000000 10).  getpid() on this machine takes 180nS so it is
> unreasonable to expect sched_yield() to take much less than a few hundred
> nS.
>
> Some perfmon output for ./yield 100000 10:
>
> % # s/kx-ls-microarchitectural-resync-by-self-mod-code % 0
> % # s/kx-ls-buffer2-full % 909905
> % # s/kx-ls-retired-cflush-instructions % 0
> % # s/kx-ls-retired-cpuid-instructions % 0
> % # s/kx-dc-accesses % 496436422
> % # s/kx-dc-misses % 11102024
>
> 11 cache dmisses per yield.  Probably the main cause of slowness (main
> memory latency on this machine is 42 nsec so 11 cache misses takes
> 462 of the 617 nS per call?).

Yes, I reduced that recently by reordering struct tdq and td_sched somewhat. 
It would be even better if we could group the scheduling-related fields of 
td_* near the bottom with td_sched.  This would require more tedious 
initialization in fork and would be prone to being disturbed by people 
adding fields to struct thread wherever they please.  Ultimately it doesn't 
matter that much except in these microbenchmarks anyway.
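
Roughly what the grouping would look like (the member names are
placeholders, not a proposed layout):

	/*
	 * Sketch: keep the members the scheduler touches on every switch
	 * adjacent, ideally in one or two cache lines next to td_sched,
	 * instead of scattered through struct thread.
	 */
	struct thread {
		/* ... members rarely touched at switch time ... */

		/* switch-hot members grouped at the bottom */
		int		td_sw_flags;	/* placeholder */
		u_char		td_sw_pri;	/* placeholder */
		struct td_sched	*td_sched;
	};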

>
> % # s/kx-dc-refills-from-l2 % 0
> % # s/kx-dc-refills-from-system % 0
> % # s/kx-dc-writebacks % 0
> % # s/kx-dc-l1-dtlb-miss-and-l2-dtlb-hits % 3459100
> % # s/kx-dc-l1-and-l2-dtlb-misses % 2138231
> % # s/kx-dc-misaligned-references % 87
> % # s/kx-dc-microarchitectural-late-cancel-of-an-access % 73146415
> % # s/kx-dc-microarchitectural-early-cancel-of-an-access % 236927303
> % # s/kx-bu-cpu-clk-unhalted % 1303921314
> % # s/kx-ic-fetches % 236207869
> % # s/kx-ic-misses % 22988
>
> Insignificant icache misses.
>
> % # s/kx-ic-refill-from-l2 % 18979
> % # s/kx-ic-refill-from-system % 4191
> % # s/kx-ic-l1-itlb-misses % 0
> % # s/kx-ic-l1-l2-itlb-misses % 1619297
> % # s/kx-ic-instruction-fetch-stall % 1034570822
> % # s/kx-ic-return-stack-hit % 20822416
> % # s/kx-ic-return-stack-overflow % 5870
> % # s/kx-fr-retired-instructions % 701240247
> % # s/kx-fr-retired-ops % 1163464391
> % # s/kx-fr-retired-branches % 121636370
> % # s/kx-fr-retired-branches-mispredicted % 2761910
> % # s/kx-fr-retired-taken-branches % 93488548
> % # s/kx-fr-retired-taken-branches-mispredicted % 2848315
>
> 2.8 branches mispredicted per call.
>
> % # s/kx-fr-retired-far-control-transfers % 2000934
>
> 1 int0x80 and 1 iret per sched_yield(), and apparently not much else.
>
> % # s/kx-fr-retired-resync-branches % 936968
> % # s/kx-fr-retired-near-returns % 19008374
> % # s/kx-fr-retired-near-returns-mispredicted % 784103
>
> 0.8 returns mispredicted per call.
>
> % # s/kx-fr-retired-taken-branches-mispred-by-addr-miscompare % 721241
> % # s/kx-fr-interrupts-masked-cycles % 658462615
>
> Ugh, this is from spinlocks bogusly masking interrupts.  More than half
> the cycles have interrupts masked.  This at least shows that lots of
> time is being spent near cpu_switch() with a spinlock held.
>

I'm not sure why you feel masking interrupts in spinlocks is bogus.  It's 
central to our SMP strategy, unless you think we should do it lazily as we 
do with critical_*.  I know jhb had that working at one point but it was 
abandoned.
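
For reference, the masking comes from the spinlock_enter()/spinlock_exit()
nesting in the MD code, which is roughly (simplified):

	void
	spinlock_enter(void)
	{
		struct thread *td;
		register_t flags;

		td = curthread;
		if (td->td_md.md_spinlock_count == 0) {
			flags = intr_disable();
			td->td_md.md_spinlock_count = 1;
			td->td_md.md_saved_flags = flags;
		} else
			td->td_md.md_spinlock_count++;
		critical_enter();
	}

Interrupts stay disabled from the first spin lock acquired until the
outermost one is released, which is what the masked-cycles counter is
seeing around cpu_switch().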

> % # s/kx-fr-interrupts-masked-while-pending-cycles % 9365
>
> Since the CPU is reasonably fast, interrupts aren't masked for very long
> each time.  This maximum is still 4.5 uS.
>
> % # s/kx-fr-hardware-interrupts % 63
> % # s/kx-fr-decoder-empty % 247898696
> % # s/kx-fr-dispatch-stalls % 589228741
> % # s/kx-fr-dispatch-stall-from-branch-abort-to-retire % 39894120
> % # s/kx-fr-dispatch-stall-for-serialization % 44037193
> % # s/kx-fr-dispatch-stall-for-segment-load % 134520281
>
> 134 cycles per call.  This may be more for ones in syscall() generally.
> I think each segreg load still costs ~20 cycles.  Since this is on
> i386, there are 6 per call (%ds, %es and %fs save and restore), plus
> the %ss save, which might not be counted here.  134 is a lot -- about
> 60nS of the 180nS for getpid().
>
> % # s/kx-fr-dispatch-stall-when-reorder-buffer-is-full % 18648001
> % # s/kx-fr-dispatch-stall-when-reservation-stations-are-full % 121485247
> % # s/kx-fr-dispatch-stall-when-fpu-is-full % 19
> % # s/kx-fr-dispatch-stall-when-ls-is-full % 203578275
> % # s/kx-fr-dispatch-stall-when-waiting-for-all-to-be-quiet % 63136307
> % # s/kx-fr-dispatch-stall-when-far-xfer-or-resync-br-pending % 6994131
>
>>> In-kernel switches are not a very typical case since they don't load
>>> %cr3...
>> 
>> We've been working on amd64 so I can't comment specifically about i386 
>> costs. However, I definitely agree that cpu_switch() is not the greatest 
>> overhead in the path.  Also, you have to load cr3 even for kernel threads 
>> because the page directory page or page directory pointer table at %cr3 can 
>> go away once you've switched out the old thread.
>
> I don't see this.  The switch is avoided if %cr3 wouldn't change, which
> I think usually or always happens for switches between kernel threads.

I see, you're saying 'between kernel threads'.  There was some discussion 
of allowing kernel threads to use the page tables of whichever thread was 
last switched in, to avoid the cr3 load for them in all cases.  This 
requires other changes to be safe, however.
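
In C terms the existing check is just (illustrative; the asm compares the
cached copy of %cr3 rather than reading the register):

	/*
	 * Skip the TLB-flushing %cr3 load when the new thread shares page
	 * tables with the old one, e.g. threads of the same process.
	 */
	if (newpcb->pcb_cr3 != oldpcb->pcb_cr3)
		load_cr3(newpcb->pcb_cr3);

Letting kernel threads borrow whatever page tables are live would extend
that to them, but it requires guaranteeing the borrowed pmap can't be torn
down underneath them, which is part of why other changes are needed first.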

>
>>> The asm code already saves only call-saved registers for both i386 and
>>> amd64.  It saves call-saved registers even when it apparently doesn't
>>> use them (lots more of these on amd64, while on i386 it uses more
>>> call-saved registers than it needs to, apparently since this is free
>>> after saving all call-saved registers).  I think saving more than is
>>> needed is the result of confusion about what needs to be saved and/or
>>> what is needed for debugging.
>> 
>> It has to save all of the callee saved registers in the PCB because they 
>> will likely differ from thread to thread.  Failing to save and restore them 
>> could leave you returning with the registers having different values and 
>> corrupt the calling function.
>
> Yes, I had forgotten the detail of how the non-local flow of control can
> change the registers (the next call to the function in the context of
> the switched-to-process may have different values in the registers due
> to changes to the registers in callers).
>
> All that can be done differently here is saving all the registers on the
> stack (except %esp) in the usual way.  This would probably be faster on
> old i386's using pushal or pushl, but on amd64 pushal is not available,
> and on Athlons generally (before Barcelona?) it is faster not to use pushl,
> so on amd64 the registers should be saved using movl and then it is just
> as easy to put them in the pcb as on the stack.
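
For amd64 the set that must survive the non-local transfer is small; the
pcb only needs room for something like this (names approximate, switch-
relevant part only):

	struct pcb {
		/*
		 * Callee-saved registers, stack pointer and return
		 * address, stored with movq into the pcb since long mode
		 * has no pushal equivalent.
		 */
		register_t	pcb_r15, pcb_r14, pcb_r13, pcb_r12;
		register_t	pcb_rbp, pcb_rsp, pcb_rbx, pcb_rip;
		/* ... fsbase/gsbase, flags, FPU state, etc. ... */
	};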
>
>>>> The good news is that this tuning is finally being done.  It should
>>>> have been done in 2003 though...
>>> 
>>> How is this possible with (according to my theory) most of the context
>>> switch cost being for %cr3 and upper layers?  Unchanged amd64 has only
>>> a few more costs than i386.  Mainly 3 unconditional wrmsr's and 2
>>> unconditional rdmsr's for managing gsbase and fsbase.  I thought that
>>> these were hard to avoid and anyway not nearly as expensive as %cr3 loads.
>> 
>> %cr3 is actually a lot less expensive these days with page table flush 
>> filters and the PG_G bit.  We were able to optimize away setting the msrs 
>> in the case that the previous values match the new values.  Apparently the 
>> hardware doesn't optimize this case so we have to do comparisons ourselves.
>> 
>> That was a big chunk of the optimization.  Static branch hints, reordering 
>> code, possibly reordering for better pipeline scheduling in peter's asm, 
>> etc. provide the rest.
>
> All the old i386 asm and probably clones of it on amd64 is certainly not
> optimized globally for anything newer than an i386 (barely even an i486).
> This rarely matters however.  It lost more on Pentium-1's, but now out of
> order execution and better branch prediction hides most inefficiencies.
>
> Bruce
>

Jeff


