Date: Thu, 19 Jun 2008 17:42:02 +0200
From: "Attilio Rao" <attilio@freebsd.org>
To: "Peter Wemm" <peter@freebsd.org>
Cc: cvs-src@freebsd.org, src-committers@freebsd.org, cvs-all@freebsd.org
Subject: Re: cvs commit: src/sys/amd64/amd64 cpu_switch.S
Message-ID: <3bbf2fe10806190842s381611del5c5dc27d2dd22a7e@mail.gmail.com>
In-Reply-To: <200803232309.m2NN96Qa080896@repoman.freebsd.org>
References: <200803232309.m2NN96Qa080896@repoman.freebsd.org>

2008/3/24, Peter Wemm <peter@freebsd.org>:
> peter       2008-03-23 23:09:06 UTC
>
>   FreeBSD src repository
>
>   Modified files:
>     sys/amd64/amd64      cpu_switch.S
>   Log:
>   First pass at (possibly futile) microoptimizing of cpu_switch.  Results
>   are mixed.  Some pure context switch microbenchmarks show up to 29%
>   improvement.  Pipe based context switch microbenchmarks show up to 7%
>   improvement.  Real world tests are far less impressive as they are
>   dominated more by actual work than switch overheads, but depending on
>   the machine in question, workload, kernel options, phase of moon, etc, a
>   few percent gain might be seen.
>
>   Summary of changes:
>   - don't reload MSR_[FG]SBASE registers when context switching between
>     non-threaded userland apps.  These typically cost 120 clock cycles each
>     on an AMD cpu (less on Barcelona/Phenom).  Intel cores are probably no
>     faster on this.
>   - The above change only helps unthreaded userland apps that tend to use
>     the same value for gsbase.  Threaded apps will get no benefit from this.
>   - reorder things like accessing the pcb to be in memory order, to give
>     prefetching a better chance of working.  Operations are now in increasing
>     memory address order, rather than reverse or random.
>   - Push some lesser used code out of the main code paths.  Hopefully
>     allowing better code density in cache lines.  This is probably futile.
>   - (part 2 of previous item) Reorder code so that branches have a more
>     realistic static branch prediction hint.  Both Intel and AMD cpus
>     default to predicting branches to lower memory addresses as being
>     taken, and to higher memory addresses as not being taken.  This is
>     overridden by the limited dynamic branch prediction subsystem.  A trip
>     through userland might overflow this.
>   - Futile attempt at spreading the use of the results of previous operations
>     in new operations.  Hopefully this will allow the cpus to execute in
>     parallel better.
>   - stop wasting 16 bytes at the top of kernel stack, below the PCB.
>   - Never load the userland fs/gsbase registers for kthreads, but preserve
>     curpcb->pcb_[fg]sbase as caches for the cpu. (Thanks Jeff!)
>
>   Microbenchmarking this code seems to be really sensitive to things like
>   scheduling luck, timing, cache behavior, tlb behavior, kernel options,
>   other random code changes, etc.
>
>   While it doesn't help heavy userland workloads much, it does help high
>   context switch loads a little, and should help those that involve
>   switching via kthreads a bit more.
>
>   A special thanks to Kris for the testing and reality checks, and Jeff for
>   tormenting me into doing this. :)
>
>   This is still work-in-progress.

It looks like this patch introduces a regression.
In particular, this chunk:

@@ -181,82 +166,138 @@ sw1:
        cmpq    %rcx, %rdx
        pause
        je      1b
-       lfence
 #endif

is not totally right as we want to enforce an acq

-- 
Peace can only be achieved by understanding - A. Einstein
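
For readers following the lfence hunk quoted above: the message is cut off at
"acq", so reading it as a concern about acquire ordering after the spin loop is
an assumption. Under that assumption, the pattern can be sketched in portable
C11 as follows. Every name here (lock_sketch, thread_sketch,
blocked_lock_sketch, wait_for_unblock, unblock) is a hypothetical stand-in, not
the FreeBSD definition, and the sketch makes no claim about what the amd64
assembly actually requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the kernel's types. */
struct lock_sketch { int unused; };

struct thread_sketch {
	_Atomic(struct lock_sketch *) td_lock;	/* sentinel while blocked */
	int saved_state;			/* written before unblocking */
};

/* Sentinel meaning "this thread is still being switched out elsewhere". */
static struct lock_sketch blocked_lock_sketch;

/*
 * Spin until td_lock stops being the sentinel, then issue an acquire
 * fence so that later reads cannot be ordered before the load that saw
 * the thread become unblocked.
 */
static struct lock_sketch *
wait_for_unblock(struct thread_sketch *td)
{
	struct lock_sketch *l;

	assert(td != NULL);
	while ((l = atomic_load_explicit(&td->td_lock,
	    memory_order_relaxed)) == &blocked_lock_sketch)
		;	/* the assembly loop issues "pause" here */
	atomic_thread_fence(memory_order_acquire);
	return (l);
}

/*
 * Publishing side: store the state first, then release td_lock so a
 * waiter that observes the new lock also observes saved_state.
 */
static void
unblock(struct thread_sketch *td, struct lock_sketch *new_lock, int state)
{
	td->saved_state = state;
	atomic_store_explicit(&td->td_lock, new_lock, memory_order_release);
}
```

The relaxed load plus a single acquire fence after the loop keeps the barrier
out of the spinning path, which mirrors the position the removed lfence had
after the je in the hunk.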