Date: Mon, 10 Mar 2008 16:26:17 -1000 (HST)
From: Jeff Roberson <jroberson@chesapeake.net>
To: arch@freebsd.org
Subject: amd64 cpu_switch in C.
Message-ID: <20080310161115.X1091@desktop>

http://people.freebsd.org/~jeff/amd64.diff

At the above address there is an implementation of cpu_switch() and cpu_throw() for amd64 almost entirely in C. I'm posting this for discussion and eventual commit.

There are numerous reasons to do this; I will outline some of them.

Implementing the bulk of the code in C allows us to add/modify higher level features more easily. For example, we can change the pmap active bits to use a cpuset_t so we can support more than 64 cpus.

It makes the code faster because we can do more complicated checks to save time, such as avoiding writing the fs/gsbase MSRs if they have not changed.

It makes the code faster because infrequently used options can be moved out of the normal code paths.

In fact, the C version is ~10% faster than the assembly version on a two-thread sched_yield() test on a single-cpu Opteron:

x asm.yield
+ csw.yield
[distribution plot: x = asm.yield, + = csw.yield]
    N           Min           Max        Median           Avg        Stddev
x  10          5.17          5.88           5.5         5.479    0.19272606
+  15          4.58          5.16          4.71     4.8126667    0.20738049
Difference at 95.0% confidence
	-0.666333 +/- 0.170431
	-12.1616% +/- 3.11062%
	(Student's t, pooled s = 0.201773)

This test measures the total time to call sched_yield() 10,000,000 times between two threads. Two threads are needed to be sure that the scheduler doesn't pick the same thread twice and skip cpu_switch(). The 10% speedup is notable because the cpu_switch() routine was consuming less than 40% of the cpu prior to the speedup. So cpu_switch() itself is almost 1/3rd faster.

Peter also suggested that we can delay portions of the switch until the user boundary. For workloads that involve heavy kernel activity on the user's part with multiple switches per syscall this would be a big savings.
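To make the fs/gsbase check mentioned above concrete, here is a minimal userland sketch of the idea. It is not the code in amd64.diff. The names fake_wrmsr(), struct fake_pcb, and switch_segment_bases() are hypothetical, and the MSR write is mocked with an array store and a counter so the logic can be exercised outside the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the two MSRs: indices into a fake register file,
 * not real MSR numbers. */
enum fake_msr { FAKE_FSBASE = 0, FAKE_GSBASE = 1, FAKE_MSR_COUNT = 2 };

static uint64_t fake_msr_file[FAKE_MSR_COUNT];
static unsigned long fake_wrmsr_calls;

/* Mock of the expensive privileged write. A kernel would execute the
 * wrmsr instruction here; this version only records the value. */
static void
fake_wrmsr(enum fake_msr msr, uint64_t value)
{
	assert((int)msr >= 0 && (int)msr < FAKE_MSR_COUNT);
	fake_msr_file[msr] = value;
	fake_wrmsr_calls++;
}

/* Hypothetical per-thread saved state holding the segment bases. */
struct fake_pcb {
	uint64_t fsbase;
	uint64_t gsbase;
};

/* Load the segment bases for 'next', skipping any write that would
 * store the value the outgoing thread already had loaded. */
static void
switch_segment_bases(const struct fake_pcb *prev, const struct fake_pcb *next)
{
	if (prev->fsbase != next->fsbase)
		fake_wrmsr(FAKE_FSBASE, next->fsbase);
	if (prev->gsbase != next->gsbase)
		fake_wrmsr(FAKE_GSBASE, next->gsbase);
}
```

Switching between two threads that share an fsbase but differ in gsbase costs one mocked write instead of two, and switching between threads with identical bases costs none. A comparison like this is cheap to express in C and awkward to justify in hand-written assembly, which is the point being made about more complicated checks.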
We could also use this as a framework to implement custom switch routines if we want to switch directly to ithreads or taskqueue threads in the future.

The C routine is supplemented by two assembly routines which are responsible for saving the core architecture state and manipulating the stack. These total approximately 50 assembly instructions and are similar to savecontext/swapcontext. The C code saves the old thread's context but still runs on its stack as it continues the switch. This is safe because the old thread is locked until we call "cpu_switchin()", which is similar to swapcontext.

The only appreciable downside is that it lowers the barrier to entry for modifying a very sensitive piece of code. Still, I think the flexibility it gives us outweighs those concerns.

Comments?

Thanks,
Jeff