Date: Sat, 17 Jun 2006 07:10:19 GMT
From: Bruce Evans <bde@zeta.org.au>
To: freebsd-bugs@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Message-ID: <200606170710.k5H7AJLf063598@freefall.freebsd.org>
The following reply was made to PR kern/98460; it has been noted by GNATS.

From: Bruce Evans <bde@zeta.org.au>
To: Rostislav Krasny <rosti.bsd@gmail.com>
Cc: freebsd-gnats-submit@FreeBSD.org
Subject: Re: kern/98460 : [kernel] [patch] fpu_clean_state() cannot be disabled for not AMD processors, those are not vulnerable to FreeBSD-SA-06:14.fpu
Date: Sat, 17 Jun 2006 17:01:27 +1000 (EST)

On Fri, 16 Jun 2006, Rostislav Krasny wrote:

> On Fri, 16 Jun 2006 22:50:01 +1000 (EST)
> Bruce Evans <bde@zeta.org.au> wrote:
>
>> Why are we worrying about just this and not all the other branches on
>> cpu_fxsr, not to mention all other branches in the kernel :-)?
>
> I think it is a matter of principle. AMD saved few microcommands in
> their incorrect implementation of two Pentium III instructions. And now
> buyers of their processors are paying much more than those few
> microcommands.

No, the non-AMD users pay much less (unless the cost of branch
prediction is very large). When I tried to measure the overhead for the
fix, I found that fxsave+fxrstor takes almost twice as long on a
P4(Xeon) as on an Athlon(XP,64). That's about 150 cycles longer IIRC.
The fix costs only 14 cycles. These measurements were in
microbenchmarks that loop (and in manuals that assume similar best-case
setups). The extra 150 cycles are free if they are spent in parallel
with integer operations. npxdna() only does the fxrstor half and has
limited parallelism, and I haven't measured how many of the extra 150/2
cycles are free (probably none). 14 cycles for the fix assumes no
branch misprediction.

14 cycles is a lot from one point of view, but from a practical point
of view it is the same as 0. Suppose that the kernel does 1000 context
switches per second per CPU (too many for efficiency since it thrashes
caches), and that an FPU switch occurs on all of these (it would
normally be much less than that, since half of all context switches are
often to kernel threads (and half back), and many threads don't use the
FPU). We then waste 14000 cycles per second, plus more for branch
misprediction and other cache effects. At 2GHz, 14000 cycles is a whole
7 us.

> Why should buyers of processors from other manufacturers,
> which implemented FXSAVE and FXRSTOR correctly, pay even a tiny bit of
> their performance for nothing?

Because they can't measure the difference? I think that unless you
modify millions of branches, there is more to be gained from things
like scheduling instructions so that high-latency instructions like
fxrstor are started early, but the gains here are still relatively
small and are better done by compilers and CPUs because the best
scheduling is machine-dependent.

Bruce
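
The back-of-envelope estimate in the message (14 cycles per FPU switch, 1000 FPU switches per second per CPU, a 2GHz clock) can be reproduced with a few lines of Python. This is only an illustrative sketch: the function and variable names are made up for the example and are not kernel identifiers, and it ignores branch misprediction and cache effects, as the estimate in the message does.

```python
# Back-of-envelope cost of the fpu_clean_state() fix, using the figures
# quoted in the message. All names are illustrative, not kernel symbols.

def overhead_per_second(cycles_per_switch, switches_per_second, clock_hz):
    """Return (cycles spent per second, seconds spent per second)."""
    wasted_cycles = cycles_per_switch * switches_per_second
    wasted_seconds = wasted_cycles / clock_hz
    return wasted_cycles, wasted_seconds

# Assumed inputs, taken from the message: 14 cycles per fix,
# 1000 FPU switches per second per CPU, 2GHz clock.
cycles, seconds = overhead_per_second(14, 1000, 2_000_000_000)
print(f"{cycles} cycles/s, about {seconds * 1e6:.1f} us per second of CPU time")
```

With those inputs the result matches the figures in the message: 14000 cycles per second, or roughly 7 microseconds out of every second per CPU.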
