Date: Mon, 25 Aug 2003 13:35:09 -0400 (EDT)
From: Garrett Wollman <wollman@khavrinen.lcs.mit.edu>
To: "Daniel C. Sobral" <dcs@tcoip.com.br>
Cc: current@freebsd.org
Subject: Re: HTT on current
Message-ID: <200308251735.h7PHZ9bd094222@khavrinen.lcs.mit.edu>
In-Reply-To: <3F4A43EA.9090500@tcoip.com.br>
References: <JCEIKJMCANNPGKFKGLKLOENEDJAA.mikej@trigger.net> <3F4A1CE2.6080806@freebsd.org> <20030825164907.GA17503@dragon.nuxi.com> <3F4A43EA.9090500@tcoip.com.br>
<<On Mon, 25 Aug 2003 14:14:18 -0300, "Daniel C. Sobral" <dcs@tcoip.com.br> said:

> There are two problems with HTT. First, L1/L2 cache issues. Second, the
> virtual CPUs are not independent, and there are many cases where
> instructions in one virtual CPU stall the other. So take, for example,
> the case of a userland application on CPU0 stalling the kernel on CPU1.

I don't think that this is quite stated right. The problem is that
the P4 is not very wide to begin with, and it's very hard to optimize
well for that 23-stage pipeline.[1] So if you have a thread with lots
of latent ILP (either because you did a good job optimizing it for a
four-way superscalar, or because you did a bad job scheduling it and
are depending on the processor to make up for the naive optimization),
it is bound to run more slowly when some of the functional units it
could have used are taken by another thread of execution.

But some sorts of applications can benefit, if the application can be
decomposed into threads that exercise different FUs (for example, one
thread that is memory intensive and one thread that is compute
intensive). The challenge then is to make sure that they always get
scheduled on the same processor at the same time.

The key to getting good performance on an SMT architecture with an
arbitrary instruction mix is more functional units. The never-built
Alpha EV8, which was to be an eight-way superscalar with four-way SMT
and a wide memory bus, would have been much easier to get optimum
performance out of.

-GAWollman

[1] That's why the Athlon gets more instructions per cycle: it has a
much shallower pipeline and more functional units, so it can execute
naively optimized, ILP-heavy code much faster without stalling.
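
The functional-unit argument above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not a description of how the P4 or any real core arbitrates: each thread is reduced to a fixed per-cycle demand for each kind of unit, an oversubscribed unit is shared in proportion to demand, and a thread runs at the speed of its most contended unit. The names `co_run_speeds`, `NARROW`, `WIDE`, `COMPUTE`, and `MEMORY`, and all the unit counts, are hypothetical.

```python
# Toy model of two SMT threads sharing one core's functional units.
# All numbers are illustrative assumptions, not measurements of real hardware.

def co_run_speeds(capacity, demand_a, demand_b):
    """Return (speed_a, speed_b) relative to each thread running alone.

    capacity: units of each kind the core can issue to per cycle.
    demand_*: units of each kind a thread would use per cycle if it had
              the core to itself.
    A unit that is oversubscribed is shared in proportion to demand, and
    a thread is limited by its most contended unit.
    """
    def speed(me):
        s = 1.0
        for unit, need in me.items():
            if need == 0:
                continue  # this thread does not use the unit at all
            total = demand_a.get(unit, 0) + demand_b.get(unit, 0)
            if total > capacity[unit]:
                s = min(s, capacity[unit] / total)
        return s

    return speed(demand_a), speed(demand_b)


NARROW = {"alu": 2, "mem": 1}   # hypothetical narrow core
WIDE = {"alu": 4, "mem": 2}     # hypothetical core with more units

COMPUTE = {"alu": 2}            # ILP-heavy thread that can fill both ALUs
MEMORY = {"mem": 1}             # memory-bound thread

# Two compute-heavy threads fight over the same ALUs on the narrow core.
print(co_run_speeds(NARROW, COMPUTE, COMPUTE))
# A compute-heavy and a memory-heavy thread use different units.
print(co_run_speeds(NARROW, COMPUTE, MEMORY))
# With more functional units, even the two compute-heavy threads fit.
print(co_run_speeds(WIDE, COMPUTE, COMPUTE))
```

In this model the two compute-heavy threads each drop to half speed on the narrow core, while the mixed pair and the wider core show no slowdown, which is the qualitative point made in the message. Real contention also involves caches, the pipeline front end, and instruction scheduling, none of which the sketch attempts to capture.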
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?200308251735.h7PHZ9bd094222>