Date: Fri, 9 Oct 2009 02:20:26 -0500 (CDT)
From: Scott Bennett
Message-Id: <200910090720.n997KQ0D016125@mp.cs.niu.edu>
To: freebsd-questions@freebsd.org, Pierre-Luc Drouin
Subject: When is it worth enabling hyperthreading?

On Wed, 07 Oct 2009 23:24:48 -0400 Pierre-Luc Drouin wrote:
>Could someone explain me in which cases it is useful to enable
>hyperthreading on a machine running FreeBSD 8.0 and in which other cases
>it is not a good idea? Is that possible that hyperthreading is
>disadvantageous unless the number of active (non-sleeping) threads is
>really high?
>
>For example, if I have an i7 CPU with 4 physical cores and that I run
>some multi-threaded code that has only 4 threads, it will run almost
>always (twice) slower with hyperthreading enabled than when I disable it
>in the BIOS. If I understand correctly, hyperthreading has the advantage
>of being able to do CPU context switching faster than the OS, but it

No.  Both contexts execute simultaneously.  Each of the two logical
CPUs in a core has its own set of registers, LDT and GDT pointer
registers, and instruction counter.
Both compete for the same remaining set of resources: DAT, TLB, FPU,
cache (all levels for a given core), buses to off-chip resources,
and--most critically--pipeline slots per clock cycle.  (What the logical
CPUs execute are called "CPU threads" or "hyperthreads".)  Any time a
resource shared by the two logical CPUs is in use by one logical CPU, it
is unavailable to the other.  If a logical CPU needs a resource that the
other logical CPU is using, the latecomer's processing is suspended
until the resource is released.  Such a lockout is not directly
detectable in software because the locked-out instruction is still in
execution; it's just taking more than the usual number of cycles to
complete.

On a P4 Prescott chip or the late models of single-core Xeons, the
pipeline structure is apparently less than ideal for sustained
simultaneous execution; i.e., there are frequent pairings of
instructions that together require more pipeline slots of a given type
than are available, which causes one of them to spin until the other
moves on, opening the next cycle's set of pipeline slots.  A simple case
can demonstrate the problem, although on most systems this example would
likely be infrequent.  There is only one FPU pipeline on these chips, so
two floating-point instructions executing simultaneously will result in
one getting the FPU pipeline slot for the current cycle, while the other
one spins until the next cycle, whereupon the other side will spin, etc.
What is actually the more common occurrence is that other types of
instruction pairs will require, for example, four slots of a type that
only has three pipelines.

The Core i7 chips (don't know about the other Core iN series) are
alleged to have an improved assortment of pipelines w.r.t.
typical instruction mixes, although I think there is still only one FPU
per core, so the parallelism is supposed to be rather more effective on
these chips than on their forerunners in the Pentium/Xeon series.

It has been quite a while since I last tried measuring it, but IIRC, a
"make buildworld" on my 3.4 GHz P4 Prescott takes about one to two
minutes longer elapsed time in non-hyperthreading mode with MAKEFLAGS
set to "-j3" than it does with hyperthreading enabled and MAKEFLAGS set
to "-j5" (i.e., something like 52 - 53 minutes instead of 51 minutes and
a few seconds).  Your quad-core Core i7 chips ought to provide a much
greater benefit with hyperthreading enabled, relatively speaking.

The traditional recommendation for the -j flag for make(1) is 3*nCPUs,
but hyperthreading doesn't give you a full CPU's worth of extra
processing, so your quad-core chips won't give you a full 8 CPUs' worth.
In other words, a single, large, parallel make job probably should have
-j set to something under 24 (3 * 8 logical CPUs) yet still greater than
12 (3 * 4 physical cores), as a guess perhaps 20ish. :-)  But do try it
yourself at different -j values, and let us know how your timings turn
out on that chip, along with the model number of the chip.

>does this context switching systematically instead of only when
>requested, so it slows things down unless the number of running
>(non-sleeping) threads is greater or equal to let say the number of
>physical threads x 1.5-1.75.
>
In general, there is a slight gain, although running parallel
floating-point activities is a break-even situation and not worth the
bother unless you're just trying to learn OpenMP or some such.  When
I've disabled hyperthreading, interactive response has often seemed a
tad less snappy when running some CPU-bound process at the same time.
OTOH, with hyperthreading enabled, I sometimes notice a bit more
jerkiness in things like scrolling in firefox, but it's not easy to tell
what's really happening there because firefox typically has at least 7
threads itself. :-)

Like Bill Moran said, user interfaces do seem a bit more responsive,
and I haven't seen any noticeable *loss* in overall performance.  The
"make buildworld" example runs long enough to give some idea, and it
always runs a little bit faster under hyperthreading than in
uniprocessor mode.  A "make buildkernel" also shows a bit of
improvement.  I've never seen any dramatic improvement, but the slight
improvement is sometimes apparent.

Also, when running Windows XP, having hyperthreading enabled has
allowed me to get out from under some runaway, single-threaded process,
even though doing so can take a while because the runaway process does
compete vigorously for the shared resources discussed above. :-)
Nevertheless, without the extra logical CPU, a manual reboot would have
been necessary to regain control of the machine.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet: bennett at cs.niu.edu                                    *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************
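
The -j arithmetic in the message above can be sketched in a few lines of
Python.  This is only an illustration, not a tuning rule: the function
names (suggested_jobs_range, candidate_jobs) and the step between
candidates are assumptions made for the example, and the multiplier of 3
jobs per CPU is simply the traditional rule of thumb quoted in the
message.  The lower bound counts physical cores only; the upper bound
treats every logical CPU as a full CPU, which hyperthreading does not
actually deliver, so the useful value presumably lies somewhere between.

```python
def suggested_jobs_range(physical_cores, logical_per_core=2, jobs_per_cpu=3):
    """Return (low, high) bounds for make's -j value.

    low  = jobs_per_cpu * physical cores (hyperthreading ignored)
    high = jobs_per_cpu * logical CPUs   (each hyperthread counted as a full CPU)

    All names and defaults here are hypothetical, chosen for this sketch.
    """
    if physical_cores < 1 or logical_per_core < 1 or jobs_per_cpu < 1:
        raise ValueError("all arguments must be positive integers")
    low = jobs_per_cpu * physical_cores
    high = jobs_per_cpu * physical_cores * logical_per_core
    return low, high


def candidate_jobs(low, high, step=4):
    """List the -j values from low to high (inclusive) worth timing by hand."""
    if step < 1:
        raise ValueError("step must be a positive integer")
    return list(range(low, high + 1, step))


if __name__ == "__main__":
    # Quad-core chip with two logical CPUs per core, as in the question.
    low, high = suggested_jobs_range(4)
    print("try -j values between", low, "and", high)
    print("candidates:", candidate_jobs(low, high))
```

For a quad-core chip with hyperthreading this reproduces the 12-to-24
window from the message; the candidate list is just a set of -j values
one might time with "make buildworld" to see where the elapsed time
bottoms out on a particular chip.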