Date: Mon, 16 Feb 2004 15:58:46 -0800 (PST)
From: Matthew Dillon
To: "Juan Tumani"
cc: des@des.no
cc: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD 5.2 v/s FreeBSD 4.9 MFLOPS performance (gcc3.3.3 v/s gcc2.9.5)
Message-Id: <200402162358.i1GNwkkm088106@apollo.backplane.com>

:
:Thanks Matt for picking up on the linker problem.  Patching the kernel
:would, to me, be masking the real problem.
:
:What other "improvements" does gcc333 have over gcc295 that might
:explain why it's linked products run in a half-fast mode (take twice+
:as long)?
:
:JT

    I do not see a 50% loss in performance in my tests, but the GCC3 on
    DragonFly is a later snapshot (gcc-3.3-20040126).  Generally speaking,
    GCC3 does a better job at -O2 than GCC2 when I optimize for my
    Athlon64 (-O2 and -O3 give the same results on GCC3 in my tests).

    These tests were run on an Athlon 64 3200+, on a DragonFly system of
    course (which has both gcc2 and gcc3 in the base system):

                GCC2    GCC2    GCC2    GCC3    GCC3    GCC3    GCC3
                -O      -O2     -O2/k6  -O      -O2     -O2     -O2
                                                        athlon  athlon
                                                                stackbndry=5

    MFLOPS(1)   1111    1071    1068    794     926     862     861
    MFLOPS(2)   832     818     810     789     825     855     857
    MFLOPS(3)   1131    1121    1105    1021    1134    1208    1208
    MFLOPS(4)   1306    1356    1350    1156    1346    1460    1456

    GCC3 only loses in MFLOPS(1).  When I looked at the assembly generated
    for MFLOPS(1) by GCC2 and GCC3, two things stood out:

    * GCC2 does a few extra stack-relative memory ops and they are
      spread out more.  GCC3 does fewer memory ops and they are
      concentrated at the beginning and the end of the loop code.

    * GCC2 uses fld %st(x) to shift the FP stack around, while GCC3
      uses fxch %st(x) to shift the FP stack around.

    Since we know FP operations are stack-alignment-sensitive, I can see
    how a stack misalignment can result in terrible performance.  What is
    less certain is whether (FP-aligned) accesses to *different* data-cache
    lines affect performance, and that is something that GCC does not
    seem to optimize.

    My guess, at least in regards to MFLOPS(1), for which GCC3 generates
    consistently worse results on my machine, is that FXCH (exchange FP
    reg with top of FP stack) is not as well optimized in hardware as
    FLD (load to top of FP stack), at least on my Athlon 64.

    This also points to the fact that both Intel and AMD have done major
    reoptimizations of their floating point instruction set in nearly
    every processor release they've ever done.  The performance loss you
    are seeing on your machine could very well turn into a performance
    gain on a different CPU.

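    The fld/fxch comparison described above can be repeated on other
    compiler output by counting mnemonics in the assembly that `gcc -S`
    produces.  What follows is only a minimal sketch in Python: the helper
    name count_mnemonics and the sample fragment are hypothetical, and the
    fragment is hand-written for illustration, not actual GCC2 or GCC3
    output.

```python
import re

# Matches an instruction mnemonic on an indented AT&T-syntax line, the
# layout `gcc -S` uses.  Labels (column 0) and directives (leading '.')
# do not match.
_MNEMONIC = re.compile(r"^\s+([a-z][a-z0-9]*)\b")


def count_mnemonics(asm_text, wanted=("fld", "fxch")):
    """Count occurrences of each wanted x87 mnemonic in asm_text.

    Size-suffixed forms (e.g. flds, fldl, fldt) are folded into the base
    name.  Distinct instructions such as fld1 or fldz are not counted.
    """
    counts = {name: 0 for name in wanted}
    for line in asm_text.splitlines():
        match = _MNEMONIC.match(line)
        if match is None:
            continue
        op = match.group(1)
        for name in wanted:
            if op == name or op in (name + "s", name + "l", name + "t"):
                counts[name] += 1
    return counts


# Hand-written illustrative fragment (an assumption, not real compiler output).
sample = """
.L5:
\tfldl\t-8(%ebp)
\tfld\t%st(1)
\tfmul\t%st(2), %st
\tfxch\t%st(1)
\tfaddp\t%st, %st(1)
\tfxch\t%st(2)
\tjne\t.L5
"""

print(count_mnemonics(sample))
```

    Running the same count over the inner loop emitted by each compiler
    gives a rough measure of how heavily each one leans on fxch versus
    fld.  It says nothing about which of the two is faster on a given CPU.
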
    On a DELL-2550 I get this:

    DELL2550 2xPentiumIII @ 1.1GHz

                GCC2    GCC3    GCC3    GCC3
                -O3     -O3     -O3     -O3
    -march=     (nil)   (nil)   p3      ppro

    MFLOPS(1)   380     290     283     283
    MFLOPS(2)   302     293     291     291
    MFLOPS(3)   454     459     462     463
    MFLOPS(4)   563     581     593     593

    My guess is that GCC3 introduced a bit of pessimization when they
    started over-using FXCH, and that the MFLOPS(1) code just happens to
    hit that case due to the huge number of FXCHs it uses.  It's probably
    stalling the instruction pipeline in a few more places.

                                                -Matt
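
    To put the MFLOPS(1) gap next to the "50% loss" under discussion, the
    figures from the two tables can be reduced to a percentage.  This is a
    rough sketch only: comparing the best GCC2 figure against the best
    GCC3 figure on each machine is an assumption made here for
    illustration, and the name best_case_change is hypothetical.

```python
# MFLOPS(1) figures copied from the two tables in this message.
results = {
    "Athlon 64 3200+": {
        "gcc2": [1111, 1071, 1068],
        "gcc3": [794, 926, 862, 861],
    },
    "Dell 2550 (2x PIII @ 1.1GHz)": {
        "gcc2": [380],
        "gcc3": [290, 283, 283],
    },
}


def best_case_change(gcc2, gcc3):
    """Percent change from the best GCC2 figure to the best GCC3 figure."""
    return 100.0 * (max(gcc3) - max(gcc2)) / max(gcc2)


for machine, r in results.items():
    change = best_case_change(r["gcc2"], r["gcc3"])
    print(f"{machine}: {change:+.1f}%")
# Athlon 64 3200+: -16.7%
# Dell 2550 (2x PIII @ 1.1GHz): -23.7%
```

    On both machines the MFLOPS(1) drop is well short of a halving.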