Date: Fri, 1 Oct 2010 03:33:25 +1000 (EST)
From: Bruce Evans
To: Dimitry Andric
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
    src-committers@FreeBSD.org, Jung-uk Kim, Bruce Evans
Subject: Re: svn commit: r213281 - head/lib/libc/amd64/gen
Message-ID: <20101001025557.W700@delplex.bde.org>
In-Reply-To: <4CA49BE9.8040602@FreeBSD.org>

On Thu, 30 Sep 2010, Dimitry Andric wrote:

> On 2010-09-30 05:46, Bruce Evans wrote:
> ...
>> This file probably shouldn't exist, especially on amd64.  There are 4 or 5
>> versions of ldexp(), and this file implements what seems to be the worst
>> one, even without the bug.
>> ...
>
> The version in libc/gen/ldexp.c is just a copy of msun/src/s_scalbn.c,
> with some things like copysign() directly pasted in.  It even has:
>
> /* @(#)fdlibm.h 5.1 93/09/24 */
>
> at the top.

Bah, I missed this sixth version :-).

>> Testing indicates that the fdlibm C version is 2.5 times faster than the
>> asm versions on amd64 on a core2 (ref9), while on i386 the C version is
>> only 1.5 times faster.  The C code is a bit larger so benefits more from
>> being called from a loop.  The asm code uses a slow i387 instruction, and
>> on i387 it has to do expensive moves from xmm registers to i387 ones and
>> back.
>>
>> Times for 100 million calls:
>>
>>     amd64 libc ldexp:      3.18 seconds
>>     amd64 libm asm scalbn: 2.96
>>     amd64 libm C scalbn:   1.30
>>     i386 libc ldexp:       3.13
>>     i386 libm asm scalbn:  2.86
>>     i386 libm C scalbn:    2.11
>
> Seeing these results, I propose to just delete
> lib/libc/amd64/gen/ldexp.c and lib/libc/i386/gen/ldexp.c, which will
> cause the amd64 and i386 builds to automatically pick up
> lib/libc/gen/ldexp.c instead, which effectively is the fdlibm
> implementation.  (And no more clang workarounds needed. :)

I like this idea.

Does anyone have ideas for better testing?  The loop also benefits
machines with multiple pipelines and/or out-of-order execution.
Especially with the latter I think it is possible for several iterations
to be in progress at once (looks like an average of about 1.5 for AthlonXP
and later in other similar loop benchmarks).  In other benchmarks I use a
volatile variable to be more sure of defeating unwanted compiler
optimizations, but I don't want to enforce serialization since
non-benchmarks don't do that.  In libm functions, the largest
optimizations are from avoiding internal serialization as much as
possible.  Using the i387 functions tends to defeat this since there is
only 1 ALU for them (unlike for i387 addition, etc.; there are 2 ALUs for
that on AthlonXP and later).

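For concreteness, here is a minimal sketch of the kind of loop benchmark described above.  It is not the code that produced the times quoted earlier.  The names time_ldexp() and sink, the choice of inputs, and the use of clock() are all assumptions made for illustration.  The results are summed in a local variable and stored to a volatile only once, after the loop, so the calls cannot be discarded but successive iterations are not forced to serialize.

```c
#include <math.h>
#include <time.h>

/* Hypothetical sink; one volatile store keeps the loop's results live. */
static volatile double sink;

/*
 * Hypothetical helper: call ldexp() `calls` times in a loop and return
 * the elapsed CPU time in seconds.  The inputs vary with the loop counter
 * so that the compiler cannot fold the calls into a constant.
 */
double
time_ldexp(long calls)
{
	double sum = 0.0;
	clock_t start, end;
	long i;

	start = clock();
	for (i = 0; i < calls; i++)
		sum += ldexp(1.0 + (double)(i & 1023), (int)(i & 7));
	end = clock();

	/* Store once, outside the timed loop, to avoid per-call stores. */
	sink = sum;
	return ((double)(end - start) / CLOCKS_PER_SEC);
}
```

Calling time_ldexp(100000000L) would match the call count used for the table above.  The compiler may still treat ldexp() as a builtin, so building with -fno-builtin (gcc and clang) is probably needed for the libc or libm version to be the one measured.  The sum adds a short dependency chain of its own, which is small next to the call but not zero.
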
Perhaps the i387 functions will be relatively faster again someday when
there are more ALUs for them and better microcode in them, but x86
architects apparently consider this a low priority and/or the microcode is
too hard to make better than ordinary instructions.

I think big functions using ordinary instructions are OK if they are
slightly faster than i387 functions, since if they aren't called much then
it doesn't matter and if they are called much then they will stay cached.
But in the latter case, they will push other code out of caches; I don't
know how to quantify this.

Bruce