From owner-svn-src-all@FreeBSD.ORG  Thu Sep 30 17:33:29 2010
Return-Path: <owner-svn-src-all@FreeBSD.ORG>
Delivered-To: svn-src-all@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 9FC6C106564A;
	Thu, 30 Sep 2010 17:33:29 +0000 (UTC)
	(envelope-from brde@optusnet.com.au)
Received: from mail09.syd.optusnet.com.au (mail09.syd.optusnet.com.au
	[211.29.132.190])
	by mx1.freebsd.org (Postfix) with ESMTP id 34A458FC1B;
	Thu, 30 Sep 2010 17:33:28 +0000 (UTC)
Received: from c122-107-116-249.carlnfd1.nsw.optusnet.com.au
	(c122-107-116-249.carlnfd1.nsw.optusnet.com.au [122.107.116.249])
	by mail09.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id
	o8UHXQN5004113
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO);
	Fri, 1 Oct 2010 03:33:27 +1000
Date: Fri, 1 Oct 2010 03:33:25 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
X-X-Sender: bde@delplex.bde.org
To: Dimitry Andric <dim@FreeBSD.org>
In-Reply-To: <4CA49BE9.8040602@FreeBSD.org>
Message-ID: <20101001025557.W700@delplex.bde.org>
References: <201009292120.o8TLKTSf022159@svn.freebsd.org>
	<201009291812.26796.jkim@FreeBSD.org>
	<20100930125731.B2324@delplex.bde.org>
	<4CA49BE9.8040602@FreeBSD.org>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed
Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org,
	src-committers@FreeBSD.org, Jung-uk Kim <jkim@FreeBSD.org>,
	Bruce Evans <brde@optusnet.com.au>
Subject: Re: svn commit: r213281 - head/lib/libc/amd64/gen
X-BeenThere: svn-src-all@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: "SVN commit messages for the entire src tree \(except for \"
	user\" and \" projects\" \)" <svn-src-all.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/svn-src-all>
List-Post: <mailto:svn-src-all@freebsd.org>
List-Help: <mailto:svn-src-all-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/svn-src-all>,
	<mailto:svn-src-all-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 30 Sep 2010 17:33:29 -0000

On Thu, 30 Sep 2010, Dimitry Andric wrote:

> On 2010-09-30 05:46, Bruce Evans wrote:
> ...
>> This file probably shouldn't exist, especially on amd64.  There are 4 or 5
>> versions of ldexp(), and this file implements what seems to be the worst
>> one, even without the bug.
>> ...
>
> The version in libc/gen/ldexp.c is just a copy of msun/src/s_scalbn.c,
> with some things like copysign() directly pasted in.  It even has:
>
> /* @(#)fdlibm.h 5.1 93/09/24 */
>
> at the top.

Bah, I missed this sixth version :-).
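For reference, the exponent-field technique used by fdlibm's s_scalbn.c (which libc/gen/ldexp.c copies) can be sketched in plain C roughly as follows. This is a simplified illustration written for this note, not the FreeBSD source. The name sketch_scalbn is hypothetical, IEEE 754 binary64 doubles are assumed, and only the general approach is meant to match: edit the biased exponent directly, and fall back to a multiply for subnormal results, overflow and underflow. No i387 instruction is involved.

```c
#include <stdint.h>
#include <string.h>

/* Constants in the style of fdlibm: 2^54, 2^-54, and values whose square
 * overflows or underflows. */
static const double two54 = 0x1p54, twom54 = 0x1p-54;
static const double huge = 1.0e300, tiny = 1.0e-300;

/* Hypothetical name: returns x * 2^n by editing the exponent field. */
double sketch_scalbn(double x, int n)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);          /* view the IEEE 754 bits */
    const double sign = (bits >> 63) ? -1.0 : 1.0;
    int k = (int)((bits >> 52) & 0x7ff);     /* biased exponent */

    if (k == 0) {                            /* zero or subnormal input */
        if ((bits << 1) == 0)
            return x;                        /* +-0 */
        x *= two54;                          /* normalize exactly ... */
        memcpy(&bits, &x, sizeof bits);
        k = (int)((bits >> 52) & 0x7ff) - 54; /* ... and undo it in k */
    }
    if (k == 0x7ff)
        return x + x;                        /* NaN or infinity */
    if (n < -50000)
        return sign * tiny * tiny;           /* certain underflow: +-0 */
    if (n > 50000 || k + n > 0x7fe)
        return sign * huge * huge;           /* overflow: +-inf */
    k += n;
    if (k > 0) {                             /* normal result: store new exponent */
        bits = (bits & ~(0x7ffULL << 52)) | ((uint64_t)k << 52);
        memcpy(&x, &bits, sizeof x);
        return x;
    }
    if (k <= -54)
        return sign * tiny * tiny;           /* underflow: +-0 */
    k += 54;                                 /* subnormal result: build x * 2^54 ... */
    bits = (bits & ~(0x7ffULL << 52)) | ((uint64_t)k << 52);
    memcpy(&x, &bits, sizeof x);
    return x * twom54;                       /* ... then scale down, rounding once */
}
```

For example, sketch_scalbn(3.0, -1) gives 1.5, and sketch_scalbn(1.0, -1074) gives the smallest subnormal, 0x1p-1074.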

>> Testing indicates that the fdlibm C version is 2.5 times faster than the
>> asm versions on amd64 on a core2 (ref9), while on i386 the C version is
>> only 1.5 times faster.  The C code is a bit larger so benefits more from
>> being called from a loop.  The asm code uses a slow i387 instruction, and
>> on amd64 it has to do expensive moves from xmm registers to i387 ones and
>> back.
>> 
>> Times for 100 million calls:
>> 
>>       amd64 libc ldexp:      3.18 seconds
>>       amd64 libm asm scalbn: 2.96
>>       amd64 libm C scalbn:   1.30
>>       i386  libc ldexp:      3.13
>>       i386  libm asm scalbn: 2.86
>>       i386  libm C scalbn:   2.11
>
> Seeing these results, I propose to just delete
> lib/libc/amd64/gen/ldexp.c and lib/libc/i386/gen/ldexp.c, which will
> cause the amd64 and i386 builds to automatically pick up
> lib/libc/gen/ldexp.c instead, which effectively is the fdlibm
> implementation.  (And no more clang workarounds needed. :)

I like this idea.

Does anyone have ideas for better testing?  The loop also benefits
machines with multiple pipelines and/or out-of-order execution.
Especially with the latter I think it is possible for several iterations
to be in progress at once (looks like an average of about 1.5 for
AthlonXP and later in other similar loop benchmarks).  In other
benchmarks I use a volatile variable to be more sure of defeating
unwanted compiler optimizations, but I don't want to enforce serialization
since non-benchmarks don't do that.  In libm functions, the largest
optimizations are from avoiding internal serialization as much as
possible.  Using the i387 functions tends to defeat this since there is
only 1 ALU for them (unlike for i387 addition, etc.; there are 2 ALUs
for that on AthlonXP and later).  Perhaps the i387 functions will be
relatively faster again someday when there are more ALUs for them and
better microcode in them, but x86 architects apparently consider this
a low priority and/or the microcode is too hard to make better than ordinary
instructions.
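One way to set up such a loop, assuming a hosted C99 environment, is sketched below. Each result is stored into a volatile variable so the calls cannot be optimized away, but no iteration depends on the previous one, so an out-of-order CPU is still free to overlap several calls. The names time_calls, baseline_fn and sink are hypothetical, and clock() measures processor time rather than wall-clock time.

```c
#include <time.h>

/* Every result is stored here; volatile keeps the compiler from deleting
 * the calls, without chaining one iteration's result into the next. */
static volatile double sink;

/* Same signature as ldexp() and scalbn(). */
typedef double (*scale_fn)(double, int);

/* Hypothetical do-nothing function for estimating loop and call overhead. */
double baseline_fn(double x, int n)
{
    (void)n;
    return x;
}

/* Calls fn ncalls times through a pointer (normally preventing inlining)
 * and returns the processor time used, in seconds. */
double time_calls(scale_fn fn, long ncalls)
{
    clock_t start = clock();

    for (long i = 0; i < ncalls; i++) {
        /* Vary the arguments so the call cannot be hoisted out of the loop. */
        sink = fn(1.0 + (double)(i & 0xff), (int)(i & 31) - 16);
    }
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

With <math.h> included, time_calls(ldexp, 100000000L) then corresponds to the 100 million call runs quoted above, and subtracting time_calls(baseline_fn, 100000000L) gives a rough estimate of the loop and call overhead.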

I think big functions using ordinary instructions are OK if they are
slightly faster than i387 functions, since if they aren't called much
then it doesn't matter and if they are called much then they will stay
cached.  But in the latter case, they will push other code out of caches;
I don't know how to quantify this.

Bruce