From owner-freebsd-numerics@freebsd.org Sun Sep 8 05:55:51 2019
Message-ID: <174BDDD122964DA9AD32D77663AB863D@H270>
From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
Subject: Shorter releng/12.0/lib/msun/i387/s_remquo.S, releng/12.0/lib/msun/amd64/s_remquo.S, ...
Date: Sun, 8 Sep 2019 07:52:46 +0200
Organization: Me, myself & IT
List-Id: "Discussions of high quality implementation of libm functions."
Hi,

here's a patch to shave 4 instructions (and about 25% code size) from

http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquo.S
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquof.S
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquol.S
http://sources.freebsd.org/releng/12.0/lib/msun/amd64/s_remquo.S
http://sources.freebsd.org/releng/12.0/lib/msun/amd64/s_remquof.S
http://sources.freebsd.org/releng/12.0/lib/msun/amd64/s_remquol.S

Especially the negation is rather clumsy:

1. the 2 shifts by 16 to propagate the sign to all bits can be replaced
   with a single shift by 31, or with a CLTD alias CDQ (which is 2 bytes
   shorter);

2. the conversion of -1 to +1 via AND and its addition can be replaced
   by subtraction of -1.

The minor differences between the code for the float, double and long
double as well as the i387 and amd64 implementations are intended; pick
the variant you like best. I prefer and recommend the variant with
3 ADC and 2 SHL instructions used for the i387 double-precision function
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquo.S, which
comes first.

stay tuned
Stefan Kanthak

PS: if you ever need to run these functions on a CPU without barrel
shifter, replace the first SHL or ROR with BT $14,%eax and the second
SHL or ROL with BT $9,%eax ... and hope that BT doesn't use a slow
shift under the hood.

--- -/releng/12.0/lib/msun/i387/s_remquo.S
+++ +/releng/12.0/lib/msun/i387/s_remquo.S
@@ -34,1 +34,2 @@
 ENTRY(remquo)
+	xorl	%ecx,%ecx
@@ -42,22 +43,17 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	adcl	%ecx,%ecx
+	shll	$18,%eax
+	adcl	%ecx,%ecx
+	shll	$5,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	16(%esp),%ecx
-	xorl	8(%esp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	16(%esp),%eax
+	xorl	8(%esp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	20(%esp),%ecx
-	movl	%eax,(%ecx)
+	movl	20(%esp),%eax
+	movl	%ecx,(%eax)
 	ret
 END(remquo)

--- -/releng/12.0/lib/msun/i387/s_remquof.S
+++ +/releng/12.0/lib/msun/i387/s_remquof.S
@@ -42,22 +42,18 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	sbbl	%ecx,%ecx
+	negl	%ecx
+	shll	$18,%eax
+	adcl	%ecx,%ecx
+	shll	$5,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0. Avoid using an unpredictable branch. */
-	movl	8(%esp),%ecx
-	xorl	4(%esp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	8(%esp),%eax
+	xorl	4(%esp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	12(%esp),%ecx
-	movl	%eax,(%ecx)
+	movl	12(%esp),%eax
+	movl	%ecx,(%eax)
 	ret
 END(remquof)

--- -/releng/12.0/lib/msun/i387/s_remquol.S
+++ +/releng/12.0/lib/msun/i387/s_remquol.S
@@ -42,22 +42,19 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	setc	%cl
+	movzbl	%cl,%ecx
+	shll	$18,%eax
+	adcl	%ecx,%ecx
+	shll	$5,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0. Avoid using an unpredictable branch. */
-	movl	24(%esp),%ecx
-	xorl	12(%esp),%ecx
-	movsx	%cx,%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	24(%esp),%eax
+	xorl	12(%esp),%eax
+	cwtl
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return.
 */
-	movl	28(%esp),%ecx
-	movl	%eax,(%ecx)
+	movl	28(%esp),%eax
+	movl	%ecx,(%eax)
 	ret
+END(remquol)

--- -/releng/12.0/lib/msun/amd64/s_remquo.S
+++ +/releng/12.0/lib/msun/amd64/s_remquo.S
@@ -34,1 +35,2 @@
 ENTRY(remquo)
+	xorl	%ecx,%ecx
@@ -44,19 +45,14 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	adcl	%ecx,%ecx
+	rorl	$15,%eax
+	adcl	%ecx,%ecx
+	roll	$6,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0. Avoid using an unpredictable branch. */
-	movl	-12(%rsp),%ecx
-	xorl	-4(%rsp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	-12(%rsp),%eax
+	xorl	-4(%rsp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	%eax,(%rdi)
+	movl	%ecx,(%rdi)

--- -/releng/12.0/lib/msun/amd64/s_remquof.S
+++ +/releng/12.0/lib/msun/amd64/s_remquof.S
@@ -44,19 +44,15 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	sbbl	%ecx,%ecx
+	negl	%ecx
+	rorl	$15,%eax
+	adcl	%ecx,%ecx
+	roll	$6,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0. Avoid using an unpredictable branch. */
-	movl	-8(%rsp),%ecx
-	xorl	-4(%rsp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	-8(%rsp),%eax
+	xorl	-4(%rsp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	%eax,(%rdi)
+	movl	%ecx,(%rdi)

--- -/releng/12.0/lib/msun/amd64/s_remquol.S
+++ +/releng/12.0/lib/msun/amd64/s_remquol.S
@@ -42,21 +42,18 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1.
 */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	setc	%cl
+	movzbl	%cl,%ecx
+	rorl	$15,%eax
+	adcl	%ecx,%ecx
+	roll	$6,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0. Avoid using an unpredictable branch. */
-	movl	32(%rsp),%ecx
-	xorl	16(%rsp),%ecx
-	movsx	%cx,%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	32(%rsp),%eax
+	xorl	16(%rsp),%eax
+	cwtl
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	%eax,(%rdi)
+	movl	%ecx,(%rdi)
 	ret
+END(remquol)

From owner-freebsd-numerics@freebsd.org Sun Sep 8 14:45:30 2019
Message-ID: <769CF9CBA0A34DFA92C739C970FA2AAF@H270>
From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
Subject: Shorter releng/12.0/lib/msun/i387/e_exp.S and releng/12.0/lib/msun/i387/s_finite.S
Date: Sun, 8 Sep 2019 16:37:03 +0200
Organization: Me, myself & IT
Hi,

here's a patch to remove a conditional branch (and more) from
http://sources.freebsd.org/releng/12.0/lib/msun/i387/e_exp.S
plus a patch to shave some bytes (immediate operands) from
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_finite.S

stay tuned
Stefan Kanthak

--- -/releng/12.0/lib/msun/i387/e_exp.S
+++ +/releng/12.0/lib/msun/i387/e_exp.S
@@ -45,7 +45,25 @@
 	movl	8(%esp),%eax
-	andl	$0x7fffffff,%eax
-	cmpl	$0x7ff00000,%eax
-	jae	x_Inf_or_NaN
+	leal	(%eax+%eax),%edx
+	cmpl	$0xffe00000,%edx
+	jb	finite
+	/*
+	 * Return 0 if x is -Inf. Otherwise just return x; when x is Inf
+	 * this gives Inf, and when x is a NaN this gives the same result
+	 * as (x + x) (x quieted).
+	 */
+	cmpl	4(%esp),$0
+	sbbl	$0xfff00000,%eax
+	je	minus_inf
+
+nan:	fldl	4(%esp)
+	ret
+
+minus_inf:
+	fldz
+	ret
+
+finite:
+	fldl	4(%esp)
+
@@ -80,19 +98,3 @@
 	ret
-
-x_Inf_or_NaN:
-	/*
-	 * Return 0 if x is -Inf. Otherwise just return x; when x is Inf
-	 * this gives Inf, and when x is a NaN this gives the same result
-	 * as (x + x) (x quieted).
-	 */
-	cmpl	$0xfff00000,8(%esp)
-	jne	x_not_minus_Inf
-	cmpl	$0,4(%esp)
-	jne	x_not_minus_Inf
-	fldz
-	ret
-
-x_not_minus_Inf:
-	fldl	4(%esp)
-	ret
 END(exp)

--- -/releng/12.0/lib/msun/i387/s_finite.S
+++ +/releng/12.0/lib/msun/i387/s_finite.S
@@ -39,8 +39,8 @@
 ENTRY(finite)
 	movl	8(%esp),%eax
-	andl	$0x7ff00000, %eax
-	cmpl	$0x7ff00000, %eax
+	addl	%eax, %eax
+	cmpl	$0xffe00000, %eax
 	setneb	%al
-	andl	$0x000000ff, %eax
+	movzbl	%al, %eax
 	ret
 END(finite)

From owner-freebsd-numerics@freebsd.org Tue Sep 10 15:19:37 2019
Date: Wed, 11 Sep 2019 01:19:29 +1000 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Stefan Kanthak
cc: freebsd-numerics@freebsd.org, freebsd-hackers@freebsd.org
Subject: Re: Shorter releng/12.0/lib/msun/i387/e_exp.S and releng/12.0/lib/msun/i387/s_finite.S
In-Reply-To: <769CF9CBA0A34DFA92C739C970FA2AAF@H270>
Message-ID: <20190910230930.Q1373@besplex.bde.org>
References: <769CF9CBA0A34DFA92C739C970FA2AAF@H270>
On Sun, 8 Sep 2019, Stefan Kanthak wrote:

I recently got diagnosed as having serious medical problems and am not
sure if I care about this...

> here's a patch to remove a conditional branch (and more) from
> http://sources.freebsd.org/releng/12.0/lib/msun/i387/e_exp.S
> plus a patch to shave some bytes (immediate operands) from
> http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_finite.S

Anyway, don't bother with these functions. They should never have been
written in asm and should go away.
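The branch-free negation in the remquo patches above boils down to a standard identity: smear the sign bit of x^y into a mask of 0 or -1 (what CLTD leaves in %edx), then (q ^ mask) - mask yields q or -q. A minimal C sketch of the idea (the helper name is mine, and it assumes arithmetic right shift on signed integers, as on x86):

```c
#include <stdint.h>

/* Negate the quotient bits q exactly when x and y have opposite
 * signs.  sign_xor is the XOR of the high words of x and y, so its
 * sign bit is set iff x*y < 0. */
static int32_t negate_if_opposite(int32_t q, int32_t sign_xor)
{
    int32_t mask = sign_xor >> 31;  /* 0 or -1; assumes arithmetic shift */
    return (q ^ mask) - mask;       /* q when mask==0, -q when mask==-1 */
}
```

This mirrors the xorl/subl pair in the patched code; the replaced sequence computed the same thing with two 16-bit shifts, an AND and an ADD.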
Improving the mod and remainder functions is more useful and difficult,
since they are in asm on amd64 too and there seems to be no better way
to implement them on all x86 than to use the i387, but they are still
slow.

> --- -/releng/12.0/lib/msun/i387/e_exp.S
> +++ +/releng/12.0/lib/msun/i387/e_exp.S

This went away in my version in 2012 or 2013, together with implementing
the long double hyperbolic functions. My version uses the same algorithm
in all precisions for the hyperbolic functions, but only the long double
version was committed (in 2013). The uncommitted parts are faster and
more accurate. The same methods work relatively trivially for exp() and
expf(), except they are insignificantly faster than the better C version
after improving the accuracy of that to be slightly worse than the asm
version. I gave up on plans to use the same algorithm in all precisions
for exp*(). The long double version is too sophisticated to be fast,
after developments in x86 CPUs and compilers made the old Sun C versions
fast.

Summary of implementations of exp*() on x86:

- expf(): use the same C version on amd64 and i386 (Cygnus translation
  of the Sun version with some FreeBSD optimizations). This is fast and
  is currently a little less accurate than it should be.

- exp(): use the C version on amd64 (Sun version with some FreeBSD
  optimizations). This is fast and is currently a little less accurate
  than it should be. Use the asm version on i386. This is slow since it
  switches the rounding precision. It needs the 11 extra bits of
  precision to barely deliver a double precision result to within 1 ulp.

> @@ -45,7 +45,25 @@
>  	movl	8(%esp),%eax
> -	andl	$0x7fffffff,%eax
> -	cmpl	$0x7ff00000,%eax
> -	jae	x_Inf_or_NaN
> +	leal	(%eax+%eax),%edx
> +	cmpl	$0xffe00000,%edx

This removes 1 instruction and 1 dependency, not a branch. Seems
reasonable. I would try to do it all in %eax. Check what compilers do
for the C version of finite(), where this check is clearer and easier
to optimize (see below).
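Bruce's suggestion to compare what compilers emit for the C version of the finiteness test can be tried with the two equivalent forms under discussion. A sketch assuming the usual IEEE-754 binary64 high-word layout (helper names are mine, not libm's):

```c
#include <stdint.h>
#include <string.h>

/* High 32 bits of a double, assuming IEEE-754 binary64. */
static uint32_t hi_word(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);  /* well-defined type pun */
    return (uint32_t)(bits >> 32);
}

/* Mask-and-compare form, as in the original asm: finite iff the
 * biased exponent field is not all ones. */
static int finite_mask(double x)
{
    return (hi_word(x) & 0x7ff00000u) != 0x7ff00000u;
}

/* Doubled-word form, as in the patch: shifting the sign bit out
 * lets one unsigned compare test the exponent field.  Note that it
 * needs a below/above comparison, not an equality test (Bruce's
 * point about setne below). */
static int finite_dbl(double x)
{
    return hi_word(x) * 2u < 0xffe00000u;
}
```

Both forms agree on all inputs; which compiles to better code on i386 is exactly the experiment Bruce proposes.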
All of this can be written in C with about 1 line of inline asm, and
then compilers can generate better code.

> +	jb	finite

This seems to pessimize the branch logic in all cases (as would be done
in C by getting __predict_mumble() backwards). The branches were
carefully optimized (hopefully not backwards) for the i386 and i486,
and this happens to be best for later CPUs too. Taken branches are
slower on old CPUs, so the code was arranged to not branch in the usual
(finite) case. Newer CPUs only use static branch prediction for the
first branch, so the branch organization rarely matters except in large
code (not like here) where moving the unusual case far away is good for
caching. The static prediction is usually that the first forward branch
is not taken while the first backward branch is taken. So the forward
branch to the non-finite case was accidentally correct.

> +	/*
> +	 * Return 0 if x is -Inf. Otherwise just return x; when x is Inf
> +	 * this gives Inf, and when x is a NaN this gives the same result
> +	 * as (x + x) (x quieted).
> +	 */
> +	cmpl	4(%esp),$0
> +	sbbl	$0xfff00000,%eax
> +	je	minus_inf
> +
> +nan:	fldl	4(%esp)
> +	ret
> +
> +minus_inf:
> +	fldz
> +	ret
> +
> +finite:
> +	fldl	4(%esp)
> +
> @@ -80,19 +98,3 @@
>  	ret
> -
> -x_Inf_or_NaN:
> -	/*
> -	 * Return 0 if x is -Inf. Otherwise just return x; when x is Inf
> -	 * this gives Inf, and when x is a NaN this gives the same result
> -	 * as (x + x) (x quieted).
> -	 */
> -	cmpl	$0xfff00000,8(%esp)
> -	jne	x_not_minus_Inf
> -	cmpl	$0,4(%esp)
> -	jne	x_not_minus_Inf
> -	fldz
> -	ret
> -
> -x_not_minus_Inf:
> -	fldl	4(%esp)
> -	ret

Details not checked. Space/time efficiency doesn't matter in the
non-finite case. But see s_expl.c, where the magic expression (-1 / x)
is used for the return value to optimize for space (it avoids branches,
but the division is slow).
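The non-finite dispatch that both versions of this hunk implement is easy to state in C; x + x is what quiets a signaling NaN, and only -Inf maps to 0. A hedged sketch (hypothetical function name, not the libm source):

```c
#include <math.h>

/* exp() on non-finite x: exp(-Inf) = 0, exp(+Inf) = +Inf, and
 * exp(NaN) returns that NaN quieted via x + x.  Finite x would fall
 * through to the normal argument-reduction path, elided here. */
static double exp_nonfinite(double x)
{
    if (isnan(x))
        return x + x;             /* quiets a signaling NaN */
    return x < 0.0 ? 0.0 : x;     /* -Inf -> 0, +Inf -> +Inf */
}
```

Bruce mentions s_expl.c's (-1 / x) expression as a branch-avoiding way to produce a special-case return value, at the cost of a slow division.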
> END(exp)
>
> --- -/releng/12.0/lib/msun/i387/s_finite.S
> +++ +/releng/12.0/lib/msun/i387/s_finite.S

This function has several layers of reasons to not exist. It seems to
be only a Sun extension to C90. It is not declared in <math.h>, but
exists in libm as namespace pollution to support old ABIs. C99 has the
better API isfinite(), which is type-generic. I thought that this was
usually inlined. Actually, it seems to be implemented by calling
__isfinite(), and not this finite(). libm also has finite() in C. Not
inlining this and/or having no way to know if it is efficiently inlined
makes it unusable in optimized code.

> @@ -39,8 +39,8 @@
>  ENTRY(finite)
>  	movl	8(%esp),%eax
> -	andl	$0x7ff00000, %eax
> -	cmpl	$0x7ff00000, %eax
> +	addl	%eax, %eax
> +	cmpl	$0xffe00000, %eax

This doesn't reduce the number of instructions or dependencies, so it
is less worth doing than the similar changes above.

>  	setneb	%al

This is now broken, since setneb is only correct after masking out the
unimportant bits.

> -	andl	$0x000000ff, %eax
> +	movzbl	%al, %eax
>  	ret

Old bug: the extra instructions to avoid the branch might be a
pessimization on all CPUs:
- perhaps cmov is best on newer CPUs, but it is unportable
- the extra instructions, and possibly movz instead of and, are slower
  on old CPUs, while branch prediction is fast for the usual case on
  newer CPUs.

> END(finite)

Check what compilers generate for the C versions of finite() and
__isfinite() with -fomit-frame-pointer -march=mumble (especially i386)
and __predict_mumble(). The best code (better than the above) is for
finite(). Oops, it is only gcc-4.2.1 that generates very bad code for
__isfinite(). s_finite.c uses masks, and compilers don't reorganize
this much. s_isfinite.c uses hard-coded bit-fields which some compilers
don't optimize very well. Neither does the above, or the standard
access macros using bit-fields -- they tend to produce store-to-load
mismatches. Well, I finally found where this is inlined.
Use __builtin_isfinite() instead of isfinite(). Then gcc generates a
libcall to __builtin_isfinite(), while clang generates inline code
which is much larger and often slower than any of the above, but it at
least avoids store-to-load mismatches and doesn't misclassify long
doubles in unsupported formats as finite when they are actually NaNs.
It also generates exceptions for signaling NaNs in some cases, which is
arguably wrong. The fpclassify and isfinite, etc., macros in <math.h>
are already too complicated, but not nearly complicated enough to decide
if the corresponding builtins should be used.

Bruce
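For reference, the C99 interface Bruce contrasts with finite() is type-generic: the same <math.h> macros accept float, double and long double. A small sketch (the numeric codes are mine, for illustration only):

```c
#include <math.h>

/* C99 type-generic classification, unlike the legacy finite().
 * Returns 0 for finite values (including zeros and subnormals),
 * 1 for +/-Inf, 2 for NaN. */
static int classify_code(double x)
{
    if (isfinite(x))
        return 0;
    return isnan(x) ? 2 : 1;
}
```

Whether isfinite() expands to a builtin, a libcall, or inline bit tests is exactly the compiler-dependent behaviour discussed above.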