From owner-freebsd-numerics@freebsd.org Sun Sep 8 05:55:51 2019
From: "Stefan Kanthak" <stefan.kanthak@nexgo.de>
Subject: Shorter releng/12.0/lib/msun/i387/s_remquo.S, releng/12.0/lib/msun/amd64/s_remquo.S, ...
Date: Sun, 8 Sep 2019 07:52:46 +0200
Organization: Me, myself & IT
List-Id: "Discussions of high quality implementation of libm functions."
Hi,

here's a patch to shave 4 instructions (and about 25% code size) from

http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquo.S
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquof.S
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquol.S
http://sources.freebsd.org/releng/12.0/lib/msun/amd64/s_remquo.S
http://sources.freebsd.org/releng/12.0/lib/msun/amd64/s_remquof.S
http://sources.freebsd.org/releng/12.0/lib/msun/amd64/s_remquol.S

The negation in particular is rather clumsy:

1. the two shifts by 16 that propagate the sign into all bits can be
   replaced with a single shift by 31, or with a CLTD (alias CDQ),
   which is 2 bytes shorter;

2. the conversion of -1 to +1 via AND plus addition can be replaced
   by a subtraction of -1.

The minor differences between the float, double and long double
variants, as well as between the i387 and amd64 implementations, are
intentional; pick the variant you like best. I prefer and recommend
the variant with 3 ADC and 2 SHL instructions used for the i387
double-precision function
http://sources.freebsd.org/releng/12.0/lib/msun/i387/s_remquo.S,
which comes first.

stay tuned
Stefan Kanthak

PS: if you ever need to run these functions on a CPU without a barrel
shifter, replace the first SHL or ROR with BT $14,%eax and the second
SHL or ROL with BT $9,%eax ... and hope that BT doesn't use a slow
shift under the hood.

--- -/releng/12.0/lib/msun/i387/s_remquo.S
+++ +/releng/12.0/lib/msun/i387/s_remquo.S
@@ -34,1 +34,2 @@
 ENTRY(remquo)
+	xorl	%ecx,%ecx
@@ -42,22 +43,17 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	adcl	%ecx,%ecx
+	shll	$18,%eax
+	adcl	%ecx,%ecx
+	shll	$5,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	16(%esp),%ecx
-	xorl	8(%esp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	16(%esp),%eax
+	xorl	8(%esp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	20(%esp),%ecx
-	movl	%eax,(%ecx)
+	movl	20(%esp),%eax
+	movl	%ecx,(%eax)
 	ret
 END(remquo)

--- -/releng/12.0/lib/msun/i387/s_remquof.S
+++ +/releng/12.0/lib/msun/i387/s_remquof.S
@@ -42,22 +42,18 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	sbbl	%ecx,%ecx
+	negl	%ecx
+	shll	$18,%eax
+	adcl	%ecx,%ecx
+	shll	$5,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	8(%esp),%ecx
-	xorl	4(%esp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	8(%esp),%eax
+	xorl	4(%esp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	12(%esp),%ecx
-	movl	%eax,(%ecx)
+	movl	12(%esp),%eax
+	movl	%ecx,(%eax)
 	ret
 END(remquof)

--- -/releng/12.0/lib/msun/i387/s_remquol.S
+++ +/releng/12.0/lib/msun/i387/s_remquol.S
@@ -42,22 +42,19 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	setc	%cl
+	movzbl	%cl,%ecx
+	shll	$18,%eax
+	adcl	%ecx,%ecx
+	shll	$5,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	24(%esp),%ecx
-	xorl	12(%esp),%ecx
-	movsx	%cx,%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	24(%esp),%eax
+	xorl	12(%esp),%eax
+	cwtl
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	28(%esp),%ecx
-	movl	%eax,(%ecx)
+	movl	28(%esp),%eax
+	movl	%ecx,(%eax)
 	ret
+END(remquol)

--- -/releng/12.0/lib/msun/amd64/s_remquo.S
+++ +/releng/12.0/lib/msun/amd64/s_remquo.S
@@ -34,1 +35,2 @@
 ENTRY(remquo)
+	xorl	%ecx,%ecx
@@ -44,19 +45,14 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	adcl	%ecx,%ecx
+	rorl	$15,%eax
+	adcl	%ecx,%ecx
+	roll	$6,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	-12(%rsp),%ecx
-	xorl	-4(%rsp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	-12(%rsp),%eax
+	xorl	-4(%rsp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	%eax,(%rdi)
+	movl	%ecx,(%rdi)

--- -/releng/12.0/lib/msun/amd64/s_remquof.S
+++ +/releng/12.0/lib/msun/amd64/s_remquof.S
@@ -44,19 +44,15 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	sbbl	%ecx,%ecx
+	negl	%ecx
+	rorl	$15,%eax
+	adcl	%ecx,%ecx
+	roll	$6,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	-8(%rsp),%ecx
-	xorl	-4(%rsp),%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	-8(%rsp),%eax
+	xorl	-4(%rsp),%eax
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	%eax,(%rdi)
+	movl	%ecx,(%rdi)

--- -/releng/12.0/lib/msun/amd64/s_remquol.S
+++ +/releng/12.0/lib/msun/amd64/s_remquol.S
@@ -42,21 +42,18 @@
 /* Extract the three low-order bits of the quotient from C0,C3,C1. */
-	shrl	$6,%eax
-	movl	%eax,%ecx
-	andl	$0x108,%eax
-	rorl	$7,%eax
-	orl	%eax,%ecx
-	roll	$4,%eax
-	orl	%ecx,%eax
-	andl	$7,%eax
+	setc	%cl
+	movzbl	%cl,%ecx
+	rorl	$15,%eax
+	adcl	%ecx,%ecx
+	roll	$6,%eax
+	adcl	%ecx,%ecx
 /* Negate the quotient bits if x*y<0.
    Avoid using an unpredictable branch. */
-	movl	32(%rsp),%ecx
-	xorl	16(%rsp),%ecx
-	movsx	%cx,%ecx
-	sarl	$16,%ecx
-	sarl	$16,%ecx
-	xorl	%ecx,%eax
-	andl	$1,%ecx
-	addl	%ecx,%eax
+	movl	32(%rsp),%eax
+	xorl	16(%rsp),%eax
+	cwtl
+	cltd
+	xorl	%edx,%ecx
+	subl	%edx,%ecx
 /* Store the quotient and return. */
-	movl	%eax,(%rdi)
+	movl	%ecx,(%rdi)
 	ret
+END(remquol)