Date: Mon, 5 Sep 2016 01:56:48 +1000 (EST)
From: Bruce Evans
To: Konstantin Belousov
cc: src-committers@FreeBSD.org, svn-src-all@FreeBSD.org, svn-src-head@FreeBSD.org
Subject: Re: svn commit: r305382 - in head/lib/msun: amd64 i387
In-Reply-To: <20160904144859.GC83214@kib.kiev.ua>
Message-ID: <20160905012859.L6221@besplex.bde.org>
References: <201609041222.u84CMEdM033135@repo.freebsd.org> <20160904144859.GC83214@kib.kiev.ua>
List-Id: SVN commit messages for the src tree for head/-current

On Sun, 4 Sep 2016, Konstantin Belousov wrote:

> On Sun, Sep 04, 2016 at 12:22:14PM +0000, Bruce Evans wrote:
> ...
>> Log:
>>   Add asm versions of fmod(), fmodf() and fmodl() on amd64.  Add asm
>>   versions of fmodf() amd fmodl() on i387.
>> ...
> It seems that wrong version of i387/f_fmodf.S, it is identical to the
> amd64 version.

Indeed.  Fixed.

>> Added: head/lib/msun/amd64/e_fmod.S
>> ==============================================================================
>> --- /dev/null	00:00:00 1970	(empty, because file is newly added)
>> +++ head/lib/msun/amd64/e_fmod.S	Sun Sep  4 12:22:14 2016	(r305382)
>> +ENTRY(fmod)
>> +	movsd	%xmm0,-8(%rsp)
>> +	movsd	%xmm1,-16(%rsp)
>> +	fldl	-16(%rsp)
>> +	fldl	-8(%rsp)
>> +1:	fprem
>> +	fstsw	%ax
>> +	testw	$0x400,%ax
>> +	jne	1b
>> +	fstpl	-8(%rsp)
>> +	movsd	-8(%rsp),%xmm0
>> +	fstp	%st
>> +	ret
>> +END(fmod)
>
> I see that this is not a new approach in the amd64 subdirectory, to use
> x87 FPU on amd64.  Please note that it might have non-obvious effects on
> the performance, in particular, on the speed of the context switches and
> handling of #NM exception.

For long double functions, the i387 gets used anyway.

This function is very slow even with the i387.  It takes about 500 cycles
per call on args uniformly distributed in double precision space, but
this distribution is far from typical since it gives many huge args, and
the loop iterates many times on huge args.  This is still better than the
C code, which takes 3 or more times longer (more than 1500 cycles).  The
C code loops over the bits using integer code.  It is relatively even
slower when there are fewer bits (something like 9 times slower for args
uniformly distributed in float precision space).

> Newer Intel and possibly AMD CPUs have an optimization which allows
> coprocessor code to save and restore state to not save and restore state
> which was not changed.  In other words, for typical amd64 binary which
> uses %xmm register file but did not touched %st nor %ymm, only %xmm
> bits are spilled and then loaded.
> Touching %st defeats the optimization,
> possible for the whole lifetime of the thread.
>
> This feature (XSAVEOPT) is available at least starting from Haswell
> microarchitecture, not sure about IvyBridge.

Isn't the i386 space too small to matter much?  There should be the same
number of #NM's and just 100 bytes extra to save.  Avoiding use of larger
register sets by using only the i387 might save more :-).

The other amd64 asm uses of the i387 for floats and doubles are:
- 3 files for remainder and 3 files for remquo.  Needed for the same
   reason as for fmod.
- s_scalbn.S, s_scalbnf.S.  To use i387 fscale.  Probably a mistake.
   The functions themselves are also too slow to be very useful.  libm
   almost never uses them internally, and in optimized functions like
   exp* the exponent scaling is done inline using special integer code.
   I have spent many hours fighting the compiler to stop it pessimizing
   the memory accesses for this integer code, which gives pipeline
   stalls.  Using fscale probably tends to give another type of pipeline
   stall.

I plan to remove many more i387 uses on i386, but there aren't many more
on amd64.

Bruce
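
For readers unfamiliar with the inline exponent scaling mentioned above,
here is a minimal sketch in C.  It is not the actual libm code: the helper
names two_pow() and scale_by_pow2() are hypothetical, and it assumes IEEE
754 binary64 doubles and a k in the normal exponent range.  The idea is to
build 2**k directly from integer bits and multiply, instead of calling
scalbn() or using fscale.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return 2**k as a double, built from integer bits.
 * Assumes IEEE 754 binary64 and -1022 <= k <= 1023 (normal results only);
 * subnormals, overflow and other special cases are deliberately ignored.
 */
static double
two_pow(int k)
{
	/* The biased exponent occupies bits 52..62; sign and mantissa are 0. */
	uint64_t bits = (uint64_t)(k + 1023) << 52;
	double d;

	/* memcpy() is the portable way to reinterpret the bit pattern. */
	memcpy(&d, &bits, sizeof(d));
	return (d);
}

/* Hypothetical helper: scale x by 2**k with a single multiplication. */
static double
scale_by_pow2(double x, int k)
{
	return (x * two_pow(k));
}
```

For example, scale_by_pow2(1.5, 4) gives 24.0.  Real libm code also has
to handle results that overflow or become subnormal, which this sketch
leaves out.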