From owner-svn-src-head@freebsd.org Wed Feb 1 10:42:41 2017 Return-Path: Delivered-To: svn-src-head@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0FC75CC9A80; Wed, 1 Feb 2017 10:42:41 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 998CA16D1; Wed, 1 Feb 2017 10:42:40 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c122-106-153-191.carlnfd1.nsw.optusnet.com.au [122.106.153.191]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 1FCEE1046AD9; Wed, 1 Feb 2017 21:42:36 +1100 (AEDT) Date: Wed, 1 Feb 2017 21:42:35 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Conrad Meyer cc: src-committers , svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern In-Reply-To: Message-ID: <20170201203814.O953@besplex.bde.org> References: <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org> <20170201123838.X1974@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=BKLDlBYG c=1 sm=1 tr=0 a=Tj3pCpwHnMupdyZSltBt7Q==:117 a=Tj3pCpwHnMupdyZSltBt7Q==:17 a=kj9zAlcOel0A:10 a=PO7r1zJSAAAA:8 a=uP1ucDPQAAAA:8 a=sAtkCSKGeD0KkG_lp3sA:9 a=CjuIK1q_8ugA:10 a=Oa0T6EYmKFNB-xRHvYM1:22 a=9a9ggB8z3XFZH39hjkD6:22 X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 01 Feb 2017 10:42:41 -0000 On Tue, 31 Jan 2017, Conrad Meyer wrote: > On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans wrote: >> Another reply to this... >> >> On Tue, 31 Jan 2017, Conrad Meyer wrote: >> >>> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans wrote: >>>> >>>> On Tue, 31 Jan 2017, Bruce Evans wrote: >>>> I >>>> think there should by no alignment on entry -- just assume the buffer is >>>> aligned in the usual case, and only run 4% slower when it is misaligned. >>> >>> Please write such a patch and demonstrate the improvement. >> >> It is easy to demonstrate. I just put #if 0 around the early alignment >> code. The result seem too good to be true, so maybe I missed some >> later dependency on alignment of the addresses: >> - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with >> alignment and 1.02 seconds without alignment. >> This actually makes sense, 128 bytes can be done with 16 8-byte unaligned >> crc32q's. The alignment code makes it do 15 * 8-but and (5 + 3) * 1-byte. >> 7 more 3-cycle instructions and overhead too is far more than the cost >> of letting the CPU do read-combining. >> - for 4096-byte buffers, the difference is insignificant (0.47 seconds for >> 10g. > > I believe it, especially for newer amd64. I seem to recall that older > x86 machines had a higher misalignment penalty, but it was largely > reduced in (?)Nehalem. Why don't you go ahead and commit that change? Needs more work. Here are fairly clean patches for the unportabilities. I can't really test these in the kernel. Merge the asms with yours if yours are better in parts. X Index: conf/files.amd64 X =================================================================== X --- conf/files.amd64 (revision 313007) X +++ conf/files.amd64 (working copy) X @@ -536,6 +536,9 @@ X isa/vga_isa.c optional vga X kern/kern_clocksource.c standard X kern/link_elf_obj.c standard X +libkern/x86/crc32_sse42.c standard X +libkern/memmove.c standard X +libkern/memset.c standard X # X # IA32 binary support X # X @@ -593,14 +596,6 @@ X compat/ndis/subr_usbd.c optional ndisapi pci X compat/ndis/winx64_wrap.S optional ndisapi pci X # X -crc32_sse42.o standard \ X - dependency "$S/libkern/x86/crc32_sse42.c" \ X - compile-with "${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \ X - no-implicit-rule \ X - clean "crc32_sse42.o" X -libkern/memmove.c standard X -libkern/memset.c standard X -# X # x86 real mode BIOS emulator, required by dpms/pci/vesa X # X compat/x86bios/x86bios.c optional x86bios | dpms | pci | vesa This also fixes unsorting by crc32_sse42.o and nearby unsorting by all other libkern lines (just 2) in files.amd64. This file is still grossly unsorted near the end. X Index: conf/files.i386 X =================================================================== X --- conf/files.i386 (revision 313007) X +++ conf/files.i386 (working copy) X @@ -554,11 +554,6 @@ X kern/imgact_aout.c optional compat_aout X kern/imgact_gzip.c optional gzip X kern/subr_sfbuf.c standard X -crc32_sse42.o standard \ X - dependency "$S/libkern/x86/crc32_sse42.c" \ X - compile-with "${CC} -c ${CFLAGS:N-nostdinc} ${WERROR} ${PROF} -msse4 ${.IMPSRC}" \ X - no-implicit-rule \ X - clean "crc32_sse42.o" X libkern/divdi3.c standard X libkern/ffsll.c standard X libkern/flsll.c standard X @@ -569,6 +564,7 @@ X libkern/ucmpdi2.c standard X libkern/udivdi3.c standard X libkern/umoddi3.c standard X +libkern/x86/crc32_sse42.c standard X i386/xbox/xbox.c optional xbox X i386/xbox/xboxfb.c optional xboxfb X dev/fb/boot_font.c optional xboxfb This also fixes unsorting by crc32_sse42.o. files.i386 is not as unsorted as files.i386. The rules for sorting generated files are unclear. They are not in source directories so are nor naturally grouped. Old ones are mostly placed near the beginning. Some are sorted on a pathname of the source file(s) in scattered places in the make rules. crc32_sse42.o is partly like that. It was sorted with libkern sources but not by the x86 part. X Index: libkern/x86/crc32_sse42.c X =================================================================== X --- libkern/x86/crc32_sse42.c (revision 313007) X +++ libkern/x86/crc32_sse42.c (working copy) X @@ -31,15 +31,41 @@ X */ X #ifdef USERSPACE_TESTING X #include X +#include Fix dependency on namespace pollution in for at least size_t. X #else X #include X +#include X #include X -#include X -#include X #endif Fix order of systm.h and not using its standard pollution. X X -#include X +static __inline uint32_t X +_mm_crc32_u8(uint32_t x, uint8_t y) X +{ X + /* X + * clang (at least 3.9.[0-1]) pessimizes "rm" (y) and "m" (y) X + * significantly and "r" (y) a lot by copying y to a different X + * local variable (on the stack or in a register), so only use X + * the latter. This costs a register and an instruction but X + * not a uop. X + */ X + __asm("crc32b %1,%0" : "+r" (x) : "r" (y)); X + return (x); X +} X X +static __inline uint32_t X +_mm_crc32_u32(uint32_t x, int32_t y) X +{ X + __asm("crc32l %1,%0" : "+r" (x) : "r" (y)); X + return (x); X +} X + X +static __inline uint64_t X +_mm_crc32_u64(uint64_t x, int64_t y) X +{ X + __asm("crc32q %1,%0" : "+r" (x) : "r" (y)); X + return (x); X +} X + X /* CRC-32C (iSCSI) polynomial in reversed bit order. */ X #define POLY 0x82f63b78 X >> I hoped that an even shorter SHORT case would work. I think it now handles >> 768 bytes (3 * SHORT) in the inner loop. > > Right. > >> That is 32 sets of 3 crc32q's, >> and I would have thought that update at the end would take about as long >> as 1 iteration (3%), but it apparently takes 33%. > > The update at the end may be faster with PCLMULQDQ. There are > versions of this algorithm that use that in place of the lookup-table > combine (for example, Linux has a permissively licensed implementation > here: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel-asm_64.S > ). > > Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most > of the time in kernel mode. It could be used opportunistically if the > thread is already in FPU-save mode or if the input is "big enough" to > make it worth it. That rules it out for most kernel purposes. I will try to optimize the multiplication. >>> 0x000400: asm:68 intrins:62 multitable:684 (ns per buf) >>> 0x000800: asm:132 intrins:133 (ns per buf) >>> 0x002000: asm:449 intrins:446 (ns per buf) >>> 0x008000: asm:1501 intrins:1497 (ns per buf) >>> 0x020000: asm:5618 intrins:5609 (ns per buf) >>> >>> (All routines are in a separate compilation unit with no full-program >>> optimization, as they are in the kernel.) >> >> These seem slow. I modified my program to test the actual kernel code, >> and get for 10gB on freefall's Xeon (main times in seconds): > > Freefall has a Haswell Xeon @ 3.3GHz. My laptop is a Sandy Bridge > Core i5 @ 2.6 GHz. That may help explain the difference. I thought that Sandy Bridge would be almost as fast here. ref11-i386's lesser Xeon is 5.5 times slower in cycles for a block size of 128K and 3.7 times slower for a block size of 128. i386 is naturally twice as slow. ref11-amd64's lesser Xeon has the same speed as freefall. So there must be a bug in the i386 code... it is simply that gcc doesn't optimize the buggy uint64_t type for the crcs in the 32-bit case, but clang does. After fixing this, the i386 case is only slightly more than twice as slow. Untested fix: Y Index: libkern/x86/crc32_sse42.c Y =================================================================== Y --- libkern/x86/crc32_sse42.c (revision 313007) Y +++ libkern/x86/crc32_sse42.c (working copy) Y @@ -190,7 +216,11 @@ Y const size_t align = 4; Y #endif Y const unsigned char *next, *end; Y - uint64_t crc0, crc1, crc2; /* need to be 64 bits for crc32q */ Y +#ifdef __amd64__ Y + uint64_t crc0, crc1, crc2; Y +#else Y + uint32_t crc0, crc1, crc2; Y +#endif Y Y next = buf; Y crc0 = crc; Bruce