Date: Tue, 31 Jan 2017 19:47:53 -0800 From: Conrad Meyer <cem@freebsd.org> To: Bruce Evans <brde@optusnet.com.au> Cc: src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern Message-ID: <CAG6CVpX0Y1PSODjnr2aY%2BybM3rKfr-KF3qAMXGpG669fYG7WXQ@mail.gmail.com> In-Reply-To: <20170201123838.X1974@besplex.bde.org> References: <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com> <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org> <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com> <20170201123838.X1974@besplex.bde.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On Tue, Jan 31, 2017 at 7:16 PM, Bruce Evans <brde@optusnet.com.au> wrote: > Another reply to this... > > On Tue, 31 Jan 2017, Conrad Meyer wrote: > >> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrot= e: >>> >>> On Tue, 31 Jan 2017, Bruce Evans wrote: >>> I >>> think there should by no alignment on entry -- just assume the buffer i= s >>> aligned in the usual case, and only run 4% slower when it is misaligned= . >> >> >> Please write such a patch and demonstrate the improvement. > > > It is easy to demonstrate. I just put #if 0 around the early alignment > code. The result seem too good to be true, so maybe I missed some > later dependency on alignment of the addresses: > - for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with > alignment and 1.02 seconds without alignment. > This actually makes sense, 128 bytes can be done with 16 8-byte unaligned > crc32q's. The alignment code makes it do 15 * 8-but and (5 + 3) * 1-byte= . > 7 more 3-cycle instructions and overhead too is far more than the cost > of letting the CPU do read-combining. > - for 4096-byte buffers, the difference is insignificant (0.47 seconds fo= r > 10g. I believe it, especially for newer amd64. I seem to recall that older x86 machines had a higher misalignment penalty, but it was largely reduced in (?)Nehalem. Why don't you go ahead and commit that change? >> perhaps we could just remove the 3x8k table =E2=80=94 I'm not sure >> it adds any benefit over the 3x256 table. > > > Good idea, but the big table is useful. Ifdefing out the LONG case reduc= es > the speed for large buffers from ~0.35 seconds to ~0.43 seconds in the > setup below. Ifdefing out the SHORT case only reduces to ~0.39 seconds. Interesting. > I hoped that an even shorter SHORT case would work. I think it now handl= es > 768 bytes (3 * SHORT) in the inner loop. Right. > That is 32 sets of 3 crc32q's, > and I would have thought that update at the end would take about as long > as 1 iteration (3%), but it apparently takes 33%. The update at the end may be faster with PCLMULQDQ. There are versions of this algorithm that use that in place of the lookup-table combine (for example, Linux has a permissively licensed implementation here: http://lxr.free-electrons.com/source/arch/x86/crypto/crc32c-pcl-intel= -asm_64.S ). Unfortunately, PCLMULQDQ uses FPU state, which is inappropriate most of the time in kernel mode. It could be used opportunistically if the thread is already in FPU-save mode or if the input is "big enough" to make it worth it. >>> Your benchmarks mainly give results for the <=3D 768 bytes where most o= f >>> the manual optimizations don't apply. >> >> >> 0x000400: asm:68 intrins:62 multitable:684 (ns per buf) >> 0x000800: asm:132 intrins:133 (ns per buf) >> 0x002000: asm:449 intrins:446 (ns per buf) >> 0x008000: asm:1501 intrins:1497 (ns per buf) >> 0x020000: asm:5618 intrins:5609 (ns per buf) >> >> (All routines are in a separate compilation unit with no full-program >> optimization, as they are in the kernel.) > > > These seem slow. I modified my program to test the actual kernel code, > and get for 10gB on freefall's Xeon (main times in seconds): Freefall has a Haswell Xeon @ 3.3GHz. My laptop is a Sandy Bridge Core i5 @ 2.6 GHz. That may help explain the difference. Best, Conrad
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAG6CVpX0Y1PSODjnr2aY%2BybM3rKfr-KF3qAMXGpG669fYG7WXQ>