Date: Wed, 1 Feb 2017 14:16:47 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Conrad Meyer <cem@freebsd.org>
Cc: Bruce Evans <brde@optusnet.com.au>, src-committers <src-committers@freebsd.org>, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: Re: svn commit: r313006 - in head: sys/conf sys/libkern sys/libkern/x86 sys/sys tests/sys/kern
Message-ID: <20170201123838.X1974@besplex.bde.org>
In-Reply-To: <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>
References: <201701310326.v0V3QW30024375@repo.freebsd.org> <20170131153411.G1061@besplex.bde.org> <CAG6CVpXW0Gx6GfxUz_4_u9cGFJdt2gOcGsuphbP9YjkyYMYU2g@mail.gmail.com> <20170131175309.N1418@besplex.bde.org> <20170201005009.E2504@besplex.bde.org> <CAG6CVpV34Ad=GvqqXdxPc8y2OO=f5GvR9auJXOXG9t9fARBi4Q@mail.gmail.com>

Another reply to this...

On Tue, 31 Jan 2017, Conrad Meyer wrote:

> On Tue, Jan 31, 2017 at 7:36 AM, Bruce Evans <brde@optusnet.com.au> wrote:
>> On Tue, 31 Jan 2017, Bruce Evans wrote:
>> Unrolling (or not) may be helpful or harmful for entry and exit code.
>
> Helpful, per my earlier benchmarks.
>
>> I
>> think there should be no alignment on entry -- just assume the buffer is
>> aligned in the usual case, and only run 4% slower when it is misaligned.
>
> Please write such a patch and demonstrate the improvement.

It is easy to demonstrate.  I just put #if 0 around the early alignment
code.  The results seem too good to be true, so maybe I missed some later
dependency on alignment of the addresses:
- for 128-byte buffers and misalignment of 3, 10g takes 1.48 seconds with
  alignment and 1.02 seconds without alignment.  This actually makes sense:
  128 bytes can be done with 16 8-byte unaligned crc32q's.  The alignment
  code makes it do 15 * 8-byte and (5 + 3) * 1-byte.  7 more 3-cycle
  instructions plus overhead is far more than the cost of letting the
  CPU do read-combining.
- for 4096-byte buffers, the difference is insignificant (0.47 seconds for
  10g).

>> I
>> don't understand the algorithm for joining crcs -- why doesn't it work
>> to reduce to 12 or 24 bytes in the main loop?
>
> It would, but I haven't implemented or tested that.  You're welcome to
> do so and demonstrate an improvement.  It does add more lookup table
> bloat, but perhaps we could just remove the 3x8k table -- I'm not sure
> it adds any benefit over the 3x256 table.

Good idea, but the big table is useful.  Ifdefing out the LONG case slows
large buffers from ~0.35 seconds to ~0.43 seconds in the setup below.
Ifdefing out the SHORT case only slows them to ~0.39 seconds.  I hoped that
an even shorter SHORT case would work.  I think it now handles 768 bytes
(3 * SHORT) in the inner loop.
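
The step counts quoted earlier in this message for the 128-byte buffer
with misalignment 3 can be checked with a small sketch.  This is not the
kernel code: crc32c_byte and crc32c_word are hypothetical software
stand-ins for crc32b and crc32q, and struct crc_steps and the other names
are made up for illustration.  It only counts how many 8-byte and 1-byte
steps each strategy issues and checks that both give the same CRC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counters: how many 8-byte and 1-byte CRC steps were issued. */
struct crc_steps {
	unsigned words;		/* stand-in for crc32q */
	unsigned bytes;		/* stand-in for crc32b */
};

/* Bitwise CRC32C (reflected polynomial 0x82F63B78) over one byte. */
static uint32_t
crc32c_byte(uint32_t crc, uint8_t b)
{
	crc ^= b;
	for (int i = 0; i < 8; i++)
		crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	return (crc);
}

/* Software stand-in for one crc32q: consumes 8 bytes at any address. */
static uint32_t
crc32c_word(uint32_t crc, const uint8_t *p)
{
	for (int i = 0; i < 8; i++)
		crc = crc32c_byte(crc, p[i]);
	return (crc);
}

/* No alignment prologue: 8-byte steps while possible, then a byte tail. */
static uint32_t
crc32c_unaligned(uint32_t crc, const uint8_t *p, size_t len,
    struct crc_steps *st)
{
	for (; len >= 8; p += 8, len -= 8, st->words++)
		crc = crc32c_word(crc, p);
	for (; len > 0; p++, len--, st->bytes++)
		crc = crc32c_byte(crc, *p);
	return (crc);
}

/* Same, but first consume single bytes until p is 8-byte aligned. */
static uint32_t
crc32c_aligning(uint32_t crc, const uint8_t *p, size_t len,
    struct crc_steps *st)
{
	for (; len > 0 && ((uintptr_t)p & 7) != 0; p++, len--, st->bytes++)
		crc = crc32c_byte(crc, *p);
	return (crc32c_unaligned(crc, p, len, st));
}

/* Run both variants on 128 bytes starting 3 bytes past an 8-byte boundary. */
static int
demo_128_misaligned_by_3(struct crc_steps *with, struct crc_steps *without)
{
	static _Alignas(8) uint8_t buf[136];
	uint32_t a, b;

	assert(((uintptr_t)buf & 7) == 0);
	for (size_t i = 0; i < sizeof(buf); i++)
		buf[i] = (uint8_t)(i * 31u + 7u);	/* arbitrary test data */
	with->words = with->bytes = 0;
	without->words = without->bytes = 0;
	a = crc32c_aligning(0xFFFFFFFFu, buf + 3, 128, with);
	b = crc32c_unaligned(0xFFFFFFFFu, buf + 3, 128, without);
	return (a == b);	/* both orders must give the same CRC */
}
```

Under these assumptions the aligning variant issues 15 word steps and 8
byte steps, against 16 word steps and no byte steps without the prologue,
which is the 7-step difference counted above.  The sketch says nothing
about timing; the cost of unaligned loads is a property of the hardware.
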
768 bytes is 32 sets of 3 crc32q's, and I would have thought that the
update at the end would take about as long as 1 iteration (3%), but it
apparently takes 33%.

>> ...
>> Your benchmarks mainly give results for the <= 768 bytes where most of
>> the manual optimizations don't apply.
>
> 0x000400: asm:68 intrins:62 multitable:684  (ns per buf)
> 0x000800: asm:132 intrins:133  (ns per buf)
> 0x002000: asm:449 intrins:446  (ns per buf)
> 0x008000: asm:1501 intrins:1497  (ns per buf)
> 0x020000: asm:5618 intrins:5609  (ns per buf)
>
> (All routines are in a separate compilation unit with no full-program
> optimization, as they are in the kernel.)

These seem slow.  I modified my program to test the actual kernel code,
and get for 10gB on freefall's Xeon (main times in seconds):

0x000008: asm(rm):3.41 asm(r):3.07 intrins:6.01 gcc:3.74  (3S = 2.4ns/buf)
0x000010: asm(rm):2.05 asm(r):1.70 intrins:2.92 gcc:2.62  (2S = 3.2ns/buf)
0x000020: asm(rm):1.63 asm(r):1.58 intrins:1.62 gcc:1.61  (1.6S = 5.12ns/buf)
0x000040: asm(rm):1.07 asm(r):1.11 intrins:1.06 gcc:1.14  (1.1S = 7.04ns/buf)
0x000080: asm(rm):1.02 asm(r):1.04 intrins:1.03 gcc:1.04  (1.02S = 13.06ns/buf)
0x000100: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.08  (1.02S = 52.22ns/buf)
0x000200: asm(rm):1.02 asm(r):1.02 intrins:1.02 gcc:1.02  (1.02S = 104.45ns/buf)
0x000400: asm(rm):0.58 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 116.43ns/buf)
0x001000: asm(rm):0.62 asm(r):0.57 intrins:0.57 gcc:0.57  (.57S = 233.44ns/buf)
0x002000: asm(rm):0.48 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 376.83ns/buf)
0x004000: asm(rm):0.49 asm(r):0.46 intrins:0.46 gcc:0.46  (.46S = 753.66ns/buf)
0x008000: asm(rm):0.49 asm(r):0.38 intrins:0.38 gcc:0.38  (.38S = 1245.18ns/buf)
0x010000: asm(rm):0.47 asm(r):0.38 intrins:0.36 gcc:0.38  (.36S = 2359.30ns/buf)
0x020000: asm(rm):0.43 asm(r):1.05 intrins:0.35 gcc:0.36  (.35S = 4587.52ns/buf)

asm(r) is a fix for clang's slowness with inline asms.
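
The asm(rm) and asm(r) columns differ only in the input operand constraint
of the inline asm.  A minimal sketch of the two forms follows.  The wrapper
names crc32q_rm and crc32q_r, the HAVE_CRC32Q_ASM macro and the software
reference crc32c_bits are hypothetical, not the kernel's identifiers, and
the asm is only built for x86-64 with a GCC-compatible compiler.  Both forms
must compute the same value; any speed difference comes from how the
compiler chooses to materialize the operand.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bitwise CRC32C reference (reflected polynomial 0x82F63B78) over the low
 * nbits of data, least significant bit first.  With nbits == 64 this is
 * what one crc32q computes on a little-endian 8-byte load.
 */
static uint32_t
crc32c_bits(uint32_t crc, uint64_t data, int nbits)
{
	for (int i = 0; i < nbits; i++) {
		uint32_t bit = (crc ^ (uint32_t)(data >> i)) & 1u;

		crc = (crc >> 1) ^ (0x82F63B78u & (0u - bit));
	}
	return (crc);
}

#if defined(__x86_64__) && defined(__GNUC__)
#define	HAVE_CRC32Q_ASM	1

/* "rm": the compiler may pass the source in a register or in memory. */
static inline uint64_t
crc32q_rm(uint64_t crc, uint64_t data)
{
	__asm__("crc32q %1, %0" : "+r" (crc) : "rm" (data));
	return (crc);
}

/* "r": the source is forced into a register before the instruction. */
static inline uint64_t
crc32q_r(uint64_t crc, uint64_t data)
{
	__asm__("crc32q %1, %0" : "+r" (crc) : "r" (data));
	return (crc);
}
#else
#define	HAVE_CRC32Q_ASM	0
#endif

/*
 * Return the CRC of one 8-byte word from the software reference and, when
 * the CPU has SSE4.2, assert that both asm variants agree with it.
 */
static uint32_t
crc32c_u64_checked(uint32_t crc, uint64_t data)
{
	uint32_t want = crc32c_bits(crc, data, 64);

#if HAVE_CRC32Q_ASM
	if (__builtin_cpu_supports("sse4.2")) {
		assert((uint32_t)crc32q_rm(crc, data) == want);
		assert((uint32_t)crc32q_r(crc, data) == want);
	}
#endif
	return (want);
}
```

The sketch only checks that the two constraints are interchangeable for
correctness.  Whether "r" is faster depends on the compiler's code
generation for the "rm" case, which is what the timings above measure.
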
Just change the constraint from "rm" to "r".  This takes an extra
register, but no more uops.

This is for the aligned case with no hacks.

intrins does something bad for small buffers.  Probably just the branch
over the dead unrolling.  Twice 2.4ns/buf for 8-byte buffers is still very
fast.  This is 16 cycles.  3 cycles to do 1 crc32q and the rest mainly for
1 function call and too many branches.

Bruce

From: Justin Hibbits <jhibbits@FreeBSD.org>
Date: Wed, 1 Feb 2017 03:29:14 +0000 (UTC)
To: src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject: svn commit: r313036 - in head/sys/powerpc: booke include
Message-Id: <201702010329.v113TEPn037471@repo.freebsd.org>

Author: jhibbits
Date: Wed Feb  1 03:29:13 2017
New Revision: 313036
URL: https://svnweb.freebsd.org/changeset/base/313036

Log:
  Add Book-E Enhanced Debug (E.D) profile debug support

  Freescale added the E.D profile to e500mc and derivative cores.  From
  Freescale's EREF reference manual this is enabled by a bit in HID0 and
  should otherwise default to traditional debug.  However, none of the
  Freescale cores support that bit, and instead always use E.D.  This results
  in kernel panics using the standard debug on e500mc+ cores.

  Enhanced debug allows debugging of interrupts, including critical
  interrupts, as it uses different save/restore registers (srr*).  At this
  time we don't use this ability, so instead share the core of the debug
  handler code between both handlers.

  MFC after:	3 weeks

Modified:
  head/sys/powerpc/booke/booke_machdep.c
  head/sys/powerpc/booke/trap_subr.S
  head/sys/powerpc/include/spr.h

Modified: head/sys/powerpc/booke/booke_machdep.c
==============================================================================
--- head/sys/powerpc/booke/booke_machdep.c	Wed Feb  1 02:42:45 2017	(r313035)
+++ head/sys/powerpc/booke/booke_machdep.c	Wed Feb  1 03:29:13 2017	(r313036)
@@ -187,6 +187,7 @@ extern void *int_watchdog;
 extern void *int_data_tlb_error;
 extern void *int_inst_tlb_error;
 extern void *int_debug;
+extern void *int_debug_ed;
 extern void *int_vec;
 extern void *int_vecast;
 #ifdef HWPMC_HOOKS
@@ -242,6 +243,7 @@ ivor_setup(void)
 	case FSL_E500mc:
 	case FSL_E5500:
 		SET_TRAP(SPR_IVOR7, int_fpu);
+		SET_TRAP(SPR_IVOR15, int_debug_ed);
 		break;
 	case FSL_E500v1:
 	case FSL_E500v2:

Modified: head/sys/powerpc/booke/trap_subr.S
==============================================================================
--- head/sys/powerpc/booke/trap_subr.S	Wed Feb  1 02:42:45 2017	(r313035)
+++ head/sys/powerpc/booke/trap_subr.S	Wed Feb  1 03:29:13 2017	(r313036)
@@ -794,6 +794,22 @@ interrupt_vector_top:
 INTERRUPT(int_debug)
 	STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_CSRR0, SPR_CSRR1)
 	FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_DEBUG)
+	bl	int_debug_int
+	FRAME_LEAVE(SPR_CSRR0, SPR_CSRR1)
+	rfci
+
+INTERRUPT(int_debug_ed)
+	STANDARD_CRIT_PROLOG(SPR_SPRG2, PC_BOOKE_CRITSAVE, SPR_DSRR0, SPR_DSRR1)
+	FRAME_SETUP(SPR_SPRG2, PC_BOOKE_CRITSAVE, EXC_DEBUG)
+	bl	int_debug_int
+	FRAME_LEAVE(SPR_DSRR0, SPR_DSRR1)
+	rfdi
+	/* .long 0x4c00004e */
+
+/* Internal helper for debug interrupt handling. */
+/* Common code between e500v1/v2 and e500mc-based cores. */
+int_debug_int:
+	mflr	%r14
 	GET_CPUINFO(%r3)
 	lwz	%r3, (PC_BOOKE_CRITSAVE+CPUSAVE_SRR0)(%r3)
 	bl	0f
@@ -819,7 +835,8 @@ INTERRUPT(int_debug)
 	mtspr	SPR_SRR0, %r3
 	lwz	%r4, (PC_BOOKE_CRITSAVE+CPUSAVE_SRR1+8)(%r4);
 	mtspr	SPR_SRR1, %r4
-	b	9f
+	mtlr	%r14
+	blr
 1:
 	addi	%r3, %r1, 8
 	bl	CNAME(trap)
@@ -828,10 +845,6 @@ INTERRUPT(int_debug)
 	 * We actually need to return to the process with an rfi.
 	 */
 	b	trapexit
-9:
-	FRAME_LEAVE(SPR_CSRR0, SPR_CSRR1)
-	rfci
-
 
 /*****************************************************************************
  * Common trap code

Modified: head/sys/powerpc/include/spr.h
==============================================================================
--- head/sys/powerpc/include/spr.h	Wed Feb  1 02:42:45 2017	(r313035)
+++ head/sys/powerpc/include/spr.h	Wed Feb  1 03:29:13 2017	(r313036)
@@ -671,6 +671,8 @@
 #define	SPR_CSRR1		0x03b	/* ..8  59 Critical SRR1 */
 #define	SPR_MCSRR0		0x23a	/* ..8 570 Machine check SRR0 */
 #define	SPR_MCSRR1		0x23b	/* ..8 571 Machine check SRR1 */
+#define	SPR_DSRR0		0x23e	/* ..8 574 Debug SRR0<E.ED> */
+#define	SPR_DSRR1		0x23f	/* ..8 575 Debug SRR1<E.ED> */
 #define	SPR_MMUCR		0x3b2	/* 4.. MMU Control Register */
 #define	  MMUCR_SWOA		(0x80000000 >> 7)