From owner-svn-src-head@FreeBSD.ORG Wed Nov 6 03:45:18 2013 Return-Path: Delivered-To: svn-src-head@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) (using TLSv1 with cipher ADH-AES256-SHA (256/256 bits)) (No client certificate requested) by hub.freebsd.org (Postfix) with ESMTP id DCCA7B8F; Wed, 6 Nov 2013 03:45:17 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au [211.29.132.53]) by mx1.freebsd.org (Postfix) with ESMTP id 852082CFA; Wed, 6 Nov 2013 03:45:17 +0000 (UTC) Received: from c122-106-156-23.carlnfd1.nsw.optusnet.com.au (c122-106-156-23.carlnfd1.nsw.optusnet.com.au [122.106.156.23]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 91B39D43509; Wed, 6 Nov 2013 14:44:58 +1100 (EST) Date: Wed, 6 Nov 2013 14:44:55 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: John-Mark Gurney Subject: Re: svn commit: r256132 - head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs In-Reply-To: <20131105195519.GX73243@funkthat.com> Message-ID: <20131106135214.I3083@besplex.bde.org> References: <201310080138.r981cPcT048197@svn.freebsd.org> <5278C496.4000002@FreeBSD.org> <20131105195519.GX73243@funkthat.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=bpB1Wiqi c=1 sm=1 tr=0 a=ebeQFi2P/qHVC0Yw9JDJ4g==:117 a=PO7r1zJSAAAA:8 a=5YL3BuiIPZQA:10 a=kj9zAlcOel0A:10 a=JzwRw_2MAAAA:8 a=c4KWZsxl0uEA:10 a=6I5d2MoRAAAA:8 a=c7uO3wGGmkYMksNKIE8A:9 a=CjuIK1q_8ugA:10 Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Xin LI , Andriy Gapon X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.14 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 06 Nov 2013 03:45:18 -0000 On Tue, 5 Nov 2013, John-Mark Gurney wrote: > Andriy Gapon wrote this message on Tue, Nov 05, 2013 at 12:12 +0200: >> on 08/10/2013 04:38 Xin LI said the following: >>> Author: delphij >>> Date: Tue Oct 8 01:38:24 2013 >>> New Revision: 256132 >>> URL: http://svnweb.freebsd.org/changeset/base/256132 >>> >>> Log: >>> Improve lzjb decompress performance by reorganizing the code >>> to tighten the copy loop. >>> >>> Submitted by: Denis Ahrens >>> MFC after: 2 weeks >>> Approved by: re (gjb) >>> >>> Modified: >>> head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c >>> >>> Modified: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c >>> ============================================================================== >>> --- head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c Mon Oct 7 22:30:03 2013 (r256131) >>> +++ head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lzjb.c Tue Oct 8 01:38:24 2013 (r256132) >>> @@ -117,7 +117,9 @@ lzjb_decompress(void *s_start, void *d_s >>> src += 2; >>> if ((cpy = dst - offset) < (uchar_t *)d_start) >>> return (-1); >>> - while (--mlen >= 0 && dst < d_end) >>> + if (mlen > (d_end - dst)) >>> + mlen = d_end - dst; >>> + while (--mlen >= 0) >>> *dst++ = *cpy++; >>> } else { >>> *dst++ = *src++; >>> >> >> Intuitively it might seem that this change is indeed an improvement. >> But given how "not dumb" (not sure if that always amounts to smart) the modern >> compilers are, has anyone actually measured if this change indeed improves the >> performance? >> >> Judging from the conversations on the ZFS mailing lists this change event hurt >> performance in some environments. >> Looks like the upstream is not going to take this change. >> >> So does it make any sense for us to make the code more obscure and different >> from upstream? Unless performance benefits could indeed be demonstrated. > > The old code compiles to: > 62c13: 49 f7 de neg %r14 > 62c16: 44 8d 58 ff lea -0x1(%rax),%r11d > 62c1a: 4c 89 d3 mov %r10,%rbx > 62c1d: 0f 1f 00 nopl (%rax) > 62c20: 42 8a 14 33 mov (%rbx,%r14,1),%dl > 62c24: 88 13 mov %dl,(%rbx) > 62c26: 48 ff c3 inc %rbx > 62c29: ff c8 dec %eax > 62c2b: 85 c0 test %eax,%eax > 62c2d: 7f f1 jg 62c20 > > The load/store line is addresses 62c20-62c29... So it looks like clang > is already doing the optimization that was added... It is unlikely to make a difference, but this is fairly bad source and object code: - byte at a time accesses. The loop just copies data. Even calling memcpy might be faster. Compilers used to inline memcpy too much, but they now know that they don't understand memory, so the call would probably be external, though it should be inline iff the size of the copy is small - the source code asks for, and gets a slow signed comparison. The compiler is apparently not able to optimized, and the extra instruction at 62c2b is needed. - rewrites to change the loop termination condition are often to change it from (--v >= 0) to (--v == 0). This isn't done here, so there is still the extra instruction at 62c2b. I don't like signed variables, this optimization often occurse automatically because size_t is used for byte counts and it is unsigned. mlen is apparently signed and not size_t, else the optimization would occur automatically (if you write ">= 0", this makes no sense for unsigned variables, and the compiler will translate it to "== 0". - oops, the extra instruction at 62c2b is not needed. The compiler could have used "subl $1,%eax" instead of "decl %eax" to set the sign flags. A minimal bytewise copying loop on 64-bit x86 looks like: 1: movb (%rsi,rcx,1),%al # from top down movb %al,(%rdi,rcx,1) decq %rcx jne 1b # end with last byte not copied ... # copy last byte now, or modfify end condition 4 instructions instead of 6. But this probably makes little difference on OOE x86. This can use subl too, to allow using jg so as not to need an extra step to copy the last byte. Using decl is a good optimization for space on 32-bit values, but IIRC decq is not a single byte instruction so using it is not such a good optimization for space and using subl loses less space. This already copies from the top down so as to allow an end condition of nearly %rcx == 0. Perhaps the pointers and counters can be adjusted a bit more to get to exactly %rcx == 0, but this probably has more setup overheads. Getting the end condition to be exactly '--v == 0' tends to take more C code too. Compilers know about this problem, and sometimes generate an extra step outside of the loop even if you structure the loop unnaturally to try to avoid this. The loop could also be unrolled a bit. ISTR noticing clang doing automatic unrolling. Perhaps this is the main thing affected by manual optimizations. They might be harmful because the confuse the compiler into not doing automatic unrolling, or harmful because they help the compiler do automatic unrolling when it shouldn't, or helpful... In my copy benchmarks, I didn't find reducing the loop overhead by adjusting only the count register to be very useful, since loop overhead is easy to hide in the src/dst access overhead. When there are cache misses for src/dst, loop overhead is in the noise, so all of this is only interesting for the L1-cached case. It usually takes only minor (bit often some) unrolling to amortize the loop overhead in the L1-cached case. Bruce