From owner-freebsd-arm@freebsd.org Mon Feb 13 09:27:20 2017 Return-Path: Delivered-To: freebsd-arm@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CE5CECDD88A for ; Mon, 13 Feb 2017 09:27:20 +0000 (UTC) (envelope-from markmi@dsl-only.net) Received: from asp.reflexion.net (outbound-mail-210-73.reflexion.net [208.70.210.73]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 8F60C17A8 for ; Mon, 13 Feb 2017 09:27:19 +0000 (UTC) (envelope-from markmi@dsl-only.net) Received: (qmail 20911 invoked from network); 13 Feb 2017 09:27:18 -0000 Received: from unknown (HELO mail-cs-02.app.dca.reflexion.local) (10.81.19.2) by 0 (rfx-qmail) with SMTP; 13 Feb 2017 09:27:18 -0000 Received: by mail-cs-02.app.dca.reflexion.local (Reflexion email security v8.30.0) with SMTP; Mon, 13 Feb 2017 04:27:18 -0500 (EST) Received: (qmail 22101 invoked from network); 13 Feb 2017 09:27:18 -0000 Received: from unknown (HELO iron2.pdx.net) (69.64.224.71) by 0 (rfx-qmail) with (AES256-SHA encrypted) SMTP; 13 Feb 2017 09:27:18 -0000 Received: from [192.168.1.111] (c-67-170-167-181.hsd1.or.comcast.net [67.170.167.181]) by iron2.pdx.net (Postfix) with ESMTPSA id 65E98EC7B39; Mon, 13 Feb 2017 01:27:17 -0800 (PST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 10.2 \(3259\)) Subject: Re: bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail] From: Mark Millard In-Reply-To: <7424243.zp5tqGREgJ@pc-alex> Date: Mon, 13 Feb 2017 01:27:16 -0800 Cc: freebsd-arm , Ian Lepore Content-Transfer-Encoding: quoted-printable Message-Id: <8E5F8A15-2F79-4015-B93B-975D27308782@dsl-only.net> References: <5335118.oK1KXXDaG5@pc-alex> <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net> <7424243.zp5tqGREgJ@pc-alex> To: Alexandre Martins X-Mailer: Apple Mail (2.3259) X-BeenThere: freebsd-arm@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Porting FreeBSD to ARM processors." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Feb 2017 09:27:20 -0000 On 2017-Feb-13, at 12:28 AM, Alexandre Martins wrote: > Le vendredi 10 f=C3=A9vrier 2017, 18:11:23 Mark Millard a =C3=A9crit : >> On 2017-Feb-10, at 7:51 AM, Alexandre Martins stormshield.eu> wrote: >>> Hi all >>>=20 >>> I see in the kernel code of bcopy/memmove an optimization if the = copied >>> region don't overlap: >>>=20 >>> = https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=3Dannotate= #l >>> 403 >>>=20 >>> I'm a newbie in ARM assembly, so, sorry if i'm wrong, but >>> - R0 is the dst (1st parameter) >>> - R1 is the src (2nd parameter) >>>=20 >>> So : >>> subcc r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ >>> subcs r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ >>> cmp r3, r2 /* if (r3 < len) we have an overlap */ >>> bcc PIC_SYM(_C_LABEL(memcpy), PLT) >>>=20 >>> Seems to be inverted. Should it be that ? : >>> subcs r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ >>> subcc r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ >>> cmp r3, r2 /* if (r3 < len) we have an overlap */ >>> bcs PIC_SYM(_C_LABEL(memcpy), PLT) >>>=20 >>>=20 >>> Best regards >>=20 >> I did not expect something so central that has been >> around so long to look wrong but. . . >>=20 >> Going through the supporting details of the original >> code based on my looking around here is what I found: >>=20 >> #include >> void *memmove(void *dest, const void *src, size_t n); >>=20 >> So I'd expect (c'ish notation): >> r0=3D=3Ddest >> r1=3D=3Dsrc >> r2=3D=3Dn >>=20 >> Then for (the comments vs. code is being challenged): >> (The comments do seem to give the intent correctly.) >>=20 >> cmp r0, r1 >> RETeq /* Bail now if src/dst are the same */ >> subcc r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ >> subcs r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ >> cmp r3, r2 /* if (r3 < len) we have an overlap */ >> bcc PIC_SYM(_C_LABEL(memcpy), PLT) >> . . . >>=20 >> cmp r0,r1 should result in condition code (c'ish notation): >>=20 >> eq=3D(r0=3D=3Dr1) >> cc=3D(r0> cs=3D(r0>=3Dr1) >>=20 >> (Only the r0 position has no immediate-value alternative.) >>=20 >>=20 >> subcc r3,r0,r1 is: if (cc) r3=3Dr0-r1 // no condition code change >> In other words: if (dst>=20 >> So it does not match the test listed in the comment as >> far as I can see. And in (mathematical) integer arithmetic >> the r3 result would be negative for dst-src. For >> non-negative arithmetic (natural or whole): undefined. >>=20 >>=20 >> subcs r3,r1,r0 is: if (cs) r3=3Dr1-r0 // no condition code change >> In other words: if (dst>=3Dsrc) r3=3Dsrc-dst >>=20 >> So it does not match the test listed in the comment as >> far as I can see. And in (mathematical) integer arithmetic >> the r3 result would be nonpositive for src-dst. But the >> earlier RETeq prevents the zero case, so: negative. For >> non-negative arithmetic (natural or whole): undefined. >>=20 >>=20 >> If it was only a normal mathemetical context r3=3D-abs(dst-src) >> would be a summary of the two sub instruction sequence as it >> is from what I can tell. >>=20 >> For the purpose the summary should be: r3=3Dabs(dst-src), given >> dst!=3Dsrc . There is no need to wonder outside normal >> mathematical non-negative arithmetic in the process either. >>=20 >> Your code change would have the right summary and use only >> normal mathematical rules from what I can tell: >>=20 >> cmp r0, r1 >> RETeq /* Bail now if src/dst are the same */ >> subcs r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ >> subcc r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ >> cmp r3, r2 /* if (r3 < len) we have an overlap */ >> bcs PIC_SYM(_C_LABEL(memcpy), PLT) >> . . . >>=20 >> subcs r3,r0,r1 is: if (cs) r3=3Dr0-r1 // no condition code change >> In other words: if (dst>=3Dsrc) r3=3Ddst-src. >> Given the prior RETeq, that is effectively: if (dst>src) r3=3Ddst-src. >> And that matches the comments and would produce a positive result >> for r3, matching the normal mathematical result. >>=20 >> subcc r3,r1,r0 is: if (cc) r3=3Dr1-r0 // no condition code change >> In other words: if (dst> And that matches the comments and would produce a positive result >> for r3, matching the normal mathematical result. >>=20 >> Overall summary of the two updated sub?? instructions: >> r3=3Dabs(dst-src), given dst!=3Dsrc. >>=20 >> And that would make for an appropriate comparison to n (i.e., to r2). >>=20 >> It appears to have been as it is now since -r143175 when memmove was >> added to sys/arm/arm/support.S (2005-Apr-12). >>=20 >>=20 >> =3D=3D=3D >> Mark Millard >> markmi at dsl-only.net >=20 >=20 >=20 > Thank you for this deep anaysis ! >=20 > I also made some benchmark. It seems that the "Xscale" version of=20 > memcpy/memmove is slower that the standard "ARM" on my platform = (armada388). >=20 > I do the change by undefine the _ARM_ARCH_5E. > #define _ARM_ARCH_5E =3D> Xscale version > #undef _ARM_ARCH_5E =3D> "ARM" version >=20 > There is my results: >=20 > Block size: 2048 > memcpy (Kernel ARM) : 1028.7 MB/s > memmove (Kernel ARM) : 616.5 MB/s > memcpy (Kernel xscale) : 920.1 MB/s > memmove (Kernel xscale) : 618.8 MB/s >=20 > Block size: 128 > memcpy (Kernel ARM) : 1018.5 MB/s > memmove (Kernel ARM) : 668.4 MB/s > memcpy (Kernel xscale) : 825.9 MB/s > memmove (Kernel xscale) : 668.6 MB/s >=20 > Block size: 64 > memcpy (Kernel ARM) : 892.9 MB/s > memmove (Kernel ARM) : 667.2 MB/s > memcpy (Kernel xscale) : 721.2 MB/s > memmove (Kernel xscale) : 668.2 MB/s >=20 > Block size: 32 > memcpy (Kernel ARM) : 620.6 MB/s > memmove (Kernel ARM) : 634.6 MB/s > memcpy (Kernel xscale) : 504.9 MB/s > memmove (Kernel xscale) : 634.5 MB/s >=20 > Block size; 16 > memcpy (Kernel ARM) : 471.8 MB/s > memmove (Kernel ARM) : 464.5 MB/s > memcpy (Kernel xscale) : 254.5 MB/s > memmove (Kernel xscale) : 464.7 MB/s >=20 > As you can see, more the size of the memcpy is small, more the = standard (ARM)=20 > version of memcpy is faster. >=20 > In addition, the libc version suffer the same problem, but is 15% more = efficiant. >=20 > What can I do to help you on this point ? >=20 > Best regards > --=20 > Alexandre Martins > STORMSHIELD I recommend submitting your original discovery to bugzilla: https://bugs.freebsd.org/bugzilla/enter_bug.cgi I can submit the original find if you are not going to. (I would likely not submit a benchmark report since I have not done such benchmarking or analysis producing expected-performance estimates.) I'm not a FreeBSD committer, nor an arm expert. I'm not working on such issues as such. I just decided to look-up/study enough material to confirm or deny what you had written, in part because the notation looked like it could be easy to get the condition code suffixes and argument order mismatched for sub??. (An alternate fix is to reverse the operands in the two sub?? commands but leave the cc and cs as they are --but then the comments would also need to be updated track.) I had added Ian L. as a CC: because he is knowledgable, active, and a committer. If I messed up in my studying the material he would likely catch my mistakes. If you (and my confirmation) are right then he likely could fix the code in svn as well. As the decision about when to call the code that can deal with overlapping memory regions is wrong, the code that should only be used for non-overlaping regions likely would handle some overlapping regions and so would operate incorrectly in at least some cases. In other words, I think the bug is worse than just an example of being sub-optimal: the code is wrong from what I can tell. (I've no clue if the code is ever put to use for any bad cases.) It is true that for the non-overlapping cases that are sent to the more general, slower overlap-handling-capable code the results would likely be worse than for the code designed for non-overlapping-only. As your your benchmarks: I'm not sure if you benchmarked the original code vs. your corrected code. The corrected code would be the most intersting (presuming you [and my confirmation] are correct). If you submit to bugzilla, I'd suggest any benchmark reports be submitted separately from the original issue with the sub?? instructions. The 32-bit arm's that I have access to are both Cortex-A7 based: a BPI-M3 and an RPI2, so armv7. I've not tried such benchmarking on them. And I'm not likely to, at least not any time soon. =3D=3D=3D Mark Millard markmi at dsl-only.net