Date: Mon, 13 Feb 2017 01:27:16 -0800
From: Mark Millard <markmi@dsl-only.net>
To: Alexandre Martins <alexandre.martins@stormshield.eu>
Cc: freebsd-arm <freebsd-arm@freebsd.org>, Ian Lepore <ian@freebsd.org>
Subject: Re: bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail]
Message-ID: <8E5F8A15-2F79-4015-B93B-975D27308782@dsl-only.net>
In-Reply-To: <7424243.zp5tqGREgJ@pc-alex>
References: <5335118.oK1KXXDaG5@pc-alex> <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net> <7424243.zp5tqGREgJ@pc-alex>
On 2017-Feb-13, at 12:28 AM, Alexandre Martins <alexandre.martins at stormshield.eu> wrote:

> On Friday, 10 February 2017, 18:11:23, Mark Millard wrote:
>> On 2017-Feb-10, at 7:51 AM, Alexandre Martins <alexandre.martins at stormshield.eu> wrote:
>>> Hi all
>>>
>>> I see in the kernel code of bcopy/memmove an optimization if the copied
>>> regions don't overlap:
>>>
>>> https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=annotate#l403
>>>
>>> I'm a newbie in ARM assembly, so, sorry if I'm wrong, but
>>> - R0 is the dst (1st parameter)
>>> - R1 is the src (2nd parameter)
>>>
>>> So :
>>>     subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */
>>>     subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>>>     cmp r3, r2 /* if (r3 < len) we have an overlap */
>>>     bcc PIC_SYM(_C_LABEL(memcpy), PLT)
>>>
>>> Seems to be inverted. Should it be that ? :
>>>     subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
>>>     subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>>>     cmp r3, r2 /* if (r3 < len) we have an overlap */
>>>     bcs PIC_SYM(_C_LABEL(memcpy), PLT)
>>>
>>>
>>> Best regards
>>
>> I did not expect something so central that has been
>> around so long to look wrong but. . .
>>
>> Going through the supporting details of the original
>> code based on my looking around here is what I found:
>>
>> #include <string.h>
>> void *memmove(void *dest, const void *src, size_t n);
>>
>> So I'd expect (c'ish notation):
>> r0==dest
>> r1==src
>> r2==n
>>
>> Then for (the comments vs. code is being challenged):
>> (The comments do seem to give the intent correctly.)
>>
>>     cmp r0, r1
>>     RETeq /* Bail now if src/dst are the same */
>>     subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */
>>     subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>>     cmp r3, r2 /* if (r3 < len) we have an overlap */
>>     bcc PIC_SYM(_C_LABEL(memcpy), PLT)
>>     . . .
>>
>> cmp r0,r1 should result in condition codes (c'ish notation):
>>
>> eq=(r0==r1)
>> cc=(r0<r1)
>> cs=(r0>=r1)
>>
>> (Only the r0 position has no immediate-value alternative.)
>>
>>
>> subcc r3,r0,r1 is: if (cc) r3=r0-r1 // no condition code change
>> In other words: if (dst<src) r3=dst-src
>>
>> So it does not match the test listed in the comment as
>> far as I can see. And in (mathematical) integer arithmetic
>> the r3 result would be negative for dst-src. For
>> non-negative arithmetic (natural or whole): undefined.
>>
>>
>> subcs r3,r1,r0 is: if (cs) r3=r1-r0 // no condition code change
>> In other words: if (dst>=src) r3=src-dst
>>
>> So it does not match the test listed in the comment as
>> far as I can see. And in (mathematical) integer arithmetic
>> the r3 result would be nonpositive for src-dst. But the
>> earlier RETeq prevents the zero case, so: negative. For
>> non-negative arithmetic (natural or whole): undefined.
>>
>>
>> If it was only a normal mathematical context r3=-abs(dst-src)
>> would be a summary of the two sub instruction sequence as it
>> is from what I can tell.
>>
>> For the purpose the summary should be: r3=abs(dst-src), given
>> dst!=src . There is no need to wander outside normal
>> mathematical non-negative arithmetic in the process either.
>>
>> Your code change would have the right summary and use only
>> normal mathematical rules from what I can tell:
>>
>>     cmp r0, r1
>>     RETeq /* Bail now if src/dst are the same */
>>     subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */
>>     subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */
>>     cmp r3, r2 /* if (r3 < len) we have an overlap */
>>     bcs PIC_SYM(_C_LABEL(memcpy), PLT)
>>     . . .
>>
>> subcs r3,r0,r1 is: if (cs) r3=r0-r1 // no condition code change
>> In other words: if (dst>=src) r3=dst-src.
>> Given the prior RETeq, that is effectively: if (dst>src) r3=dst-src.
>> And that matches the comments and would produce a positive result
>> for r3, matching the normal mathematical result.
>>
>> subcc r3,r1,r0 is: if (cc) r3=r1-r0 // no condition code change
>> In other words: if (dst<src) r3=src-dst
>> And that matches the comments and would produce a positive result
>> for r3, matching the normal mathematical result.
>>
>> Overall summary of the two updated sub?? instructions:
>> r3=abs(dst-src), given dst!=src.
>>
>> And that would make for an appropriate comparison to n (i.e., to r2).
>>
>> It appears to have been as it is now since -r143175 when memmove was
>> added to sys/arm/arm/support.S (2005-Apr-12).
>>
>>
>> ===
>> Mark Millard
>> markmi at dsl-only.net
>
>
>
> Thank you for this deep analysis !
>
> I also ran some benchmarks. It seems that the "Xscale" version of
> memcpy/memmove is slower than the standard "ARM" one on my platform
> (armada388).
>
> I made the change by undefining _ARM_ARCH_5E.
> #define _ARM_ARCH_5E => Xscale version
> #undef _ARM_ARCH_5E => "ARM" version
>
> Here are my results:
>
> Block size: 2048
> memcpy  (Kernel ARM)    : 1028.7 MB/s
> memmove (Kernel ARM)    :  616.5 MB/s
> memcpy  (Kernel xscale) :  920.1 MB/s
> memmove (Kernel xscale) :  618.8 MB/s
>
> Block size: 128
> memcpy  (Kernel ARM)    : 1018.5 MB/s
> memmove (Kernel ARM)    :  668.4 MB/s
> memcpy  (Kernel xscale) :  825.9 MB/s
> memmove (Kernel xscale) :  668.6 MB/s
>
> Block size: 64
> memcpy  (Kernel ARM)    :  892.9 MB/s
> memmove (Kernel ARM)    :  667.2 MB/s
> memcpy  (Kernel xscale) :  721.2 MB/s
> memmove (Kernel xscale) :  668.2 MB/s
>
> Block size: 32
> memcpy  (Kernel ARM)    :  620.6 MB/s
> memmove (Kernel ARM)    :  634.6 MB/s
> memcpy  (Kernel xscale) :  504.9 MB/s
> memmove (Kernel xscale) :  634.5 MB/s
>
> Block size: 16
> memcpy  (Kernel ARM)    :  471.8 MB/s
> memmove (Kernel ARM)    :  464.5 MB/s
> memcpy  (Kernel xscale) :  254.5 MB/s
> memmove (Kernel xscale) :  464.7 MB/s
>
> As you can see, the smaller the size of the memcpy, the faster the
> standard (ARM) version of memcpy is relative to the Xscale one.
>
> In addition, the libc version suffers from the same problem, but is
> 15% more efficient.
>
> What can I do to help you on this point ?
>
> Best regards
> --
> Alexandre Martins
> STORMSHIELD

I recommend submitting your original discovery to bugzilla:

https://bugs.freebsd.org/bugzilla/enter_bug.cgi

I can submit the original find if you are not going to. (I would
likely not submit a benchmark report since I have not done such
benchmarking or analysis producing expected-performance estimates.)

I'm not a FreeBSD committer, nor an arm expert. I'm not working
on such issues as such. I just decided to look-up/study enough
material to confirm or deny what you had written, in part because
the notation looked like it could be easy to get the condition
code suffixes and argument order mismatched for sub??. (An
alternate fix is to reverse the operands in the two sub?? commands
but leave the cc and cs as they are, but then the comments would
also need to be updated to track.)

I had added Ian L. as a CC: because he is knowledgeable, active,
and a committer. If I messed up in my studying the material he
would likely catch my mistakes. If you (and my confirmation) are
right then he likely could fix the code in svn as well.

As the decision about when to call the code that can deal with
overlapping memory regions is wrong, the code that should only
be used for non-overlapping regions likely would handle some
overlapping regions and so would operate incorrectly in at
least some cases. In other words, I think the bug is worse than
just an example of being sub-optimal: the code is wrong from
what I can tell. (I've no clue if the code is ever put to use
for any bad cases.)

It is true that for the non-overlapping cases that are sent to
the more general, slower overlap-handling-capable code the
results would likely be worse than for the code designed for
non-overlapping-only.

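A c'ish sketch of the two sub?? pairs may make the comparison
concrete. It assumes uint32_t stands in for a 32-bit register and
that unsigned wraparound models what the subtraction leaves in r3;
the function names are made up for illustration and do not come
from support.S:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: uint32_t stands in for a 32-bit ARM register. */

/* As-is pair: subcc r3,r0,r1 ; subcs r3,r1,r0 (after cmp r0,r1 ; RETeq). */
static uint32_t r3_as_is(uint32_t dst, uint32_t src)
{
    assert(dst != src);              /* RETeq already handled dst == src */
    /* cc: dst < src (unsigned), cs: dst >= src */
    return (dst < src) ? (uint32_t)(dst - src)   /* wraps around */
                       : (uint32_t)(src - dst);  /* wraps around */
}

/* Proposed pair: subcs r3,r0,r1 ; subcc r3,r1,r0. */
static uint32_t r3_proposed(uint32_t dst, uint32_t src)
{
    assert(dst != src);
    return (dst >= src) ? (uint32_t)(dst - src)  /* abs(dst-src) */
                        : (uint32_t)(src - dst); /* abs(dst-src) */
}

/* Proposed branch: cmp r3,r2 ; bcs memcpy, i.e. taken when r3 >= n. */
static int takes_memcpy_path(uint32_t r3, uint32_t n)
{
    return r3 >= n;
}
```

Under those assumptions, dst=0x1000 and src=0x1010 give 0xfffffff0
for the as-is pair and 0x10 for the proposed pair, so only the
proposed value is a distance that makes sense to compare against n.
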
As for your benchmarks: I'm not sure if you benchmarked the
original code vs. your corrected code. The corrected code
would be the most interesting (presuming you [and my
confirmation] are correct).

If you submit to bugzilla, I'd suggest any benchmark reports
be submitted separately from the original issue with the
sub?? instructions.

The 32-bit arm's that I have access to are both Cortex-A7
based: a BPI-M3 and an RPI2, so armv7. I've not tried such
benchmarking on them. And I'm not likely to, at least not
any time soon.

===
Mark Millard
markmi at dsl-only.net