Date: Mon, 13 Feb 2017 09:28:31 +0100 From: Alexandre Martins <alexandre.martins@stormshield.eu> To: Mark Millard <markmi@dsl-only.net> Cc: freebsd-arm <freebsd-arm@freebsd.org>, Ian Lepore <ian@freebsd.org> Subject: Re: bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail] Message-ID: <7424243.zp5tqGREgJ@pc-alex> In-Reply-To: <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net> References: <5335118.oK1KXXDaG5@pc-alex> <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net>
index | next in thread | previous in thread | raw e-mail
[-- Attachment #1 --] Le vendredi 10 fvrier 2017, 18:11:23 Mark Millard a crit : > On 2017-Feb-10, at 7:51 AM, Alexandre Martins <alexandre.martins at stormshield.eu> wrote: > > Hi all > > > > I see in the kernel code of bcopy/memmove an optimization if the copied > > region don't overlap: > > > > https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=annotate#l > > 403 > > > > I'm a newbie in ARM assembly, so, sorry if i'm wrong, but > > - R0 is the dst (1st parameter) > > - R1 is the src (2nd parameter) > > > > So : > > subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ > > subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > > cmp r3, r2 /* if (r3 < len) we have an overlap */ > > bcc PIC_SYM(_C_LABEL(memcpy), PLT) > > > > Seems to be inverted. Should it be that ? : > > subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */ > > subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > > cmp r3, r2 /* if (r3 < len) we have an overlap */ > > bcs PIC_SYM(_C_LABEL(memcpy), PLT) > > > > > > Best regards > > I did not expect something so central that has been > around so long to look wrong but. . . > > Going through the supporting details of the original > code based on my looking around here is what I found: > > #include <string.h> > void *memmove(void *dest, const void *src, size_t n); > > So I'd expect (c'ish notation): > r0==dest > r1==src > r2==n > > Then for (the comments vs. code is being challenged): > (The comments do seem to give the intent correctly.) > > cmp r0, r1 > RETeq /* Bail now if src/dst are the same */ > subcc r3, r0, r1 /* if (dst > src) r3 = dst - src */ > subcs r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcc PIC_SYM(_C_LABEL(memcpy), PLT) > . . . > > cmp r0,r1 should result in condition code (c'ish notation): > > eq=(r0==r1) > cc=(r0<r1) > cs=(r0>=r1) > > (Only the r0 position has no immediate-value alternative.) > > > subcc r3,r0,r1 is: if (cc) r3=r0-r1 // no condition code change > In other words: if (dst<src) r3=dst-src > > So it does not match the test listed in the comment as > far as I can see. And in (mathematical) integer arithmetic > the r3 result would be negative for dst-src. For > non-negative arithmetic (natural or whole): undefined. > > > subcs r3,r1,r0 is: if (cs) r3=r1-r0 // no condition code change > In other words: if (dst>=src) r3=src-dst > > So it does not match the test listed in the comment as > far as I can see. And in (mathematical) integer arithmetic > the r3 result would be nonpositive for src-dst. But the > earlier RETeq prevents the zero case, so: negative. For > non-negative arithmetic (natural or whole): undefined. > > > If it was only a normal mathemetical context r3=-abs(dst-src) > would be a summary of the two sub instruction sequence as it > is from what I can tell. > > For the purpose the summary should be: r3=abs(dst-src), given > dst!=src . There is no need to wonder outside normal > mathematical non-negative arithmetic in the process either. > > Your code change would have the right summary and use only > normal mathematical rules from what I can tell: > > cmp r0, r1 > RETeq /* Bail now if src/dst are the same */ > subcs r3, r0, r1 /* if (dst > src) r3 = dst - src */ > subcc r3, r1, r0 /* if (src > dsr) r3 = src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcs PIC_SYM(_C_LABEL(memcpy), PLT) > . . . > > subcs r3,r0,r1 is: if (cs) r3=r0-r1 // no condition code change > In other words: if (dst>=src) r3=dst-src. > Given the prior RETeq, that is effectively: if (dst>src) r3=dst-src. > And that matches the comments and would produce a positive result > for r3, matching the normal mathematical result. > > subcc r3,r1,r0 is: if (cc) r3=r1-r0 // no condition code change > In other words: if (dst<src) r3=src-dst > And that matches the comments and would produce a positive result > for r3, matching the normal mathematical result. > > Overall summary of the two updated sub?? instructions: > r3=abs(dst-src), given dst!=src. > > And that would make for an appropriate comparison to n (i.e., to r2). > > It appears to have been as it is now since -r143175 when memmove was > added to sys/arm/arm/support.S (2005-Apr-12). > > > === > Mark Millard > markmi at dsl-only.net Thank you for this deep anaysis ! I also made some benchmark. It seems that the "Xscale" version of memcpy/memmove is slower that the standard "ARM" on my platform (armada388). I do the change by undefine the _ARM_ARCH_5E. #define _ARM_ARCH_5E => Xscale version #undef _ARM_ARCH_5E => "ARM" version There is my results: Block size: 2048 memcpy (Kernel ARM) : 1028.7 MB/s memmove (Kernel ARM) : 616.5 MB/s memcpy (Kernel xscale) : 920.1 MB/s memmove (Kernel xscale) : 618.8 MB/s Block size: 128 memcpy (Kernel ARM) : 1018.5 MB/s memmove (Kernel ARM) : 668.4 MB/s memcpy (Kernel xscale) : 825.9 MB/s memmove (Kernel xscale) : 668.6 MB/s Block size: 64 memcpy (Kernel ARM) : 892.9 MB/s memmove (Kernel ARM) : 667.2 MB/s memcpy (Kernel xscale) : 721.2 MB/s memmove (Kernel xscale) : 668.2 MB/s Block size: 32 memcpy (Kernel ARM) : 620.6 MB/s memmove (Kernel ARM) : 634.6 MB/s memcpy (Kernel xscale) : 504.9 MB/s memmove (Kernel xscale) : 634.5 MB/s Block size; 16 memcpy (Kernel ARM) : 471.8 MB/s memmove (Kernel ARM) : 464.5 MB/s memcpy (Kernel xscale) : 254.5 MB/s memmove (Kernel xscale) : 464.7 MB/s As you can see, more the size of the memcpy is small, more the standard (ARM) version of memcpy is faster. In addition, the libc version suffer the same problem, but is 15% more efficiant. What can I do to help you on this point ? Best regards -- Alexandre Martins STORMSHIELD [-- Attachment #2 --] 0 *H 010 `He 0 *H 00n 0 *H 0H10 UFR10U STORMSHIELD1#0!UStormshield Root Authority0 140904150710Z 240901150710Z0I10 UFR10U STORMSHIELD1$0"UStormshield Users Authority0"0 *H 0 X6[t.D Uge0-2;h@eȻClΫpB#M,FY=.{ya{2πߢ 7 <d~O;ޅԋ&C 8ُ6@CXX>|abqenmI, O&'۰@%OhW& {52D%_8#f]G0ct y\ v0t0Uml||uu4[ׁ0U#0BgDaP0U00U0 `HB0 *H N9-؞>m-K!M-7zD6IZʾr>q?u ?xr6_'NջZ7]V\5Y&))m@~^Qdp/H3ͦ`o⬋U,z,0Bdp!C2K8.r>0-1!C%3U2ϠMg~3͟wͲAZ&vF/BA$_ڀV! s7)=-nP>qH~g/Rs,PAJVm#QXJo */V|sĿj~y߽j 3]v3ƌqe`Y|HB3lC!^J2$4A`Z+( 7b{e˄wʈ<L=6[3`>IQFaiܝ ?:_yԊ /`0W0? (˻0 *H 0I10 UFR10U STORMSHIELD1$0"UStormshield Users Authority0 160901151108Z 170901151108Z0p10 UFR10U STORMSHIELD10UAlexandre MARTINS1/0- *H alexandre.martins@stormshield.eu0"0 *H 0 ~ {x~;#3BgXC[rSVuv#>aL0w}""v`B)Db8qHkH\d qB6rd|:%Ze[wi3)$! hXε<صVO}# EjJegk0%L퉬Q'b}3"*(_T- w?gR H,\aWO4Ǘ Y3TR(Wn7Sv1n8xIan 00UD0IM_W]A$v#<0U#0ml||uu4[ׁ0 U0 0U0 `HB0U%0++0JUC0A0?=;9https://pki.netasq.com/auth/certificaterevocationlist.crl0U 00U 0+U$0" alexandre.martins@stormshield.eu0 *H 5hhN ̛U92+Ejr@|f{`J2X!ȳ.ڤ|^*$"vlHtSeeAܟ-bC$*)ЖWJL >5@|N%·%{i-4akG앏]%.ݤp]7B/ *1WE@zZoq@fk0NpHObD$9jXḓSk~LKX100R0I10 UFR10U STORMSHIELD1$0"UStormshield Users Authority (˻0 `He 0 *H 1 *H 0 *H 1 170213082831Z0( *H 100 `He0 *H 0/ *H 1" Ofk\gYZw/SH8DL0 *H І=kHt9 {Ѩa\9Ë|wXaWN<drҡT°>שrMd J:UTݱC9Cy[|fܬV;GOH"dŖ+qPȓq߭iֲ.(l&ŏUv"tѾX0#bZ3CM_xc$S~B}!3!3sss-9yKUHdukhelp
Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7424243.zp5tqGREgJ>
