Skip site navigation (1)Skip section navigation (2)
Date:      Mon, 13 Feb 2017 09:28:31 +0100
From:      Alexandre Martins <alexandre.martins@stormshield.eu>
To:        Mark Millard <markmi@dsl-only.net>
Cc:        freebsd-arm <freebsd-arm@freebsd.org>, Ian Lepore <ian@freebsd.org>
Subject:   Re: bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail]
Message-ID:  <7424243.zp5tqGREgJ@pc-alex>
In-Reply-To: <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net>
References:  <5335118.oK1KXXDaG5@pc-alex> <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net>

next in thread | previous in thread | raw e-mail | index | archive | help
--nextPart2442143.bXjVvZE6mh
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="iso-8859-1"

Le vendredi 10 f=E9vrier 2017, 18:11:23 Mark Millard a =E9crit :
> On 2017-Feb-10, at 7:51 AM, Alexandre Martins <alexandre.martins at=20=

stormshield.eu> wrote:
> > Hi all
> >=20
> > I see in the kernel code of bcopy/memmove an optimization if the co=
pied
> > region don't overlap:
> >=20
> > https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=3Da=
nnotate#l
> > 403
> >=20
> > I'm a newbie in ARM assembly, so, sorry if i'm wrong, but
> > - R0 is the dst (1st parameter)
> > - R1 is the src (2nd parameter)
> >=20
> > So :
> > subcc r3, r0, r1 /* if (dst > src) r3 =3D dst - src */
> > subcs r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */
> > cmp r3, r2 /* if (r3 < len) we have an overlap */
> > bcc PIC_SYM(_C_LABEL(memcpy), PLT)
> >=20
> > Seems to be inverted. Should it be that ? :
> > subcs r3, r0, r1 /* if (dst > src) r3 =3D dst - src */
> > subcc r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */
> > cmp r3, r2 /* if (r3 < len) we have an overlap */
> > bcs PIC_SYM(_C_LABEL(memcpy), PLT)
> >=20
> >=20
> > Best regards
>=20
> I did not expect something so central that has been
> around so long to look wrong but. . .
>=20
> Going through the supporting details of the original
> code based on my looking around here is what I found:
>=20
> #include=09<string.h>
> void *memmove(void *dest, const void *src, size_t n);
>=20
> So I'd expect (c'ish notation):
> r0=3D=3Ddest
> r1=3D=3Dsrc
> r2=3D=3Dn
>=20
> Then for (the comments vs. code is being challenged):
> (The comments do seem to give the intent correctly.)
>=20
> cmp     r0, r1
> RETeq           /* Bail now if src/dst are the same */
> subcc   r3, r0, r1      /* if (dst > src) r3 =3D dst - src */
> subcs   r3, r1, r0      /* if (src > dsr) r3 =3D src - dst */
> cmp     r3, r2          /* if (r3 < len) we have an overlap */
> bcc     PIC_SYM(_C_LABEL(memcpy), PLT)
> . . .
>=20
> cmp r0,r1 should result in condition code (c'ish notation):
>=20
> eq=3D(r0=3D=3Dr1)
> cc=3D(r0<r1)
> cs=3D(r0>=3Dr1)
>=20
> (Only the r0 position has no immediate-value alternative.)
>=20
>=20
> subcc r3,r0,r1 is: if (cc) r3=3Dr0-r1 // no condition code change
> In other words: if (dst<src) r3=3Ddst-src
>=20
> So it does not match the test listed in the comment as
> far as I can see. And in (mathematical) integer arithmetic
> the r3 result would be negative for dst-src. For
> non-negative arithmetic (natural or whole): undefined.
>=20
>=20
> subcs r3,r1,r0 is: if (cs) r3=3Dr1-r0 // no condition code change
> In other words: if (dst>=3Dsrc) r3=3Dsrc-dst
>=20
> So it does not match the test listed in the comment as
> far as I can see. And in (mathematical) integer arithmetic
> the r3 result would be nonpositive for src-dst. But the
> earlier RETeq prevents the zero case, so: negative. For
> non-negative arithmetic (natural or whole): undefined.
>=20
>=20
> If it was only a normal mathemetical context r3=3D-abs(dst-src)
> would be a summary of the two sub instruction sequence as it
> is from what I can tell.
>=20
> For the purpose the summary should be: r3=3Dabs(dst-src), given
> dst!=3Dsrc . There is no need to wonder outside normal
> mathematical non-negative arithmetic in the process either.
>=20
> Your code change would have the right summary and use only
> normal mathematical rules from what I can tell:
>=20
> cmp     r0, r1
> RETeq           /* Bail now if src/dst are the same */
> subcs r3, r0, r1 /* if (dst > src) r3 =3D dst - src */
> subcc r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */
> cmp r3, r2 /* if (r3 < len) we have an overlap */
> bcs PIC_SYM(_C_LABEL(memcpy), PLT)
> . . .
>=20
> subcs r3,r0,r1 is: if (cs) r3=3Dr0-r1 // no condition code change
> In other words: if (dst>=3Dsrc) r3=3Ddst-src.
> Given the prior RETeq, that is effectively: if (dst>src) r3=3Ddst-src=
.
> And that matches the comments and would produce a positive result
> for r3, matching the normal mathematical result.
>=20
> subcc r3,r1,r0 is: if (cc) r3=3Dr1-r0 // no condition code change
> In other words: if (dst<src) r3=3Dsrc-dst
> And that matches the comments and would produce a positive result
> for r3, matching the normal mathematical result.
>=20
> Overall summary of the two updated sub?? instructions:
> r3=3Dabs(dst-src), given dst!=3Dsrc.
>=20
> And that would make for an appropriate comparison to n (i.e., to r2).=

>=20
> It appears to have been as it is now since -r143175 when memmove was
> added to sys/arm/arm/support.S (2005-Apr-12).
>=20
>=20
> =3D=3D=3D
> Mark Millard
> markmi at dsl-only.net



Thank you for this deep anaysis !

I also made some benchmark. It seems that the "Xscale" version of=20
memcpy/memmove is slower that the standard "ARM" on my platform (armada=
388).

I do the change by undefine the _ARM_ARCH_5E.
#define _ARM_ARCH_5E =3D> Xscale version
#undef _ARM_ARCH_5E =3D> "ARM" version

There is my results:

Block size: 2048
memcpy (Kernel ARM)                  :   1028.7 MB/s
memmove (Kernel ARM)               :    616.5 MB/s
memcpy (Kernel xscale)               :    920.1 MB/s
memmove (Kernel xscale)            :    618.8 MB/s

Block size: 128
memcpy (Kernel ARM)                  :   1018.5 MB/s
memmove (Kernel ARM)               :    668.4 MB/s
memcpy (Kernel xscale)               :    825.9 MB/s
memmove (Kernel xscale)            :    668.6 MB/s

Block size: 64
memcpy (Kernel ARM)                  :    892.9 MB/s
memmove (Kernel ARM)               :    667.2 MB/s
memcpy (Kernel xscale)               :    721.2 MB/s
memmove (Kernel xscale)            :    668.2 MB/s

Block size: 32
memcpy (Kernel ARM)                  :    620.6 MB/s
memmove (Kernel ARM)               :    634.6 MB/s
memcpy (Kernel xscale)               :    504.9 MB/s
memmove (Kernel xscale)            :    634.5 MB/s

Block size; 16
memcpy (Kernel ARM)                  :    471.8 MB/s
memmove (Kernel ARM)               :    464.5 MB/s
memcpy (Kernel xscale)               :    254.5 MB/s
memmove (Kernel xscale)            :    464.7 MB/s

As you can see, more the size of the memcpy is small, more the standard=
 (ARM)=20
version of memcpy is faster.

In addition, the libc version suffer the same problem, but is 15% more =
efficiant.

What can I do to help you on this point ?

Best regards
=2D-=20
Alexandre Martins
STORMSHIELD


--nextPart2442143.bXjVvZE6mh
Content-Type: application/pkcs7-signature; name="smime.p7s"
Content-Disposition: attachment; filename="smime.p7s"
Content-Transfer-Encoding: base64

MIAGCSqGSIb3DQEHAqCAMIACAQExDzANBglghkgBZQMEAgEFADCABgkqhkiG9w0BBwEAAKCCCOUw
ggSGMIICbqADAgECAgUA28zw7TANBgkqhkiG9w0BAQsFADBIMQswCQYDVQQGEwJGUjEUMBIGA1UE
CgwLU1RPUk1TSElFTEQxIzAhBgNVBAMMGlN0b3Jtc2hpZWxkIFJvb3QgQXV0aG9yaXR5MB4XDTE0
MDkwNDE1MDcxMFoXDTI0MDkwMTE1MDcxMFowSTELMAkGA1UEBhMCRlIxFDASBgNVBAoMC1NUT1JN
U0hJRUxEMSQwIgYDVQQDDBtTdG9ybXNoaWVsZCBVc2VycyBBdXRob3JpdHkwggEiMA0GCSqGSIb3
DQEBAQUAA4IBDwAwggEKAoIBAQDChwWgC/6VWKL7jgWI3eA2sVvRdOwuHcXsRAAXVWdlMC0ygg7u
45E78GhAnpdl8QbIu7x/Q2zOq6KttspwDEIjkoMLTZngLLlGjYJZPfuSoC6hl9R7vRd5f8Fhu3v0
xQ/7vzKYz4C836IGCrk31gmrPO0H0fxkyxCMfhoTTzue3oXW1IsmQwCrOPOu2Y82QANDhbifWLjI
WJetnj58YRKR82KBs3Flbqxtp0mi9+IswMvCCRSoT+ORB73Cl6URt7Qm7BcD+qnkJ9uwlUC94dIl
T2hX4ybY/w/ssA17Ew418fgyRCWQXzgjZgZ/XUcw2WP9dIggA7Pg+c/xeROJH1zvAgMBAAGjdjB0
MB0GA1UdDgQWBBShbYRsooCFBXx8dXWANMETW5fXgTAfBgNVHSMEGDAWgBS4Qqn6Z0Twf9NhjOyl
x1CutL3sozAPBgNVHRMBAf8EBTADAQH/MA4GA1UdDwEB/wQEAwIBBjARBglghkgBhvhCAQEEBAMC
AQYwDQYJKoZIhvcNAQELBQADggIBAE6C9zkt2J6dPm2KLbzRS6rBIYZNFi0X59g3ekQ2Sc4UWsq+
B3L86j5xnQSRnIM/DKV1+Q2UHbU/qsh4cto2fwTV6V+aJ07Vu/bJE1rAN4AI4V26ytf7VoBcBjVZ
Jq8pHOMp/G2eQH7F1xqzml68DpKku66aUalkcC9IM82m7AW3YAyvDoYEAchv4qyL8qhVLLp6LNru
8ZOhMELhZLWl4ulw/SFDMhcBS6I4wC6icj71MLGSrr61vMktMdwQ+CGFQ5z5JbaxM61VgzKay8+g
lw+xTbpnitrDfhkzHs2fdwOOur3vtNnNsrdBWiYPseJ2k4VGD7ov5kITQZckmZyF/V+Ir//agJQG
VuwhDZCXgXOvrje+FLYp7tQ9pgSvLbluh1A+ywfyHnFI4n6tZy9SD3MIDgWR4KwFLM1Qmt3NQb32
tkq9Vm0jUcQXFfbnWKLA9krw3m8NmCqhL5PzpfOegYOc0QJWfMQamxeWxXMLk6uKisS//+VqfpCa
5Jx53t+9DmoN1+ob4jOprPaX6tfBBr5djah2yzPGjHEB52VgWXxIF9lCM2z7Qw+zFb2PIdNeSjIk
NEFg/1orKAAa5gQXAQynN2J7E+aLf2XLhHcS0v+9yoisPEw9+Tb5F1uQh+gzYD5JUUYcYWncnX8g
P8k6X+F5mQ/81IoNL/IejxJgy/LoMIIEVzCCAz+gAwIBAgIFAIUoy7swDQYJKoZIhvcNAQELBQAw
STELMAkGA1UEBhMCRlIxFDASBgNVBAoMC1NUT1JNU0hJRUxEMSQwIgYDVQQDDBtTdG9ybXNoaWVs
ZCBVc2VycyBBdXRob3JpdHkwHhcNMTYwOTAxMTUxMTA4WhcNMTcwOTAxMTUxMTA4WjBwMQswCQYD
VQQGEwJGUjEUMBIGA1UECgwLU1RPUk1TSElFTEQxGjAYBgNVBAMMEUFsZXhhbmRyZSBNQVJUSU5T
MS8wLQYJKoZIhvcNAQkBFiBhbGV4YW5kcmUubWFydGluc0BzdG9ybXNoaWVsZC5ldTCCASIwDQYJ
KoZIhvcNAQEBBQADggEPADCCAQoCggEBAMN+CnvE13jKEwJ+OyMzwBpC02dY+LpD5luJwnJTVnV2
9aUjEMI+xGFMMHd9kSIVInbk4WDe1ELOKerg0dzgnkRiOHECSGum1UhcZABxQgY2cmSffNQ6JVro
52UaBlt3aTOk3imYJCHUIGgOWMvOtRc8BxyBHdi15FZPj/F9I+AKufRFsBXUakplFIAPEwy3m2eR
a/eCMLqGJUyK7YmsAlEnYn2mA38zIoqtKvL6KPHtrV8fw1SRLQ13+j9nu1LlCaqhmLtILFxhV0/9
uDTvx5cKtZ8Xj1nPM6NUUrso9qlXwm4On6Y34pVTtnYGMQRuljil3Hiz84RJjPDJYRGwbgkCAwEA
AaOCAR0wggEZMB0GA1UdDgQWBBTmRLIwSfhNwbdfV13xt0G0JHYjPDAfBgNVHSMEGDAWgBShbYRs
ooCFBXx8dXWANMETW5fXgTAJBgNVHRMEAjAAMA4GA1UdDwEB/wQEAwID6DARBglghkgBhvhCAQEE
BAMCBLAwHQYDVR0lBBYwFAYIKwYBBQUHAwQGCCsGAQUFBwMCMEoGA1UdHwRDMEEwP6A9oDuGOWh0
dHBzOi8vcGtpLm5ldGFzcS5jb20vYXV0aC9jZXJ0aWZpY2F0ZXJldm9jYXRpb25saXN0LmNybDAR
BgNVHSAECjAIMAYGBFUdIAAwKwYDVR0RBCQwIoEgYWxleGFuZHJlLm1hcnRpbnNAc3Rvcm1zaGll
bGQuZXUwDQYJKoZIhvcNAQELBQADggEBALT9NWiAaE6nDev34vShhsyb9lWBOQfCnAMyKwtFy/cU
uIoHsxyOanIIQHz0ZtB76GCHDo7RStMyp6RYIefIsxABLhSr4hHapJka9g/X/nxexyr0xyT3IpYQ
dmyMSHRT18Z/ZaBlQdyfnS2PYkPHJAHl4iqB4SnQlh3rwFdKTJMgCz413cDxQHytgRPGTiXOhyV7
aS3ANJFha6ZHA8HU9sTslY8ZXSUu94iD3t2kcF3gBb432UKALwryKqnrwzFX68pFpqO5QAjEHaF6
6p1agMb2b3HlQGZrME5wSO6rsZJPYvJEyvrwHxCxjSTkOdPw6GriWGTMrVMU0fVrfptMS1gxggIT
MIICDwIBATBSMEkxCzAJBgNVBAYTAkZSMRQwEgYDVQQKDAtTVE9STVNISUVMRDEkMCIGA1UEAwwb
U3Rvcm1zaGllbGQgVXNlcnMgQXV0aG9yaXR5AgUAhSjLuzANBglghkgBZQMEAgEFAKCBkzAYBgkq
hkiG9w0BCQMxCwYJKoZIhvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNzAyMTMwODI4MzFaMCgGCSqG
SIb3DQEJDzEbMBkwCwYJYIZIAWUDBAECMAoGCCqGSIb3DQMHMC8GCSqGSIb3DQEJBDEiBCBPtmZr
zlzEZ1n86lp3L69T+qUFzB/lFpNI6DhE+cRMDjANBgkqhkiG9w0BAQEFAASCAQCEf5zQhj1rSPh0
ORcAe9GoYYGyXP0538OLfHfiWGFX2AuUTtka3zxkperwctKhl6pUwrDGPs3XqQWwwYRywk0RZL+s
1haTlNsJSsLLOqZVVKPdsd3jC0Me/As5pfOvQ6TReVvVfGbk3KxWyevWO7ClGEfpt+9PstxI9ggi
iZ1kyB3FlrcrcVCIpY4QGrrIk42S93Hi49+t7QKjaRq+1rKALsQombtsJsWPfxiOAxKLVdQTdiJ0
rfDRvhD87ljsfwLuMCNih1rnvTNDTZwZwu5feGMDJFPGfvTtqAJC6H0hM8ohM/dznnMFc6stlIq6
m8/Oy805sXn9uEtVSGR1+rdrAAAAAAAA

--nextPart2442143.bXjVvZE6mh--




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?7424243.zp5tqGREgJ>