From owner-freebsd-arm@freebsd.org Mon Feb 13 08:27:41 2017 Return-Path: Delivered-To: freebsd-arm@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 283D9CDC815 for ; Mon, 13 Feb 2017 08:27:41 +0000 (UTC) (envelope-from alexandre.martins@stormshield.eu) Received: from work.stormshield.eu (gwlille.netasq.com [91.212.116.1]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id CDFC51B7A; Mon, 13 Feb 2017 08:27:39 +0000 (UTC) (envelope-from alexandre.martins@stormshield.eu) Received: from work.stormshield.eu (localhost [127.0.0.1]) by work.stormshield.eu (Postfix) with ESMTPS id 87A0B3760C51; Mon, 13 Feb 2017 09:26:40 +0100 (CET) Received: from localhost (localhost [127.0.0.1]) by work.stormshield.eu (Postfix) with ESMTP id 789903760C58; Mon, 13 Feb 2017 09:26:40 +0100 (CET) Received: from work.stormshield.eu ([127.0.0.1]) by localhost (work.stormshield.eu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id gNa5HxvGZ8gM; Mon, 13 Feb 2017 09:26:40 +0100 (CET) Received: from pc-alex.localnet (fwlabo.stormshield.eu [10.2.0.1]) by work.stormshield.eu (Postfix) with ESMTP id 4E3BC3760C51; Mon, 13 Feb 2017 09:26:40 +0100 (CET) From: Alexandre Martins To: Mark Millard Cc: freebsd-arm , Ian Lepore Subject: Re: bcopy/memmove optimization broken ? [looks like you are correct to me, I give supporting detail] Date: Mon, 13 Feb 2017 09:28:31 +0100 Message-ID: <7424243.zp5tqGREgJ@pc-alex> Organization: STORMSHIELD User-Agent: KMail/4.14.10 (FreeBSD/10.3-RELEASE-p7; KDE/4.14.10; amd64; ; ) In-Reply-To: <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net> References: <5335118.oK1KXXDaG5@pc-alex> <25360EAB-3079-4037-9FB5-B7781ED40FA6@dsl-only.net> MIME-Version: 1.0 Content-Type: multipart/signed; boundary="nextPart2442143.bXjVvZE6mh"; micalg="sha256"; protocol="application/pkcs7-signature" X-BeenThere: freebsd-arm@freebsd.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: "Porting FreeBSD to ARM processors." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 13 Feb 2017 08:27:41 -0000 --nextPart2442143.bXjVvZE6mh Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="iso-8859-1" Le vendredi 10 f=E9vrier 2017, 18:11:23 Mark Millard a =E9crit : > On 2017-Feb-10, at 7:51 AM, Alexandre Martins wrote: > > Hi all > >=20 > > I see in the kernel code of bcopy/memmove an optimization if the co= pied > > region don't overlap: > >=20 > > https://svnweb.freebsd.org/base/head/sys/arm/arm/support.S?view=3Da= nnotate#l > > 403 > >=20 > > I'm a newbie in ARM assembly, so, sorry if i'm wrong, but > > - R0 is the dst (1st parameter) > > - R1 is the src (2nd parameter) > >=20 > > So : > > subcc r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ > > subcs r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ > > cmp r3, r2 /* if (r3 < len) we have an overlap */ > > bcc PIC_SYM(_C_LABEL(memcpy), PLT) > >=20 > > Seems to be inverted. Should it be that ? : > > subcs r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ > > subcc r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ > > cmp r3, r2 /* if (r3 < len) we have an overlap */ > > bcs PIC_SYM(_C_LABEL(memcpy), PLT) > >=20 > >=20 > > Best regards >=20 > I did not expect something so central that has been > around so long to look wrong but. . . >=20 > Going through the supporting details of the original > code based on my looking around here is what I found: >=20 > #include=09 > void *memmove(void *dest, const void *src, size_t n); >=20 > So I'd expect (c'ish notation): > r0=3D=3Ddest > r1=3D=3Dsrc > r2=3D=3Dn >=20 > Then for (the comments vs. code is being challenged): > (The comments do seem to give the intent correctly.) >=20 > cmp r0, r1 > RETeq /* Bail now if src/dst are the same */ > subcc r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ > subcs r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcc PIC_SYM(_C_LABEL(memcpy), PLT) > . . . >=20 > cmp r0,r1 should result in condition code (c'ish notation): >=20 > eq=3D(r0=3D=3Dr1) > cc=3D(r0 cs=3D(r0>=3Dr1) >=20 > (Only the r0 position has no immediate-value alternative.) >=20 >=20 > subcc r3,r0,r1 is: if (cc) r3=3Dr0-r1 // no condition code change > In other words: if (dst=20 > So it does not match the test listed in the comment as > far as I can see. And in (mathematical) integer arithmetic > the r3 result would be negative for dst-src. For > non-negative arithmetic (natural or whole): undefined. >=20 >=20 > subcs r3,r1,r0 is: if (cs) r3=3Dr1-r0 // no condition code change > In other words: if (dst>=3Dsrc) r3=3Dsrc-dst >=20 > So it does not match the test listed in the comment as > far as I can see. And in (mathematical) integer arithmetic > the r3 result would be nonpositive for src-dst. But the > earlier RETeq prevents the zero case, so: negative. For > non-negative arithmetic (natural or whole): undefined. >=20 >=20 > If it was only a normal mathemetical context r3=3D-abs(dst-src) > would be a summary of the two sub instruction sequence as it > is from what I can tell. >=20 > For the purpose the summary should be: r3=3Dabs(dst-src), given > dst!=3Dsrc . There is no need to wonder outside normal > mathematical non-negative arithmetic in the process either. >=20 > Your code change would have the right summary and use only > normal mathematical rules from what I can tell: >=20 > cmp r0, r1 > RETeq /* Bail now if src/dst are the same */ > subcs r3, r0, r1 /* if (dst > src) r3 =3D dst - src */ > subcc r3, r1, r0 /* if (src > dsr) r3 =3D src - dst */ > cmp r3, r2 /* if (r3 < len) we have an overlap */ > bcs PIC_SYM(_C_LABEL(memcpy), PLT) > . . . >=20 > subcs r3,r0,r1 is: if (cs) r3=3Dr0-r1 // no condition code change > In other words: if (dst>=3Dsrc) r3=3Ddst-src. > Given the prior RETeq, that is effectively: if (dst>src) r3=3Ddst-src= . > And that matches the comments and would produce a positive result > for r3, matching the normal mathematical result. >=20 > subcc r3,r1,r0 is: if (cc) r3=3Dr1-r0 // no condition code change > In other words: if (dst And that matches the comments and would produce a positive result > for r3, matching the normal mathematical result. >=20 > Overall summary of the two updated sub?? instructions: > r3=3Dabs(dst-src), given dst!=3Dsrc. >=20 > And that would make for an appropriate comparison to n (i.e., to r2).= >=20 > It appears to have been as it is now since -r143175 when memmove was > added to sys/arm/arm/support.S (2005-Apr-12). >=20 >=20 > =3D=3D=3D > Mark Millard > markmi at dsl-only.net Thank you for this deep anaysis ! I also made some benchmark. It seems that the "Xscale" version of=20 memcpy/memmove is slower that the standard "ARM" on my platform (armada= 388). I do the change by undefine the _ARM_ARCH_5E. #define _ARM_ARCH_5E =3D> Xscale version #undef _ARM_ARCH_5E =3D> "ARM" version There is my results: Block size: 2048 memcpy (Kernel ARM) : 1028.7 MB/s memmove (Kernel ARM) : 616.5 MB/s memcpy (Kernel xscale) : 920.1 MB/s memmove (Kernel xscale) : 618.8 MB/s Block size: 128 memcpy (Kernel ARM) : 1018.5 MB/s memmove (Kernel ARM) : 668.4 MB/s memcpy (Kernel xscale) : 825.9 MB/s memmove (Kernel xscale) : 668.6 MB/s Block size: 64 memcpy (Kernel ARM) : 892.9 MB/s memmove (Kernel ARM) : 667.2 MB/s memcpy (Kernel xscale) : 721.2 MB/s memmove (Kernel xscale) : 668.2 MB/s Block size: 32 memcpy (Kernel ARM) : 620.6 MB/s memmove (Kernel ARM) : 634.6 MB/s memcpy (Kernel xscale) : 504.9 MB/s memmove (Kernel xscale) : 634.5 MB/s Block size; 16 memcpy (Kernel ARM) : 471.8 MB/s memmove (Kernel ARM) : 464.5 MB/s memcpy (Kernel xscale) : 254.5 MB/s memmove (Kernel xscale) : 464.7 MB/s As you can see, more the size of the memcpy is small, more the standard= (ARM)=20 version of memcpy is faster. In addition, the libc version suffer the same problem, but is 15% more = efficiant. What can I do to help you on this point ? Best regards =2D-=20 Alexandre Martins STORMSHIELD --nextPart2442143.bXjVvZE6mh Content-Type: application/pkcs7-signature; name="smime.p7s" Content-Disposition: attachment; filename="smime.p7s" Content-Transfer-Encoding: base64 MIAGCSqGSIb3DQEHAqCAMIACAQExDzANBglghkgBZQMEAgEFADCABgkqhkiG9w0BBwEAAKCCCOUw ggSGMIICbqADAgECAgUA28zw7TANBgkqhkiG9w0BAQsFADBIMQswCQYDVQQGEwJGUjEUMBIGA1UE CgwLU1RPUk1TSElFTEQxIzAhBgNVBAMMGlN0b3Jtc2hpZWxkIFJvb3QgQXV0aG9yaXR5MB4XDTE0 MDkwNDE1MDcxMFoXDTI0MDkwMTE1MDcxMFowSTELMAkGA1UEBhMCRlIxFDASBgNVBAoMC1NUT1JN U0hJRUxEMSQwIgYDVQQDDBtTdG9ybXNoaWVsZCBVc2VycyBBdXRob3JpdHkwggEiMA0GCSqGSIb3 DQEBAQUAA4IBDwAwggEKAoIBAQDChwWgC/6VWKL7jgWI3eA2sVvRdOwuHcXsRAAXVWdlMC0ygg7u 45E78GhAnpdl8QbIu7x/Q2zOq6KttspwDEIjkoMLTZngLLlGjYJZPfuSoC6hl9R7vRd5f8Fhu3v0 xQ/7vzKYz4C836IGCrk31gmrPO0H0fxkyxCMfhoTTzue3oXW1IsmQwCrOPOu2Y82QANDhbifWLjI WJetnj58YRKR82KBs3Flbqxtp0mi9+IswMvCCRSoT+ORB73Cl6URt7Qm7BcD+qnkJ9uwlUC94dIl T2hX4ybY/w/ssA17Ew418fgyRCWQXzgjZgZ/XUcw2WP9dIggA7Pg+c/xeROJH1zvAgMBAAGjdjB0 MB0GA1UdDgQWBBShbYRsooCFBXx8dXWANMETW5fXgTAfBgNVHSMEGDAWgBS4Qqn6Z0Twf9NhjOyl x1CutL3sozAPBgNVHRMBAf8EBTADAQH/MA4GA1UdDwEB/wQEAwIBBjARBglghkgBhvhCAQEEBAMC AQYwDQYJKoZIhvcNAQELBQADggIBAE6C9zkt2J6dPm2KLbzRS6rBIYZNFi0X59g3ekQ2Sc4UWsq+ B3L86j5xnQSRnIM/DKV1+Q2UHbU/qsh4cto2fwTV6V+aJ07Vu/bJE1rAN4AI4V26ytf7VoBcBjVZ Jq8pHOMp/G2eQH7F1xqzml68DpKku66aUalkcC9IM82m7AW3YAyvDoYEAchv4qyL8qhVLLp6LNru 8ZOhMELhZLWl4ulw/SFDMhcBS6I4wC6icj71MLGSrr61vMktMdwQ+CGFQ5z5JbaxM61VgzKay8+g lw+xTbpnitrDfhkzHs2fdwOOur3vtNnNsrdBWiYPseJ2k4VGD7ov5kITQZckmZyF/V+Ir//agJQG VuwhDZCXgXOvrje+FLYp7tQ9pgSvLbluh1A+ywfyHnFI4n6tZy9SD3MIDgWR4KwFLM1Qmt3NQb32 tkq9Vm0jUcQXFfbnWKLA9krw3m8NmCqhL5PzpfOegYOc0QJWfMQamxeWxXMLk6uKisS//+VqfpCa 5Jx53t+9DmoN1+ob4jOprPaX6tfBBr5djah2yzPGjHEB52VgWXxIF9lCM2z7Qw+zFb2PIdNeSjIk NEFg/1orKAAa5gQXAQynN2J7E+aLf2XLhHcS0v+9yoisPEw9+Tb5F1uQh+gzYD5JUUYcYWncnX8g P8k6X+F5mQ/81IoNL/IejxJgy/LoMIIEVzCCAz+gAwIBAgIFAIUoy7swDQYJKoZIhvcNAQELBQAw STELMAkGA1UEBhMCRlIxFDASBgNVBAoMC1NUT1JNU0hJRUxEMSQwIgYDVQQDDBtTdG9ybXNoaWVs ZCBVc2VycyBBdXRob3JpdHkwHhcNMTYwOTAxMTUxMTA4WhcNMTcwOTAxMTUxMTA4WjBwMQswCQYD VQQGEwJGUjEUMBIGA1UECgwLU1RPUk1TSElFTEQxGjAYBgNVBAMMEUFsZXhhbmRyZSBNQVJUSU5T MS8wLQYJKoZIhvcNAQkBFiBhbGV4YW5kcmUubWFydGluc0BzdG9ybXNoaWVsZC5ldTCCASIwDQYJ KoZIhvcNAQEBBQADggEPADCCAQoCggEBAMN+CnvE13jKEwJ+OyMzwBpC02dY+LpD5luJwnJTVnV2 9aUjEMI+xGFMMHd9kSIVInbk4WDe1ELOKerg0dzgnkRiOHECSGum1UhcZABxQgY2cmSffNQ6JVro 52UaBlt3aTOk3imYJCHUIGgOWMvOtRc8BxyBHdi15FZPj/F9I+AKufRFsBXUakplFIAPEwy3m2eR a/eCMLqGJUyK7YmsAlEnYn2mA38zIoqtKvL6KPHtrV8fw1SRLQ13+j9nu1LlCaqhmLtILFxhV0/9 uDTvx5cKtZ8Xj1nPM6NUUrso9qlXwm4On6Y34pVTtnYGMQRuljil3Hiz84RJjPDJYRGwbgkCAwEA AaOCAR0wggEZMB0GA1UdDgQWBBTmRLIwSfhNwbdfV13xt0G0JHYjPDAfBgNVHSMEGDAWgBShbYRs ooCFBXx8dXWANMETW5fXgTAJBgNVHRMEAjAAMA4GA1UdDwEB/wQEAwID6DARBglghkgBhvhCAQEE BAMCBLAwHQYDVR0lBBYwFAYIKwYBBQUHAwQGCCsGAQUFBwMCMEoGA1UdHwRDMEEwP6A9oDuGOWh0 dHBzOi8vcGtpLm5ldGFzcS5jb20vYXV0aC9jZXJ0aWZpY2F0ZXJldm9jYXRpb25saXN0LmNybDAR BgNVHSAECjAIMAYGBFUdIAAwKwYDVR0RBCQwIoEgYWxleGFuZHJlLm1hcnRpbnNAc3Rvcm1zaGll bGQuZXUwDQYJKoZIhvcNAQELBQADggEBALT9NWiAaE6nDev34vShhsyb9lWBOQfCnAMyKwtFy/cU uIoHsxyOanIIQHz0ZtB76GCHDo7RStMyp6RYIefIsxABLhSr4hHapJka9g/X/nxexyr0xyT3IpYQ dmyMSHRT18Z/ZaBlQdyfnS2PYkPHJAHl4iqB4SnQlh3rwFdKTJMgCz413cDxQHytgRPGTiXOhyV7 aS3ANJFha6ZHA8HU9sTslY8ZXSUu94iD3t2kcF3gBb432UKALwryKqnrwzFX68pFpqO5QAjEHaF6 6p1agMb2b3HlQGZrME5wSO6rsZJPYvJEyvrwHxCxjSTkOdPw6GriWGTMrVMU0fVrfptMS1gxggIT MIICDwIBATBSMEkxCzAJBgNVBAYTAkZSMRQwEgYDVQQKDAtTVE9STVNISUVMRDEkMCIGA1UEAwwb U3Rvcm1zaGllbGQgVXNlcnMgQXV0aG9yaXR5AgUAhSjLuzANBglghkgBZQMEAgEFAKCBkzAYBgkq hkiG9w0BCQMxCwYJKoZIhvcNAQcBMBwGCSqGSIb3DQEJBTEPFw0xNzAyMTMwODI4MzFaMCgGCSqG SIb3DQEJDzEbMBkwCwYJYIZIAWUDBAECMAoGCCqGSIb3DQMHMC8GCSqGSIb3DQEJBDEiBCBPtmZr zlzEZ1n86lp3L69T+qUFzB/lFpNI6DhE+cRMDjANBgkqhkiG9w0BAQEFAASCAQCEf5zQhj1rSPh0 ORcAe9GoYYGyXP0538OLfHfiWGFX2AuUTtka3zxkperwctKhl6pUwrDGPs3XqQWwwYRywk0RZL+s 1haTlNsJSsLLOqZVVKPdsd3jC0Me/As5pfOvQ6TReVvVfGbk3KxWyevWO7ClGEfpt+9PstxI9ggi iZ1kyB3FlrcrcVCIpY4QGrrIk42S93Hi49+t7QKjaRq+1rKALsQombtsJsWPfxiOAxKLVdQTdiJ0 rfDRvhD87ljsfwLuMCNih1rnvTNDTZwZwu5feGMDJFPGfvTtqAJC6H0hM8ohM/dznnMFc6stlIq6 m8/Oy805sXn9uEtVSGR1+rdrAAAAAAAA --nextPart2442143.bXjVvZE6mh--