From owner-svn-src-head@FreeBSD.ORG Tue Nov 22 11:21:14 2011 Return-Path: Delivered-To: svn-src-head@FreeBSD.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id 628CA106564A; Tue, 22 Nov 2011 11:21:14 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail11.syd.optusnet.com.au (mail11.syd.optusnet.com.au [211.29.132.192]) by mx1.freebsd.org (Postfix) with ESMTP id BC7E28FC15; Tue, 22 Nov 2011 11:21:13 +0000 (UTC) Received: from c211-28-227-231.carlnfd1.nsw.optusnet.com.au (c211-28-227-231.carlnfd1.nsw.optusnet.com.au [211.28.227.231]) by mail11.syd.optusnet.com.au (8.13.1/8.13.1) with ESMTP id pAMBL93X021955 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 22 Nov 2011 22:21:10 +1100 Date: Tue, 22 Nov 2011 22:21:08 +1100 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: mdf@FreeBSD.org In-Reply-To: Message-ID: <20111122210422.E8674@besplex.bde.org> References: <201111220250.pAM2oPWC070856@svn.freebsd.org> MIME-Version: 1.0 Content-Type: MULTIPART/MIXED; BOUNDARY="0-925363442-1321960868=:8674" Cc: svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Eitan Adler Subject: Re: svn commit: r227812 - head/lib/libc/string X-BeenThere: svn-src-head@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: SVN commit messages for the src tree for head/-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 22 Nov 2011 11:21:14 -0000 This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --0-925363442-1321960868=:8674 Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE On Mon, 21 Nov 2011 mdf@FreeBSD.org wrote: > On Mon, Nov 21, 2011 at 6:50 PM, Eitan Adler wrote: >> Author: eadler (ports committer) >> Date: Tue Nov 22 02:50:24 2011 >> New Revision: 227812 >> URL: http://svn.freebsd.org/changeset/base/227812 >> >> Log: >> =A0- fix some style(9) nits with my last commit >> =A0- add a comment explaining why I used '|' instead of '||' >> >> =A0Submitted by: danfe@ >> =A0Approved by: =A0emaste@ >> >> Modified: >> =A0head/lib/libc/string/strcasecmp.c >> =A0head/lib/libc/string/strncmp.c Hard 0xa0's make the quotes almost unreadable :-(. >> Modified: head/lib/libc/string/strcasecmp.c >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D >> --- head/lib/libc/string/strcasecmp.c =A0 Tue Nov 22 02:27:59 2011 =A0 = =A0 =A0 =A0(r227811) >> +++ head/lib/libc/string/strcasecmp.c =A0 Tue Nov 22 02:50:24 2011 =A0 = =A0 =A0 =A0(r227812) >> @@ -49,7 +49,7 @@ strcasecmp_l(const char *s1, const char >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us1 =3D (const u_char *)= s1, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us2 =3D (const u_char *)= s2; >> =A0 =A0 =A0 =A0if (s1 =3D=3D s2) >> - =A0 =A0 =A0 =A0 =A0 return (0); >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return (0); >> >> =A0 =A0 =A0 =A0FIX_LOCALE(locale); >> >> @@ -73,8 +73,9 @@ strncasecmp_l(const char *s1, const char >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us1 =3D (const u_char *)= s1, >> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us2 =3D (const u_char *)= s2; >> >> - =A0 =A0 =A0 if (( s1 =3D=3D s2) | (n =3D=3D 0)) >> - =A0 =A0 =A0 =A0 =A0 return (0); >> + =A0 =A0 =A0 /* use a bitwise or to avoid an additional branch instruct= ion */ >> + =A0 =A0 =A0 if ((s1 =3D=3D s2) | (n =3D=3D 0)) >> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return (0); > > I guess I'm a little confused. Do we really have profiling > information at this level that suggests the overhead of the branch is > significant? I thought most hardware had pretty good > branch-prediction, particularly with speculative execution. Every time I have profiled the string functions, changes like this always have an insignificant effect, even in benchmarks. Normal code spends perhaps 0.01% of its time calling string functions, so the improvements from optimizing string functions are 0.01% of an insignificant amount. If this optimization were any good, then the compiler would already do it. In fact, gcc-4.2.1 already does it -- the reverse of it -- it rewrites= : =09"if ((i =3D=3D 0) | (j =3D=3D 0)) return; test();" into: =09"if (i =3D=3D 0 || j =3D=3D 0) return; test();" with most optimization flags that I tried. This happens even at -O with -march=3Di386, although since a plain i386 doesn't have branch target prediction and has slow branches, the first version is probably actually faster for a plain i386. But gcc-3.3.3 doesn't do this rewrite, even with -O3 -march=3Dathlon-xp, For i386, the generated code is quite slow: gcc-3.3.3 -O -march=3Di386: % =09cmpl=09$0, i % =09sete=09%dl % =09cmpl=09$0, j % =09sete=09%al % =09orl=09%edx, %eax % =09testb=09$1, %al % =09jne=09.L3 % =09...=09=09=09# call to test() not shown here or below % .L3: % =09leave % =09ret `sete' is quite slow on plain i386 too. For athlon-xp: gcc-3.3.3 -O3 -march=3Dathlon-xp: % =09movl=09j, %edx % =09movl=09i, %eax % =09testl=09%eax, %eax % =09sete=09%cl % =09testl=09%edx, %edx % =09sete=09%al % =09orl=09%ecx, %eax % =09testb=09$1, %al % =09je=09.L4 % .L3: % =09leave % =09ret This is essentially the same, except it uses load-test instead of cmp-mem, and the branch is taken in the opposite case. Weirdly, gcc42 does almost exactly the opposite with i386 and athlon-xp: gcc-4.2.1 -O -march=3Di386: % =09cmpl=09$0, i % =09je=09.L5 % =09cmpl=09$0, j % =09je=09.L5 % =09... % .L5: % =09leave % =09ret gcc-4.2.1 -O3 -march=3Dathlon-xp: % =09movl=09i, %edx % =09movl=09j, %eax % =09movl=09%esp, %ebp % =09testl=09%edx, %edx % =09sete=09%dl % =09testl=09%eax, %eax % =09sete=09%al % =09orb=09%al, %dl % =09je=09.L7 % =09leave % =09ret I guess this is because it knows that 'sete' is slow on i386. But for nocona, it goes back to rewriting the expression to use || so that there are 2 branches: gcc-4.2.1 -O3 -march=3Dnocona: % =09movl=09i, %edx % =09testl=09%edx, %edx % =09je=09.L5 % =09movl=09j, %eax % =09testl=09%eax, %eax % =09jne=09.L7 % .L5: % =09popl=09%ebp % =09ret This is the same as for plain i386 and plain -O, except it use load-test instead of cmp-mem the same as gcc-3.3.3 does for athlon-xp. It doesn't take the branch in the opposite case like gcc-3.3.3 does for athlon-xp. > Wouldn't something like __predict_false() have more value for > performance, Unlikely. Both writing | vs ||, and writing __predict_foo(), are only hints to the compiler, since the compiler knows how to rewrite the former but probably shouldn't unless it is sure, and it can ignore __predict_foo() but probably shouldn't unless it is sure. Or perhaps it trusts the programmer's directives for branches as much as ones for `register'. Similarly for `if (foo) ...; else ...;' vs `if (!foo) etc' - the programmer may have chosen the order carefully, but who knows? gcc-3-3.3 and gcc-4.2.1 -O3 should know about the behaviour of branches on athlon-xp, but the above shows that they disagree about whether the branch should be taken. IIRC, with a cold branch target cache on athlon-xp, forward branches are predicted as not taken and backwards branches are predictied as taken (so that loops work right). Since returning is the unusual case, we want to branch to it. This is what happens on i386 and nocona, but not on athlon-xp (with both compilers). Perhaps I remember the rule backwards but gcc knows it. On plain i386, taken branches are slower, so branching to the return is best, and is done. OTOH, if forward branches are predicted as being not taken, and the programmer tries to for optimize this without using __predict_foo(), then the program order should probably be: =09if (foo) =09=09usual_case(); =09else =09=09unusual_case(); This conflicts with early returns probably being the unusual case, since =09if (foo) =09=09return; =09otherstuff(); =09if (bar) =09=09return; =09morestuff(); doesn't require excessive nesting like: =09if (!foo) { =09=09otherstuff(); =09=09if (!bar) =09=09=09morestuff(); =09} =09return; so hopefully the compiler generates branches to returns. But why does gcc do the opposite on athlon-xp? I think using __predict_foo() is only useful in much larger functions than the ones here. Most modern CPUs have very good branch predictors so an extra branch or two doesn't matter (each predicted branch takes ~1 cycle, the same as an arithmetic OR operator). I believe the wins from using __predict_foo() are mainly when it allows moving large code for the unusual case far away, leaving only small code for the usual case nearby; the code for the unusual case then doesn't usually interfere with caches. > or is all this just guess-work? I would much rather have > the code say what it means unless there's real, measurable performance > differences from doing otherwise. The best optimization is extremely machine-dependent. I usually hope the compiler knows how to do it and don't try to optimize stuff like this. I usually find that the compiler barely knows what to do, but it can do better than me across a range of CPUs. Bruce --0-925363442-1321960868=:8674--