Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 22 Nov 2011 22:21:08 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        mdf@FreeBSD.org
Cc:        svn-src-head@FreeBSD.org, svn-src-all@FreeBSD.org, src-committers@FreeBSD.org, Eitan Adler <eadler@FreeBSD.org>
Subject:   Re: svn commit: r227812 - head/lib/libc/string
Message-ID:  <20111122210422.E8674@besplex.bde.org>
In-Reply-To: <CAMBSHm8HfW-XTSk9zQda4vDmbrzyEG_vnydb5JZyHaNh2rF4gw@mail.gmail.com>
References:  <201111220250.pAM2oPWC070856@svn.freebsd.org> <CAMBSHm8HfW-XTSk9zQda4vDmbrzyEG_vnydb5JZyHaNh2rF4gw@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--0-925363442-1321960868=:8674
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE

On Mon, 21 Nov 2011 mdf@FreeBSD.org wrote:

> On Mon, Nov 21, 2011 at 6:50 PM, Eitan Adler <eadler@freebsd.org> wrote:
>> Author: eadler (ports committer)
>> Date: Tue Nov 22 02:50:24 2011
>> New Revision: 227812
>> URL: http://svn.freebsd.org/changeset/base/227812
>>
>> Log:
>> =A0- fix some style(9) nits with my last commit
>> =A0- add a comment explaining why I used '|' instead of '||'
>>
>> =A0Submitted by: danfe@
>> =A0Approved by: =A0emaste@
>>
>> Modified:
>> =A0head/lib/libc/string/strcasecmp.c
>> =A0head/lib/libc/string/strncmp.c

Hard 0xa0's make the quotes almost unreadable :-(.

>> Modified: head/lib/libc/string/strcasecmp.c
>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D
>> --- head/lib/libc/string/strcasecmp.c =A0 Tue Nov 22 02:27:59 2011 =A0 =
=A0 =A0 =A0(r227811)
>> +++ head/lib/libc/string/strcasecmp.c =A0 Tue Nov 22 02:50:24 2011 =A0 =
=A0 =A0 =A0(r227812)
>> @@ -49,7 +49,7 @@ strcasecmp_l(const char *s1, const char
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us1 =3D (const u_char *)=
s1,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us2 =3D (const u_char *)=
s2;
>> =A0 =A0 =A0 =A0if (s1 =3D=3D s2)
>> - =A0 =A0 =A0 =A0 =A0 return (0);
>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return (0);
>>
>> =A0 =A0 =A0 =A0FIX_LOCALE(locale);
>>
>> @@ -73,8 +73,9 @@ strncasecmp_l(const char *s1, const char
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us1 =3D (const u_char *)=
s1,
>> =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0*us2 =3D (const u_char *)=
s2;
>>
>> - =A0 =A0 =A0 if (( s1 =3D=3D s2) | (n =3D=3D 0))
>> - =A0 =A0 =A0 =A0 =A0 return (0);
>> + =A0 =A0 =A0 /* use a bitwise or to avoid an additional branch instruct=
ion */
>> + =A0 =A0 =A0 if ((s1 =3D=3D s2) | (n =3D=3D 0))
>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 return (0);
>
> I guess I'm a little confused.  Do we really have profiling
> information at this level that suggests the overhead of the branch is
> significant?  I thought most hardware had pretty good
> branch-prediction, particularly with speculative execution.

Every time I have profiled the string functions, changes like this always
have an insignificant effect, even in benchmarks.  Normal code spends
perhaps 0.01% of its time calling string functions, so the improvements
from optimizing string functions are 0.01% of an insignificant amount.

If this optimization were any good, then the compiler would already do
it.  In fact, gcc-4.2.1 already does it -- the reverse of it -- it rewrites=
:

 =09"if ((i =3D=3D 0) | (j =3D=3D 0)) return; test();"

into:

 =09"if (i =3D=3D 0 || j =3D=3D 0) return; test();"

with most optimization flags that I tried.  This happens even at -O
with -march=3Di386, although since a plain i386 doesn't have branch
target prediction and has slow branches, the first version is probably
actually faster for a plain i386.  But gcc-3.3.3 doesn't do this
rewrite, even with -O3 -march=3Dathlon-xp, For i386, the generated code
is quite slow:

gcc-3.3.3 -O -march=3Di386:
% =09cmpl=09$0, i
% =09sete=09%dl
% =09cmpl=09$0, j
% =09sete=09%al
% =09orl=09%edx, %eax
% =09testb=09$1, %al
% =09jne=09.L3
% =09...=09=09=09# call to test() not shown here or below
% .L3:
% =09leave
% =09ret

`sete' is quite slow on plain i386 too.  For athlon-xp:

gcc-3.3.3 -O3 -march=3Dathlon-xp:
% =09movl=09j, %edx
% =09movl=09i, %eax
% =09testl=09%eax, %eax
% =09sete=09%cl
% =09testl=09%edx, %edx
% =09sete=09%al
% =09orl=09%ecx, %eax
% =09testb=09$1, %al
% =09je=09.L4
% .L3:
% =09leave
% =09ret

This is essentially the same, except it uses load-test instead of cmp-mem,
and the branch is taken in the opposite case.

Weirdly, gcc42 does almost exactly the opposite with i386 and athlon-xp:

gcc-4.2.1 -O -march=3Di386:
% =09cmpl=09$0, i
% =09je=09.L5
% =09cmpl=09$0, j
% =09je=09.L5
% =09...
% .L5:
% =09leave
% =09ret

gcc-4.2.1 -O3 -march=3Dathlon-xp:
% =09movl=09i, %edx
% =09movl=09j, %eax
% =09movl=09%esp, %ebp
% =09testl=09%edx, %edx
% =09sete=09%dl
% =09testl=09%eax, %eax
% =09sete=09%al
% =09orb=09%al, %dl
% =09je=09.L7
% =09leave
% =09ret

I guess this is because it knows that 'sete' is slow on i386.  But for
nocona, it goes back to rewriting the expression to use || so that there
are 2 branches:

gcc-4.2.1 -O3 -march=3Dnocona:
% =09movl=09i, %edx
% =09testl=09%edx, %edx
% =09je=09.L5
% =09movl=09j, %eax
% =09testl=09%eax, %eax
% =09jne=09.L7
% .L5:
% =09popl=09%ebp
% =09ret

This is the same as for plain i386 and plain -O, except it use load-test
instead of cmp-mem the same as gcc-3.3.3 does for athlon-xp.  It doesn't
take the branch in the opposite case like gcc-3.3.3 does for athlon-xp.

> Wouldn't something like __predict_false() have more value for
> performance,

Unlikely.  Both writing | vs ||, and writing __predict_foo(), are only
hints to the compiler, since the compiler knows how to rewrite the
former but probably shouldn't unless it is sure, and it can ignore
__predict_foo() but probably shouldn't unless it is sure.  Or perhaps
it trusts the programmer's directives for branches as much as ones for
`register'.  Similarly for `if (foo) ...; else ...;' vs `if (!foo)
etc' - the programmer may have chosen the order carefully, but who
knows?  gcc-3-3.3 and gcc-4.2.1 -O3 should know about the behaviour
of branches on athlon-xp, but the above shows that they disagree about
whether the branch should be taken.  IIRC, with a cold branch target
cache on athlon-xp, forward branches are predicted as not taken and
backwards branches are predictied as taken (so that loops work right).
Since returning is the unusual case, we want to branch to it.  This
is what happens on i386 and nocona, but not on athlon-xp (with both
compilers).  Perhaps I remember the rule backwards but gcc knows it.
On plain i386, taken branches are slower, so branching to the return
is best, and is done.  OTOH, if forward branches are predicted as being
not taken, and the programmer tries to for optimize this without using
__predict_foo(), then the program order should probably be:

 =09if (foo)
 =09=09usual_case();
 =09else
 =09=09unusual_case();

This conflicts with early returns probably being the unusual case,
since

 =09if (foo)
 =09=09return;
 =09otherstuff();
 =09if (bar)
 =09=09return;
 =09morestuff();

doesn't require excessive nesting like:

 =09if (!foo) {
 =09=09otherstuff();
 =09=09if (!bar)
 =09=09=09morestuff();
 =09}
 =09return;

so hopefully the compiler generates branches to returns.  But why does
gcc do the opposite on athlon-xp?

I think using __predict_foo() is only useful in much larger functions
than the ones here.  Most modern CPUs have very good branch predictors
so an extra branch or two doesn't matter (each predicted branch takes
~1 cycle, the same as an arithmetic OR operator).  I believe the wins
from using __predict_foo() are mainly when it allows moving large code
for the unusual case far away, leaving only small code for the usual
case nearby; the code for the unusual case then doesn't usually interfere
with caches.

> or is all this just guess-work?  I would much rather have
> the code say what it means unless there's real, measurable performance
> differences from doing otherwise.

The best optimization is extremely machine-dependent.  I usually hope
the compiler knows how to do it and don't try to optimize stuff like
this.  I usually find that the compiler barely knows what to do, but
it can do better than me across a range of CPUs.

Bruce
--0-925363442-1321960868=:8674--



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20111122210422.E8674>