Date:      Thu, 27 Dec 2012 23:39:44 +1100 (EST)
From:      Bruce Evans <brde@optusnet.com.au>
To:        Alfred Perlstein <alfred@ixsystems.com>
Cc:        "arch@freebsd.org" <arch@freebsd.org>, Adrian Chadd <adrian@freebsd.org>, Alfred Perlstein <bright@mu.org>, Rui Paulo <rpaulo@freebsd.org>
Subject:   Re: UPDATE Re: making use of userland dtrace on FreeBSD
Message-ID:  <20121227214354.V965@besplex.bde.org>
In-Reply-To: <50DBE0DB.6090804@ixsystems.com>
References:  <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <CAGE5yCq46NFKKzSUZq=jz0NwEnWdjPTK_0fpZ%2BwWV9FA0BSQCg@mail.gmail.com> <50DBD193.7080505@mu.org> <CAGE5yCrnoNhOh3VaYU3bO6BwA=bpxD5QzkZvD%2BHaUwvXNQ%2BUfw@mail.gmail.com> <50DBE0DB.6090804@ixsystems.com>

On Wed, 26 Dec 2012, Alfred Perlstein wrote:

> On 12/26/12 9:32 PM, Peter Wemm wrote:
>> On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein <bright@mu.org> wrote:
>>> On 12/26/12 8:21 PM, Peter Wemm wrote:
>>>> On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein <bright@mu.org> wrote:
>>>> 
>>>>> What would be the drawbacks?  I don't want to hurt freebsd for heavy
>>>>> performance, but I think this functionality should work out of the box
>>>>> for
>>>>> most people.

It might cost as much as 0.1% performance (more on pre-PentiumPro x86).
Frame pointer use is very parallelizable so it is often free.

>>>> The drawbacks are mostly performance related.  It defeats a certain
>>>> hardware optimizations for call/return on leaf functions.  It'll
>>>> mostly affect things like math, crypto, compression and multimedia
>>>> libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we
>>>> generally don't seem to care about that sort of performance anyway, so
>>>> what's one more loss?
>>> 
>>> Can you clarify some?  If it was somewhat easy to re-add
>>> -fomit-frame-pointer to critical libraries like this, then that would be 
>>> OK?
>> No, you can't add MD flags like this.  The way to do it is see things
>> like PIC, WARNS, etc where you can do overrides of defaults on a
>> directory basis, and respect the system-wide user overrides.
>> 
>> Remember, -fno-omit-frame-pointer is the default on i386 (except at
>> high -O levels with gcc,

Not except at high -O levels.  gcc -O666 doesn't omit the frame pointer
even for leaf functions.

>> I don't know where clang, the default

>> compiler, draws the line).  Other platforms don't even have frame
>> pointers.  You can't just scatter that switch around the place.

clang is very incompatible.  It omits the frame pointer even for non-leaf
functions, even for -O1.  For example, it does tail-call optimization for
'void bar(void); void foo(void) { bar(); }' and reduces this to a single
jmp instruction, while gcc generates a call to bar plus a return, plus
3 instructions for frame pointer initialization and finalization, plus
1 instruction for its stack alignment pessimization.  This might explain
why debugging is even more broken with clang than with gdb.

Here is a slightly larger program to test the optimization.

% volatile int b;
% volatile int f;
% 
% void
% bar(void)
% {
% 	b++;
% }
% 
% void
% foo(void)
% {
% 	f++;
% 	bar();
% }
% 
% int
% main(void)
% {
% 	int i;
% 
% 	for (i = 0; i < 100000000; i++)
% 		foo();
% 	return (0);
% }

I had to put the volatiles in to prevent the function calls being optimized
away (gcc broke its promise not to optimize away loops like this in gcc-4).
This program takes 0.34 seconds on freefall with clang -O and 0.65 seconds
with gcc -O.  But the faster speed with clang has nothing to do with
-fomit-frame-pointer.  It is because clang inlines everything, so that
the main loop does just 'f++; b++;'.  clang produces this loop even with
-g, so debugging is completely broken with clang (breakpoints in the
functions don't work).  Debugging works correctly with gcc.  Profiling
is even more broken than debugging with clang.  clang generates calls
to .mcount in the places where it inlines functions, but this cannot
work in FreeBSD (and in fact just wastes time to make a mess), since in
FreeBSD .mcount is optimized to not take an explicit arg identifying
the caller, so for the above it always identifies the wrong caller for
foo().  -finstrument-functions seems to be less broken, but doesn't work
for either clang or gcc.  Both clang and gcc generate calls to
__cyg_profile_func_enter/exit() for both actual functions and for inlined
functions.  __cyg_profile_func_enter() corresponds to .mcount, and
__cyg_profile_func_exit() corresponds to FreeBSD's (my) .mexitcount feature
(.mexitcount is broken (null) in gcc-4.2 and broken (nonexistent) in
clang), except the __cyg* functions are pessimized to take 2 explicit
args identifying the caller, so they can work; they don't actually work
since they are nonexistent in FreeBSD (except in x86 kernels in old versions,
where they were used transiently to work around breakage of .mexitcount).
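For readers who want to experiment with -finstrument-functions anyway,
here is a minimal sketch of user-supplied hooks, assuming a gcc or clang
that accepts the GNU no_instrument_function attribute.  The counters and
the hook_*_count() accessors are hypothetical names invented for this
illustration; only the two __cyg_profile_func_* signatures come from the
compiler's documented interface.

```c
/* Hypothetical counters, for illustration only. */
static unsigned long enter_count;
static unsigned long exit_count;

/*
 * The hooks and their helpers must not be instrumented themselves,
 * otherwise each hook would call itself recursively.
 */
void __cyg_profile_func_enter(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
void __cyg_profile_func_exit(void *this_fn, void *call_site)
    __attribute__((no_instrument_function));
unsigned long hook_enter_count(void)
    __attribute__((no_instrument_function));
unsigned long hook_exit_count(void)
    __attribute__((no_instrument_function));

/* Called on function entry when built with -finstrument-functions. */
void
__cyg_profile_func_enter(void *this_fn, void *call_site)
{
	(void)this_fn;		/* address of the function being entered */
	(void)call_site;	/* address it was called from */
	enter_count++;
}

/* Called on function exit when built with -finstrument-functions. */
void
__cyg_profile_func_exit(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
	exit_count++;
}

unsigned long
hook_enter_count(void)
{
	return (enter_count);
}

unsigned long
hook_exit_count(void)
{
	return (exit_count);
}
```

Linking this file into a program built with -finstrument-functions should
make the counters advance on every instrumented entry and exit, including
the inlined ones mentioned above.  It only counts; it does nothing with
the two explicit args.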

After working around these bugs by putting the functions in separate files
(and removing the now-unneeded volatiles):

main.c:
% void foo(void);
% 
% int
% main(void)
% {
% 	int i;
% 
% 	for (i = 0; i < 100000000; i++)
% 		foo();
% }

foo.c:
% void bar(void);
% 
% void
% foo(void)
% {
% 	bar();
% }

bar.c:
% void
% bar(void)
% {
% }

we can see how much the frame pointer optimization is saving: this
now takes 0.43 seconds with clang and 0.87 seconds with gcc.  It
is weird that the gcc time increased from 0.65 seconds to 0.87
despite doing less.  After adding back the volatiles, the times
are 0.43 seconds with clang and 0.85 seconds with gcc -- doing
more gave a small optimization, but didn't recover 0.65 seconds.
There is apparently some magic alignment or misalignment which
costs or saves about the same as omitting the frame pointer.
Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60
seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49
seconds, and with gcc -O2, it takes 0.49 seconds (this really doesn't
omit frame pointers, so omitting the frame pointer saves nothing).
With cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this
case is just broken -- the -fno-omit-frame-pointer is silently ignored :-(.
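A possible alternative to splitting the test into separate files is to
keep it in one file and forbid inlining explicitly.  The sketch below
assumes the GNU noinline attribute, which both gcc and clang accept; it
was not used for the timings above.  The NOINLINE macro and run_calls()
are hypothetical names for this illustration.  The empty asm statement
follows the gcc documentation's advice for keeping calls to side-effect-free
noinline functions from being optimized away.

```c
/* Hypothetical convenience macro wrapping the GNU attribute. */
#define	NOINLINE	__attribute__((noinline))

NOINLINE void
bar(void)
{
	/* Empty asm: stops the compiler from deleting calls to bar(). */
	__asm__ volatile ("");
}

NOINLINE void
foo(void)
{
	bar();
}

/* Hypothetical driver: calls foo() n times and returns the count. */
long
run_calls(long n)
{
	long i;

	for (i = 0; i < n; i++)
		foo();
	return (i);
}
```

Note that noinline does not stop the compiler from turning the call in
foo() into a tail call (a plain jmp), so this isolates inlining only, not
frame pointer setup.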

% > Agreed!    It seems that -fno-omit-frame-pointer documentation is a bit 
% > strange, the manual page indicates:
% >>            -O also turns on -fomit-frame-pointer on machines where doing so
% >>            does not interfere with debugging.
% > Then goes on to specify that under the actual option that it's turned on 
% > under -O, -O2, -O3, etc.

The latter is just wrong for i386 (see above).  The former may be
correct and differ for amd64 because amd64 has better debugging info
and thus can afford to omit the frame pointer more often.  However,
I've seen anomalies for debugging.  I forget the details, but remember
that one of i386 and amd64 worked better for debugging libm.  Another
floating point debugging strangeness is that gdb understands XMM
registers better on i386 than on amd64!  To see this, try 'gdb
/bin/cat'.  Run the program and stop it with ^C.  Then p $xmm0 shows
deficient info for amd64.  But amd64 actually uses XMM registers for
floating point (clang with certain -march also bogusly uses
them on i386 too).  Thus displaying of XMM is broken where it is most
needed.

>>>> Of course it wouldn't be required with dwarf unwinding awareness, but
>>>> we don't have that.

Perhaps the clang optimizations depend on this.

>>>> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging
>>>> is compiled in because there's no unwinder for doing stack traces.  We

Hmm, I didn't notice that.  It is also done unconditionally in kmod.mk.
It is also done conditionally for powerpc kernels and unconditionally
for powerpc kmods.  -fno-inline-functions-called-once should be done
under the same conditions, to unbreak the stack trace for such functions.
Unfortunately, this only works for gcc.  For gcc, it is only needed for
static functions.  The above shows that it is needed even more for clang,
since clang inlines non-static functions in the same file (perhaps if
they are called more than once?).  But -fno-inline-functions-called-once
is broken (unsupported, and a warning) for clang.

>>>> need a dwarf2+ unwinder and somebody to instrument the call frame
>>>> state through the remaining assembler code.

I wouldn't want it for ddb.  ddb doesn't have access to any debug info
except the symbol table.
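To illustrate why frame pointers are enough for a debugger that has only
the symbol table, here is a minimal sketch of a frame-pointer walk.  It
assumes the conventional x86 layout in which each frame starts with the
saved caller frame pointer followed by the return address.  The struct
and function names are hypothetical and are not the ones used by ddb or
stack(9).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed frame layout: saved frame pointer, then return address. */
struct frame {
	const struct frame *fr_savfp;	/* caller's frame pointer */
	uintptr_t fr_savpc;		/* return address in the caller */
};

/*
 * Follow the chain of saved frame pointers starting at fp, storing up
 * to max return addresses in pcs[].  Returns the number stored.  No
 * debug info is consulted; the caller can map each pc to a name using
 * only the symbol table.
 */
size_t
walk_frames(const struct frame *fp, uintptr_t *pcs, size_t max)
{
	size_t n = 0;

	while (fp != NULL && n < max) {
		pcs[n++] = fp->fr_savpc;
		/*
		 * The stack grows down, so a caller's frame must be at
		 * a higher address.  Stop on anything else to avoid
		 * looping on a corrupt chain.
		 */
		if (fp->fr_savfp != NULL && fp->fr_savfp <= fp)
			break;
		fp = fp->fr_savfp;
	}
	return (n);
}
```

If a function omits its frame pointer, or is inlined, it has no link in
this chain, so the walk skips it or produces garbage from that point on.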

>>>> 
>>> How much work is that exactly?  I've only been a gdb user, not a hacker.
>> gdb has a stack unwinder.  kdb/ddb/stack(9) do not.  There's well
>> established GPL code to do it, as well as libunwind and variants.
>> Basically what this code has to do is run the dwarf2+ state machine to
>> find all the call/return frames instead of assuming the compiler did
>> it.  Heck, even glibc has a dwarf2 unwinder built into it as part of
>> their exception processing system.
>> 
>> I'm not entirely sure what more work src/lib/libelf and
>> src/lib/libdwarf need.  It looks like its got just enough implemented
>> to support the ctfconvert etc and doesn't have an unwinder in it.
>> 
> This really seems beyond my skill level / time allotment.  Let's see where 
> the numbers put us in terms of system performance and then we can make a call 
> on it.
>
> I'd rather take a few % of perf for the power of dtrace, but not if that % is 
> double digits.

Since -fno-omit-frame-pointer is broken (silently ignored) for clang, using
it won't make any difference.  The only ways I could find to get frame pointers
with clang were -pg (profiling) and -finstrument-functions.

Bruce


