Date: Thu, 27 Dec 2012 23:39:44 +1100 (EST)
From: Bruce Evans <brde@optusnet.com.au>
To: Alfred Perlstein <alfred@ixsystems.com>
Cc: "arch@freebsd.org" <arch@freebsd.org>, Adrian Chadd <adrian@freebsd.org>, Alfred Perlstein <bright@mu.org>, Rui Paulo <rpaulo@freebsd.org>
Subject: Re: UPDATE Re: making use of userland dtrace on FreeBSD
Message-ID: <20121227214354.V965@besplex.bde.org>
In-Reply-To: <50DBE0DB.6090804@ixsystems.com>
References: <50D49DFF.3060803@ixsystems.com> <50DBC7E2.1070505@mu.org> <CAGE5yCq46NFKKzSUZq=jz0NwEnWdjPTK_0fpZ%2BwWV9FA0BSQCg@mail.gmail.com> <50DBD193.7080505@mu.org> <CAGE5yCrnoNhOh3VaYU3bO6BwA=bpxD5QzkZvD%2BHaUwvXNQ%2BUfw@mail.gmail.com> <50DBE0DB.6090804@ixsystems.com>

On Wed, 26 Dec 2012, Alfred Perlstein wrote:

> On 12/26/12 9:32 PM, Peter Wemm wrote:
>> On Wed, Dec 26, 2012 at 8:41 PM, Alfred Perlstein <bright@mu.org> wrote:
>>> On 12/26/12 8:21 PM, Peter Wemm wrote:
>>>> On Wed, Dec 26, 2012 at 8:00 PM, Alfred Perlstein <bright@mu.org> wrote:
>>>>
>>>>> What would be the drawbacks?  I don't want to hurt freebsd for heavy
>>>>> performance, but I think this functionality should work out of the box
>>>>> for most people.

It might cost as much as 0.1% performance (more on pre-Pentium Pro x86).
Frame pointer use is very parallelizable so it is often free.

>>>> The drawbacks are mostly performance related.  It defeats certain
>>>> hardware optimizations for call/return on leaf functions.  It'll
>>>> mostly affect things like math, crypto, compression and multimedia
>>>> libraries (that's ffmpeg, bzip2/gzip/libarchive, openssl, etc) but, we
>>>> generally don't seem to care about that sort of performance anyway, so
>>>> what's one more loss?
>>>
>>> Can you clarify some?  If it was somewhat easy to re-add
>>> -fomit-frame-pointer to critical libraries like this, then that would be
>>> OK?
>> No, you can't add MD flags like this.  The way to do it is see things
>> like PIC, WARNS, etc where you can do overrides of defaults on a
>> directory basis, and respect the system-wide user overrides.
>>
>> Remember, -fno-omit-frame-pointer is the default on i386 (except at
>> high -O levels with gcc,

Not except at high -O levels.  gcc -O666 doesn't omit the frame pointer
even for leaf functions.

>> I don't know where clang, the default
>> compiler, draws the line).  Other platforms don't even have frame
>> pointers.  You can't just scatter that switch around the place.

clang is very incompatible.  It omits the frame pointer even for non-leaf
functions, even for -O1.
For example, it does tail-call optimization for
'void bar(void); void foo(void) { bar(); }' and reduces this to a single
jmp instruction, while gcc generates a call to bar plus a return, plus 3
instructions for frame pointer initialization and finalization, plus 1
instruction for its stack alignment pessimization.  This might explain
why debugging is even more broken with clang than with gcc.

Here is a slightly larger program to test the optimization.

% volatile int b;
% volatile int f;
%
% void
% bar(void)
% {
% 	b++;
% }
%
% void
% foo(void)
% {
% 	f++;
% 	bar();
% }
%
% int
% main(void)
% {
% 	int i;
%
% 	for (i = 0; i < 100000000; i++)
% 		foo();
% 	return (0);
% }

I had to put the volatiles in to prevent the function calls being
optimized away (gcc broke its promise not to optimize away loops like
this in gcc-4).  This program takes 0.34 seconds on freefall with
clang -O and 0.65 seconds with gcc -O.  But the faster speed with clang
has nothing to do with -fomit-frame-pointer.  It is because clang
inlines everything, so that the main loop does just 'f++; b++;'.  clang
produces this loop even with -g, so debugging is completely broken with
clang (breakpoints in the functions don't work).  Debugging works
correctly with gcc.

Profiling is even more broken than debugging with clang.  clang
generates calls to .mcount in the places where it inlines functions, but
this cannot work in FreeBSD (and in fact just wastes time to make a
mess), since in FreeBSD .mcount is optimized to not take an explicit arg
identifying the caller, so for the above it always identifies the wrong
caller for foo().

-finstrument-functions seems to be less broken, but doesn't work for
either clang or gcc.  Both clang and gcc generate calls to
__cyg_profile_func_enter/exit() for both actual functions and for
inlined functions.
__cyg_profile_func_enter() corresponds to .mcount, and
__cyg_profile_func_exit() corresponds to FreeBSD (my) .mexitcount
feature (.mexitcount is broken (null) in gcc-4.2 and broken
(nonexistent) in clang), except the __cyg* functions are pessimized to
take 2 explicit args identifying the caller, so they can work; they
don't actually work since they are nonexistent in FreeBSD (except in x86
kernels in old versions, where they were used transiently to work around
breakage of .mexitcount).

After working around these bugs by putting the functions in separate
files (and removing the now-unneeded volatiles):

main.c:
% void foo(void);
%
% int
% main(void)
% {
% 	int i;
%
% 	for (i = 0; i < 100000000; i++)
% 		foo();
% }

foo.c:
% void bar(void);
%
% void
% foo(void)
% {
% 	bar();
% }

bar.c:
% void
% bar(void)
% {
% }

we can see how much the frame pointer optimization is saving: this now
takes 0.43 seconds with clang and 0.87 seconds with gcc.  It is weird
that the gcc time increased from 0.65 seconds to 0.87 despite doing
less.  After adding back the volatiles, the times are 0.43 seconds with
clang and 0.85 seconds with gcc -- doing more gave a small optimization,
but didn't recover 0.65 seconds.  There is apparently some magic
alignment or misalignment which costs or saves about the same as
omitting the frame pointer.

Finally, with gcc -O -fomit-frame-pointer, the program takes 0.60
seconds, and with gcc -O2 -fomit-frame-pointer, it takes 0.49 seconds,
and with gcc -O2, it takes 0.49 seconds (this really doesn't omit frame
pointers, so omitting the frame pointer saves nothing).  With
cc -O -fno-omit-frame-pointer, it takes 0.43 seconds, but this case is
just broken -- the -fno-omit-frame-pointer is silently ignored :-(.

% > Agreed!  It seems that -fno-omit-frame-pointer documentation is a bit
% > strange, the manual page indicates:
% >> -O also turns on -fomit-frame-pointer on machines where doing so
% >> does not interfere with debugging.
% > Then goes on to specify that under the actual option that it's turned on
% > under -O, -O2, -O3, etc.

The latter is just wrong for i386 (see above).  The former may be
correct and differ for amd64 because amd64 has better debugging info and
thus can afford to omit the frame pointer more often.  However, I've
seen anomalies for debugging.  I forget the details, but remember that
one of i386 and amd64 worked better for debugging libm.

Another floating point debugging strangeness is that gdb understands XMM
registers better on i386 than on amd64!  To see this, try
'gdb /bin/cat'.  Run the program and stop it with ^C.  Then p $xmm0
shows deficient info for amd64.  But amd64 actually uses XMM registers
for floating point (clang with certain -march bogusly uses them on i386
too).  Thus displaying of XMM is broken where it is most needed.

>>>> Of course it wouldn't be required with dwarf unwinding awareness, but
>>>> we don't have that.

Perhaps the clang optimizations depend on this.

>>>> We have -fno-omit-frame-pointer on the amd64 kernel whenever debugging
>>>> is compiled in because there's no unwinder for doing stack traces.  We

Hmm, I didn't notice that.  It is also done unconditionally in kmod.mk.
It is also done conditionally for powerpc kernels and unconditionally
for powerpc kmods.

-fno-inline-functions-called-once should be done under the same
conditions, to unbreak the stack trace for such functions.
Unfortunately, this only works for gcc.  For gcc, it is only needed for
static functions.  The above shows that it is needed even more for
clang, since clang inlines non-static functions in the same file
(perhaps if they are called more than once?).  But
-fno-inline-functions-called-once is broken (unsupported, and a warning)
for clang.

>>>> need a dwarf2+ unwinder and somebody to instrument the call frame
>>>> state through the remaining assembler code.

I wouldn't want it for ddb.  ddb doesn't have access to any debug info
except the symbol table.
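
As a minimal sketch of why a stack trace without an unwinder depends on
the frame pointer: assuming the conventional x86 frame layout (saved
frame pointer followed by the return address), a tracer only has to
follow a linked list and needs no debug info at all.  The names
struct frame and walk_frames() are hypothetical, and the function is
meant to be fed a chain of frames rather than being any existing ddb or
stack(9) interface.

```c
#include <stddef.h>

/*
 * Assumed layout of one stack frame when the frame pointer is kept:
 * the saved caller frame pointer, then the return address.  This is
 * the conventional x86 arrangement; the struct name is hypothetical.
 */
struct frame {
	struct frame	*fp;	/* caller's frame, NULL at the outermost one */
	void		*ret;	/* return address into the caller */
};

/*
 * Follow the chain of saved frame pointers, storing up to max return
 * addresses in pcs[], innermost first.  Returns the number stored.
 * If a function omits its frame pointer, its frame is simply absent
 * from the chain and the trace skips it.
 */
int
walk_frames(const struct frame *fp, void **pcs, int max)
{
	int n;

	for (n = 0; fp != NULL && n < max; fp = fp->fp)
		pcs[n++] = fp->ret;
	return (n);
}
```

The return addresses collected this way can then be resolved against
the symbol table alone, which is all ddb has.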

>>>>
>>> How much work is that exactly?  I've only been a gdb user, not a hacker.
>> gdb has a stack unwinder.  kdb/ddb/stack(9) do not.  There's well
>> established GPL code to do it, as well as libunwind and variants.
>> Basically what this code has to do is run the dwarf2+ state machine to
>> find all the call/return frames instead of assuming the compiler did
>> it.  Heck, even glibc has a dwarf2 unwinder built into it as part of
>> their exception processing system.
>>
>> I'm not entirely sure what more work src/lib/libelf and
>> src/lib/libdwarf need.  It looks like it's got just enough implemented
>> to support the ctfconvert etc and doesn't have an unwinder in it.
>>
> This really seems beyond my skill level / time allotment.  Let's see where
> the numbers put us in terms of system performance and then we can make a call
> on it.
>
> I'd rather take a few % of perf for the power of dtrace, but not if that % is
> double digits.

Since -fno-omit-frame-pointer is broken (silently ignored) for clang,
using it won't make any difference.  The only ways I could find to get
frame pointers with clang were -pg (profiling) and
-finstrument-functions.

Bruce
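
For reference, a minimal sketch of what the two -finstrument-functions
hooks could look like if a program supplied them itself.  The
two-argument signature (function address, call site) is the one the
compilers generate calls for; the counters and hook_depth() are
hypothetical names, and a real profiler would record the addresses
instead of just counting.  The no_instrument_function attribute keeps
the hooks from being instrumented themselves.

```c
/* Hypothetical counters; a real profiler would record the addresses. */
static unsigned long hook_enters;
static unsigned long hook_exits;

void	__cyg_profile_func_enter(void *, void *)
	    __attribute__((no_instrument_function));
void	__cyg_profile_func_exit(void *, void *)
	    __attribute__((no_instrument_function));
long	hook_depth(void) __attribute__((no_instrument_function));

/* Called on entry to every instrumented function. */
void
__cyg_profile_func_enter(void *this_fn, void *call_site)
{
	(void)this_fn;		/* address of the function entered */
	(void)call_site;	/* address it was called from */
	hook_enters++;
}

/* Called just before every instrumented function returns. */
void
__cyg_profile_func_exit(void *this_fn, void *call_site)
{
	(void)this_fn;
	(void)call_site;
	hook_exits++;
}

/* Entries seen so far minus exits seen so far. */
long
hook_depth(void)
{
	return ((long)(hook_enters - hook_exits));
}
```

Linking this into a program built with -finstrument-functions makes
each instrumented function call the hooks on entry and exit.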