From owner-freebsd-arch@freebsd.org Mon Nov 23 04:14:49 2015
Date: Mon, 23 Nov 2015 14:48:14 +1100 (EST)
From: Bruce Evans
To: Mark Johnston
cc: Jilles Tjoelker, freebsd-arch@freebsd.org
Subject: Re: zero-cost SDT probes
In-Reply-To: <20151122235903.GA5647@raichu>
Message-ID: <20151123113932.C906@besplex.bde.org>
References: <20151122024542.GA44664@wkstn-mjohnston.west.isilon.com> <20151122164446.GA22980@stack.nl> <20151122235903.GA5647@raichu>

On Sun, 22 Nov 2015, Mark Johnston wrote:

> On Sun, Nov 22, 2015 at 05:44:46PM +0100, Jilles Tjoelker wrote:
>> On Sat, Nov 21, 2015 at 06:45:42PM -0800, Mark Johnston wrote:
>>> For the past while I've been experimenting with various ways to
>>> implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
>>> probe site expands to this:
>>
>>> 	if (func_ptr != NULL)
>>> 		func_ptr();
>>
>>> When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
>>> it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
>>
>>> 	func();
>>
>>> When the kernel is running, each probe site has been overwritten with
>>> NOPs. When a probe is enabled, one of the NOPs is overwritten with a
>>> breakpoint, and the handler uses the PC to figure out which probe fired.
>>> This approach has the benefit of incurring less overhead when the probe
>>> is not enabled; it's more complicated to implement though, which is why
>>> this hasn't already been done.
>>
>>> I have a working implementation of this for amd64 and i386[1]. Before
>>> adding support for the other arches, I'd like to get some idea as to
>>> whether the approach described below is sound and acceptable.
>>
>> I have not run any benchmarks but I expect that this removes only a
>> small part of the overhead of disabled probes. Saving and restoring
>> caller-save registers and setting up parameters certainly increases code
>> size and I-cache use. On the other hand, a branch that is always or
>> never taken will generally cost at most 2 cycles.
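To make the comparison concrete, here is a minimal C sketch of the two
probe-site shapes described above.  The names (SDT_PROBE_OLD,
SDT_PROBE_ZC, sdt_probe_func, sdt_zc_site) are simplifications for
illustration, not the real sys/sdt.h definitions.

	/* Current style: an indirect call guarded by a NULL check. */
	extern void (*sdt_probe_func)(unsigned id, void *arg0);

	#define	SDT_PROBE_OLD(id, arg0) do {				\
		if (sdt_probe_func != NULL)				\
			sdt_probe_func((id), (arg0));			\
	} while (0)

	/*
	 * Proposed style: an unconditional call.  At boot the kernel
	 * overwrites each such call site with NOPs; enabling the probe
	 * replaces one NOP with a breakpoint, and the trap handler maps
	 * the faulting PC back to the probe.
	 */
	void sdt_zc_site(unsigned id, void *arg0);

	#define	SDT_PROBE_ZC(id, arg0)					\
		sdt_zc_site((id), (arg0))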
High resolution kernel profiling (which adds a function call and return
to every function without engrotting the source code with explicit
calls) has surprisingly little overhead when not in use.  This depends
partly on it not using -finstrument-functions.  -finstrument-functions
produces bloated code to pass 2 args (function pointer and frame
pointer) in a portable way.

(Unfortunately, -mprofiler-epilogue was broken (turned into a no-op) in
gcc-4.2.1 and is more broken (unsupported) in clang.  It is
regression-tested by configuring LINT with -pp, but the test is broken
for gcc-4.2.1 by accepting -mprofiler-epilogue without actually
supporting it, and for clang by ifdefing the addition of
-mprofiler-epilogue.  I normally don't notice the bug since I use old
versions of FreeBSD with a working gcc (3.3.3), and recently fixed
4.2.1.  The fix is fragile, and I just noticed that it doesn't work for
functions returning a value that don't have an explicit return
statement (this is permitted for main()).  Old versions of FreeBSD used
-finstrument-functions during a previous round of compiler breakage.)

-pg and -mprofiler-epilogue produce calls to .mcount and .mexitcount
for every function (and most trap handlers and some jumps).  These
functions execute "cmpl $ENABLE,enable_flag; jne done; ... done: ret $0"
when the feature is not enabled.  The branch is easy to predict since
there are only 2 instances of it.  Scattered instances in callers might
bust the branch target cache.  So might the scattered calls to .mcount
and .mexitcount.  Calls are similar to unconditional branches so they
are easy to predict, but I think old x86's (ones not too old to have a
branch target cache) don't handle them well.

High resolution kernel profiling doesn't support SMP, except in my
version where it is too slow to use with SMP (due to enormous lock
contention).  Low resolution kernel profiling has bad locking for the
SMP case.  The bad locking first gives the enormous lock contention
even when profiling is not enabled, since it is done before checking
the flag.  It also gives deadlock in ddb and some other traps.  In the
non-SMP case, similar wrapping before checking the flag makes the
low-res case a smaller pessimization.

A common function checking the flags wouldn't work so well if func_ptr
or the args depend on the call site.  But if you can figure out
everything from the call site address then it could work similarly to
the breakpoint instruction.

Kernel profiling could use similar nop/breakpoint methods.  It is much
easier to patch since there are only 2 functions and 2 args.  It
already determines the args from the frame.  This is the main
difference between it and -finstrument-functions.

> I've done some microbenchmarks using the lockstat probes on a Xeon
> E5-2630 with SMT disabled. They just read the TSC and acquire/release a
> lock in a loop, so there's no contention. In general I see at most a small
> difference between the old and new SDT implementations and a kernel with
> KDTRACE_HOOKS off altogether. For example, in my test a mtx lock/unlock
> pair takes 52 cycles on average without probes; with probes, it's 54
> cycles with both SDT implementations. rw read locks are 77 cycles
> without probes, 79 with. rw write locks and sx exclusive locks don't
> appear to show any differences, and sx shared locks show the same
> timings without KDTRACE_HOOKS and with the new SDT implementation; the
> current implementation adds a cycle per acquire/release pair.
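The loop described above is presumably something like the following
sketch (a reconstruction for illustration, not the actual test harness;
how it is invoked -- sysctl, SYSINIT, or otherwise -- is omitted, and
rdtsc() is the x86 accessor from <machine/cpufunc.h>):

	/* Time a mtx lock/unlock pair with the TSC; no contention. */
	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/lock.h>
	#include <sys/mutex.h>
	#include <machine/cpufunc.h>

	static struct mtx bench_mtx;

	static void
	bench_mtx_pair(void)
	{
		uint64_t start, stop;
		int i;

		mtx_init(&bench_mtx, "benchmtx", NULL, MTX_DEF);
		start = rdtsc();
		for (i = 0; i < 1000000; i++) {
			mtx_lock(&bench_mtx);
			mtx_unlock(&bench_mtx);
		}
		stop = rdtsc();
		mtx_destroy(&bench_mtx);
		printf("mtx lock/unlock pair: %ju cycles on average\n",
		    (uintmax_t)((stop - start) / 1000000));
	}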
When high resolution profiling is enabled, it takes similar times.  It
also uses the TSC (directly and not slowed down by synchronization) and
this use is a major difference between it and low resolution profiling.

Overheads on haswell: UP: 60 cycles for .mcount and 40 cycles for
.mexitcount; SMP: 88 cycles for .mcount and 60 cycles for .mexitcount.
IIRC, rdtsc takes 24 cycles on haswell -- more than half of the UP
.mexitcount time; synchronizing it takes another 12-24 cycles; it was
much slower on older Intel x86 (65 cycles on freefall with its previous
CPU) and much faster on older amd x86 (9 cycles on Athlon XP).  This
slowness makes it not very useful to optimize other things in anything
that uses the TSC.

I can believe a mere 2 cycle difference between the highly optimized
version and the simple version.  It seems hardly worth the effort to
optimize.  Arg pushing and tests and/or calls are fast, especially when
they don't do anything.  It might be possible to optimize them better
to reduce dependencies, so that they take zero time if they can be
executed in parallel.

The 100-200 extra cycles per function given by enabled kernel profiling
make a difference of about 100% (twice as slow).  With giant locking,
this scales with the number of CPUs in the kernel: about 800% (8-9
times slower) with 8 such CPUs.

> None of this takes into account the cache effects of these probes. One
> advantage of the proposed implementation is that we eliminate the data
> access required to test if the probe is enabled in the first place. I'm

Accesses to the same variable cost little.

> also a bit uncertain about the I-cache impact. My understanding is that
> a fetch of an instruction will load the entire cache line containing
> that instruction. So unless the argument-marshalling instructions for a
> probe site span at least one cache line, won't they all be loaded
> anyway?

This can be moved out of the way.  I don't like __predict_*(), but one
thing it can do right is this.  The generated code for an inline flags
test should look something like:

	cmpl	$ENABLE,enable_flag
	je	slow_path
back:
	# main path here
	ret
slow_path:
	cmpl	$NULL,func_ptr
	je	back
	# load func_ptr and args
	# ...
	jmp	back

>>> 	if (func_ptr != NULL)
>>> 		func_ptr();

With everything non-inline, arg loading is kept out of the way without
really trying:

	call	decoder
	# main path here
	ret

The call can be replaced by nops or a trap instruction + nops (or start
like that).  'decoder' starts with the flags test.  Loading args is
difficult if they are local in the caller's frame.  Oops, if they were
in registers, then they would have to be saved in the frame for the
call to decoder, and that is almost as slow as passing them.  So the
inline flags method may even be better than the no-op method -- it
allows the compiler to do the register saving only in the slow path.
In fact, I think you can do everything with this method and no complex
decoding:

	# The following is identical to the above except for this comment.
	# To optimise this, replace the next 2 instructions by nops or
	# a trap instruction and nops.  This gives only a time optimization.
	# Not large, but easy to do.  When not enabled, space is wasted
	# far away where it doesn't mess up the cache.
	cmpl	$ENABLE,enable_flag
	je	slow_path
back:
	# main path here
	ret
slow_path:
	cmpl	$NULL,func_ptr
	je	back
	# load func_ptr and args
	# ...
	jmp	back

High resolution profiling deals with this partially as follows:
- calling .mcount and .mexitcount tends to require more saving than
  necessary.  gcc is not smart about placing these calls in the best
  position.  The main thing done wrong is frame pointer handling.
- but it is arranged that .mexitcount is called without saving the
  return register(s).  The callee preserves them if necessary (only
  necessary if profiling is enabled).
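In C, the inline flags test above can be given that shape by marking
the enable check as unlikely, so the compiler moves the argument set-up
and the call into a cold out-of-line block.  A sketch follows; the
names sdt_enabled, sdt_func_ptr and SDT_PROBE_INLINE are illustrative
only, while __predict_false() is the existing sys/cdefs.h wrapper
around __builtin_expect():

	#include <sys/cdefs.h>

	extern int sdt_enabled;
	extern void (*sdt_func_ptr)(unsigned id, void *arg0);

	#define	SDT_PROBE_INLINE(id, arg0) do {				\
		if (__predict_false(sdt_enabled)) {			\
			/* Slow path: compiled out of line. */		\
			if (sdt_func_ptr != NULL)			\
				sdt_func_ptr((id), (arg0));		\
		}							\
	} while (0)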
> Consider the disassemblies for __mtx_lock_flags() here:
> https://people.freebsd.org/~markj/__mtx_lock_flags_disas.txt
> Based on what I said above and assuming a 64-byte cache line size, I'd
> expect all instructions between 0xffffffff806d1328 and 0xffffffff806d134e
> to be loaded regardless of whether or not the branch is taken. Is that not
> the case?

They are large and ugly already.  It probably doesn't matter for them.
Lots of instructions can be loaded (and executed speculatively) during
the slow locked instruction.  Perhaps during the many earlier
instructions.  The instructions at the branch target are more
important.  I think the previous branch is usually taken (for no
contention) and the branch at ...1328 is unimportant.  Both branches
are into another cache line and there is no padding to align the branch
targets because such padding would not be useful for the target arch
(or the optimizations are not complete).  I think it is only important
that enough instructions in the branch target are in the same cache
line as the target.  That usually happens accidentally when the target
is the function epilogue.

> I'll also add that with this change the size of the kernel text shrinks
> a fair bit: from 8425096 bytes to 7983496 bytes with a custom MINIMAL-like
> kernel with lock inlining.

Kernel profiling also bloats the kernel a lot.  I was happy to get back
to the lesser bloat given by .mcount/.mexitcount instead of
__cyg_profile_name_too_long_to_remember().  But the text space needed
is small compared with the data space which is allocated at runtime
(the allocation is stupid and allocates space even if profiling is
never used).

> Finally, I should have noted in my first post that this work has other
> motivations beyond possible performance improvements. In particular,
> recording call sites allows us to finally fill in the function component
> of SDT probes automatically. For example, with this work it becomes
> possible to enable the udp:::receive probe in udp6_receive(), but not
> the one in udp_receive().

Patching inline "cmpl $ENABLE,enable_flag" also allows this very
easily, unless you want to vary the args for a single call site.  Yet
another variation, which is easier to implement since it only requires
modifying data (a C sketch of the decoder is given further below):

	call	decoder
	testl	%eax,%eax
	jne	slow_path
	# rest as above

'decoder' now just checks a table of enable flags indexed by the call
site and returns true/false.  All the args passing remains inline (but
far away).  (First check a global enable flag to optimize for the
non-enabled case).

>> Avoiding this overhead would require not generating an ABI function call
>> but a point where the probe parameters can be calculated from the
>> registers and stack frame (like how a debugger prints local variables,
>> but with a guarantee that "optimized out" will not happen). This
>> requires compiler changes, though, and DTrace has generally not used
>> DWARF-like debug information.

I use gcc -O1 -fno-inline-functions-called-once, and usually i386 and
-march=i386, to prevent bogus optimizations.  Maximal compiler
optimizations rarely gain as much as 1% in kernel code, even in
micro-benchmarks.  Compilers like to produce large code that gives
negative optimizations except in loops, and kernels don't have many
loops.  The above flags make ddb and stack traces almost correct on
i386 for public functions.  Static functions are often called with args
in registers even on i386.  To debug these, or to make them available
using "call", they must be declared with a regparm attribute (cdecl
only does enough for public functions where it is the default anyway).
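Here is the promised sketch of 'decoder' in C: a common routine that
maps its caller's return address to a per-site enable flag, which is
also enough to distinguish the udp6_receive() and udp_receive() sites
in the example above.  All names (sdt_site, sdt_sites, sdt_nsites,
sdt_global_enable, sdt_decoder) are illustrative, and a real
implementation would want a better lookup than a linear search:

	#include <sys/types.h>

	struct sdt_site {
		uintptr_t site_addr;	/* return address at the probe site */
		int	site_enabled;	/* set when this probe is enabled */
	};

	extern struct sdt_site	sdt_sites[];
	extern int		sdt_nsites;
	extern int		sdt_global_enable;

	int
	sdt_decoder(void)
	{
		uintptr_t caller;
		int i;

		/* Optimize for the usual case: nothing enabled at all. */
		if (!sdt_global_enable)
			return (0);

		caller = (uintptr_t)__builtin_return_address(0);
		for (i = 0; i < sdt_nsites; i++)
			if (sdt_sites[i].site_addr == caller)
				return (sdt_sites[i].site_enabled);
		return (0);
	}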
>> For a fairer comparison, the five NOPs should be changed to one or two
>> longer NOPs, since many CPUs decode at most 3 or 4 instructions per
>> cycle. Some examples of longer NOPs are in
>> contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
>> X86AsmBackend::writeNopData(). The two-byte NOP 0x66, 0x90 works on any
>> x86 CPU.
>
> I'll try that, thanks. On amd64 at least, I think we'd have to use two
> NOPs: a single-byte NOP that can be overwritten when the probe is
> enabled, and then a four-byte NOP.

I think some arches need more than 1 byte for a trap/breakpoint
instruction.

I just remembered what the "ret $0" in .mcount does.  It is because
plain ret doesn't work so well at a branch target, even when the target
is aligned.  IIRC, "nop; ret" is just as good a fix on the arches where
plain ret is slower, but "ret $0" is better on older in-order arches
where nop is not so special.

Extra nops or larger nop instructions should also be sprinkled for
alignment.  It is too hard to know the best number to use in asm code.
Compilers generate magic amounts depending on the arch and how far away
the instruction pointer is from an alignment boundary (so as to not
align if too far away).  I don't trust compilers to get this right
either, even if the arch matches exactly.  In the too-far-away cases,
it is likely that a small change nearby moves closer and thus gives
faster code.

Bruce