Date:      Sun, 22 Nov 2015 17:44:46 +0100
From:      Jilles Tjoelker <jilles@stack.nl>
To:        Mark Johnston <markj@FreeBSD.org>
Cc:        freebsd-arch@FreeBSD.org
Subject:   Re: zero-cost SDT probes
Message-ID:  <20151122164446.GA22980@stack.nl>
In-Reply-To: <20151122024542.GA44664@wkstn-mjohnston.west.isilon.com>
References:  <20151122024542.GA44664@wkstn-mjohnston.west.isilon.com>

On Sat, Nov 21, 2015 at 06:45:42PM -0800, Mark Johnston wrote:
> For the past while I've been experimenting with various ways to
> implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
> probe site expands to this:

> if (func_ptr != NULL)
> 	func_ptr(<probe args>);

> When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
> it's NULL. With zero-cost probes, the SDT_PROBE macros expand to

> func(<probe args>);

> When the kernel is running, each probe site has been overwritten with
> NOPs. When a probe is enabled, one of the NOPs is overwritten with a
> breakpoint, and the handler uses the PC to figure out which probe fired.
> This approach has the benefit of incurring less overhead when the probe
> is not enabled; it's more complicated to implement though, which is why
> this hasn't already been done.

> I have a working implementation of this for amd64 and i386[1]. Before
> adding support for the other arches, I'd like to get some idea as to
> whether the approach described below is sound and acceptable.

I have not run any benchmarks, but I expect that this removes only a
small part of the overhead of disabled probes. Saving and restoring
caller-save registers and setting up parameters still increases code
size and I-cache use, whether or not the call itself is replaced by
NOPs. By contrast, a branch that is always or never taken will
generally cost at most 2 cycles.
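
A rough user-space sketch of the two expansions being compared may help
here. All names (probe_hook, probe_stub, work_with_branch and so on) are
made up for illustration and are not taken from the patch, and nothing
is actually patched at run time: probe_stub() merely stands in for the
call that the kernel would overwrite with NOPs.

```c
#include <stddef.h>

/* Hypothetical names throughout; none of these come from the patch. */
typedef void (*probe_fn)(int arg);

probe_fn probe_hook = NULL;     /* NULL while the probe is disabled */
int probe_hits = 0;

/* Stand-in for dtrace_probe(): just counts how often it was called. */
void record_probe(int arg)
{
    (void)arg;
    probe_hits++;
}

/* Current scheme: load a pointer, test it, call only when enabled. */
int work_with_branch(int x)
{
    if (probe_hook != NULL)
        probe_hook(x);
    return x * 2;
}

/* Stand-in for the call target of the proposed scheme. */
void probe_stub(int arg)
{
    (void)arg;
}

/*
 * Proposed scheme: an unconditional call site. In the kernel the call
 * instruction would be replaced by NOPs, but the compiler has already
 * emitted the argument setup and register saves around it.
 */
int work_with_call(int x)
{
    probe_stub(x);
    return x * 2;
}
```

Comparing the compiler's assembly output for the two functions (for
example with cc -O2 -S) shows the code that surrounds the call site in
each case, which is the part that NOP patching does not remove.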

Avoiding this overhead would require generating, instead of an ABI
function call, a point at which the probe parameters can be computed
from the registers and stack frame (much as a debugger prints local
variables, but with a guarantee that "optimized out" will not happen).
This requires compiler changes, though, and DTrace has generally not
used DWARF-like debug information.

For a fairer comparison, the five single-byte NOPs should be replaced by
one or two longer NOPs, since many CPUs decode at most 3 or 4
instructions per cycle. Some examples of longer NOPs are in
X86AsmBackend::writeNopData() in
contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp. The two-byte
NOP 0x66, 0x90 works on any x86 CPU.
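
As a minimal sketch of the idea, the helper below fills a patch site
with as few NOP instructions as possible. The function name fill_nops
and the have_long_nop flag are assumptions for illustration and are not
part of the patch or of LLVM. The byte sequences are the commonly
documented x86 multi-byte NOP encodings; the 0x0F 0x1F forms are not
available on every x86 CPU, so the fallback uses only 0x66 0x90 and
0x90.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multi-byte NOP encodings, indexed by (length - 1). */
static const uint8_t long_nops[5][5] = {
    { 0x90 },                           /* nop */
    { 0x66, 0x90 },                     /* 66 nop */
    { 0x0F, 0x1F, 0x00 },               /* nopl (%eax) */
    { 0x0F, 0x1F, 0x40, 0x00 },         /* nopl 0(%eax) */
    { 0x0F, 0x1F, 0x44, 0x00, 0x00 },   /* nopl 0(%eax,%eax,1) */
};

/*
 * Fill buf[0..len) with NOPs, using the longest encoding allowed at
 * each step. Returns the number of instructions written, which is the
 * number of decode slots the padding will occupy.
 */
size_t fill_nops(uint8_t *buf, size_t len, int have_long_nop)
{
    size_t max = have_long_nop ? 5 : 2; /* longest usable encoding */
    size_t off = 0;
    size_t count = 0;

    assert(buf != NULL || len == 0);
    while (off < len) {
        size_t n = len - off;

        if (n > max)
            n = max;
        memcpy(buf + off, long_nops[n - 1], n);
        off += n;
        count++;
    }
    return count;
}
```

For a five-byte call site this writes a single instruction when the
long forms are allowed and three instructions (0x66 0x90, 0x66 0x90,
0x90) otherwise, compared with five for plain 0x90 padding.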

-- 
Jilles Tjoelker


