Date: Tue, 24 Nov 2015 16:11:36 -0800
From: Mark Johnston <markj@FreeBSD.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Cc: freebsd-arch@FreeBSD.org
Subject: Re: zero-cost SDT probes
Message-ID: <20151125001136.GB70878@wkstn-mjohnston.west.isilon.com>
In-Reply-To: <20151123113511.GX58629@kib.kiev.ua>
References: <20151122024542.GA44664@wkstn-mjohnston.west.isilon.com> <20151123113511.GX58629@kib.kiev.ua>

On Mon, Nov 23, 2015 at 01:35:11PM +0200, Konstantin Belousov wrote:
> On Sat, Nov 21, 2015 at 06:45:42PM -0800, Mark Johnston wrote:
> > Hi,
> >
> > For the past while I've been experimenting with various ways to
> > implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
> > probe site expands to this:
> >
> > if (func_ptr != NULL)
> > 	func_ptr(<probe args>);
> >
> > When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
> > it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
> >
> > func(<probe args>);
> >
> > When the kernel is running, each probe site has been overwritten with
> > NOPs. When a probe is enabled, one of the NOPs is overwritten with a
> > breakpoint, and the handler uses the PC to figure out which probe fired.
> > This approach has the benefit of incurring less overhead when the probe
> > is not enabled; it's more complicated to implement though, which is why
> > this hasn't already been done.
> >
> > I have a working implementation of this for amd64 and i386[1]. Before
> > adding support for the other arches, I'd like to get some idea as to
> > whether the approach described below is sound and acceptable.
> >
> > The main difficulty is in figuring out where the probe sites actually
> > are once the kernel is running. In my patch, a probe site is a call to
> > an externally-defined function which is defined in an
> > automatically-generated C file. At link time, we first perform a partial
> > link of all the kernel's object files. Then, a script uses the relocations
> > against the still-undefined probe functions to generate
> > 1) stub functions for the probes, so that the kernel can actually be
> >    linked, and
> > 2) a linker set containing the offsets of each probe site relative to
> >    the beginning of the text section.
> > The result is linked with the partially-linked kernel to generate the
> > final kernel file.
> >
> > During boot, we iterate over the linker set, using the offsets plus the
> > address of btext to overwrite probe sites with NOPs. SDT probes in kernel
> > modules are handled differently (and more simply): the kernel linker just
> > has special handling for relocations against symbols named __dtrace_sdt_*;
> > this is how illumos/Solaris implements all of this.
> >
> > My uncertainty revolves around the use of relocations in the
> > partially-linked kernel to determine the address of probe sites in the
> > running kernel. With the GNU ld in base, this happens to work because
> > the final link doesn't modify the text section. Is this something I can
> > rely upon? Will this assumption be false with the advent of lld and LTO?
> > Are there other, cleaner ways to implement what I described above?
>
> You could consider using a cheap instruction which is conditionally
> converted into the trap, instead. E.g., you could have global page frame
> in KVA allocated, and for the normal operations, keep the page mapped
> with backing by a scratch page. The probe would be a volatile read from
> the page.
>
> When probes are activated, the page is unmapped, which converts the read
> into the page fault. This is similar to the write barriers implemented
> in some garbage collectors.
>
> There are two issues with this scheme:
> - The cost of probe is relatively large, even if the low level trap
>   handler is further modified to recognize the probes by special
>   address access.
> - The arguments passed to the probes should be put into some predefined
>   place, e.g. somewhere in the *curthread, since trap handler cannot fetch
>   them using the ABI conventions.
>
> As I mentioned above, this scheme is used by several implementations of
> the language runtimes, but there gc pauses are rare, and slightly larger
> cost of the even stopping the mutator is justified even by negligible
> cost reduction for normal flow. I am not sure if this approach is worth
> the complications and overhead for probes.

If I understood correctly, each probe site would require a separate page
in KVA to be able to enable and disable individual probes in the manner
that I described in a previous reply. Today, a kernel with lock inlining
has thousands of probe sites; wouldn't the requirement of allocating KVA
for each of them be prohibitive on 32-bit architectures?