From: Mark Johnston
Date: Sun, 22 Nov 2015 15:59:03 -0800
To: Jilles Tjoelker
Cc: freebsd-arch@FreeBSD.org
Subject: Re: zero-cost SDT probes
Message-ID: <20151122235903.GA5647@raichu>
References: <20151122024542.GA44664@wkstn-mjohnston.west.isilon.com> <20151122164446.GA22980@stack.nl>
In-Reply-To: <20151122164446.GA22980@stack.nl>

On Sun, Nov 22, 2015 at 05:44:46PM +0100, Jilles Tjoelker wrote:
> On Sat, Nov 21, 2015 at 06:45:42PM -0800, Mark Johnston wrote:
> > For the past while I've been experimenting with various ways to
> > implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
> > probe site expands to this:
> >
> > if (func_ptr != NULL)
> >         func_ptr();
> >
> > When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
> > it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
> >
> > func();
> >
> > When the kernel is running, each probe site has been overwritten with
> > NOPs. When a probe is enabled, one of the NOPs is overwritten with a
> > breakpoint, and the handler uses the PC to figure out which probe fired.
> > This approach has the benefit of incurring less overhead when the probe
> > is not enabled; it's more complicated to implement though, which is why
> > this hasn't already been done.
> >
> > I have a working implementation of this for amd64 and i386[1]. Before
> > adding support for the other arches, I'd like to get some idea as to
> > whether the approach described below is sound and acceptable.
>
> I have not run any benchmarks but I expect that this removes only a
> small part of the overhead of disabled probes. Saving and restoring
> caller-save registers and setting up parameters certainly increases code
> size and I-cache use.
> On the other hand, a branch that is always or
> never taken will generally cost at most 2 cycles.

I've done some microbenchmarks using the lockstat probes on a Xeon
E5-2630 with SMT disabled. They just read the TSC and acquire/release a
lock in a loop, so there's no contention. In general I see at most a
small difference between the old and new SDT implementations and a
kernel with KDTRACE_HOOKS off altogether. For example, in my test a mtx
lock/unlock pair takes 52 cycles on average without probes; with probes,
it's 54 cycles with both SDT implementations. rw read locks are 77
cycles without probes, 79 with. rw write locks and sx exclusive locks
don't appear to show any differences, and sx shared locks show the same
timings without KDTRACE_HOOKS and with the new SDT implementation; the
current implementation adds a cycle per acquire/release pair.

None of this takes into account the cache effects of these probes. One
advantage of the proposed implementation is that we eliminate the data
access required to test if the probe is enabled in the first place.

I'm also a bit uncertain about the I-cache impact. My understanding is
that a fetch of an instruction will load the entire cache line
containing that instruction. So unless the argument-marshalling
instructions for a probe site span at least one cache line, won't they
all be loaded anyway? Consider the disassemblies for __mtx_lock_flags()
here:

https://people.freebsd.org/~markj/__mtx_lock_flags_disas.txt

Based on what I said above and assuming a 64-byte cache line size, I'd
expect all instructions between 0xffffffff806d1328 and
0xffffffff806d134e to be loaded regardless of whether or not the branch
is taken. Is that not the case?

I'll also add that with this change the size of the kernel text shrinks
a fair bit: from 8425096 bytes to 7983496 bytes with a custom
MINIMAL-like kernel with lock inlining.
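
As a rough illustration of the cache-line reasoning above, the following sketch computes which 64-byte lines an address range touches. It assumes the simple model described in the message (a fetch loads exactly the line containing the instruction) and a 64-byte line size; the helper names line_base() and lines_spanned() are made up for this example and do not come from the kernel.

```c
#include <stdint.h>

#define CACHE_LINE 64u	/* assumed line size, as in the message */

/* Base address of the cache line containing addr. */
static uint64_t
line_base(uint64_t addr)
{
	return (addr & ~(uint64_t)(CACHE_LINE - 1));
}

/* Number of distinct lines touched by the inclusive range [lo, hi]. */
static uint64_t
lines_spanned(uint64_t lo, uint64_t hi)
{
	return ((line_base(hi) - line_base(lo)) / CACHE_LINE + 1);
}
```

Applied to the two addresses quoted above, line_base() gives 0xffffffff806d1300 and 0xffffffff806d1340, so under this model the range touches two lines rather than one.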

Finally, I should have noted in my first post that this work has other
motivations beyond possible performance improvements. In particular,
recording call sites allows us to finally fill in the function component
of SDT probes automatically. For example, with this work it becomes
possible to enable the udp:::receive probe in udp6_receive(), but not
the one in udp_receive().

Generally, DTrace probes that correspond to a specific instruction are
said to be "anchored"; DTrace implements various bytecode operations
differently depending on whether the probe is anchored, and SDT probes
are expected to be, but with the current implementation they're not. As
a result, some operations, such as stack(), do not work correctly with
SDT probes. r288363 is a workaround for this problem; the change I
proposed is a real solution. This is also a step towards fixing
lockstat(1)'s caller identification when locks are not inlined.

>
> Avoiding this overhead would require not generating an ABI function call
> but a point where the probe parameters can be calculated from the
> registers and stack frame (like how a debugger prints local variables,
> but with a guarantee that "optimized out" will not happen). This
> requires compiler changes, though, and DTrace has generally not used
> DWARF-like debug information.

Integrating DWARF information into libdtrace has been something I've
been slowly working on, with the goal of being able to place probes on
arbitrary instructions instead of just function boundaries. But as you
point out, compiler support is needed for any of this to be reliably
useful for SDT.

>
> For a fairer comparison, the five NOPs should be changed to one or two
> longer NOPs, since many CPUs decode at most 3 or 4 instructions per
> cycle. Some examples of longer NOPs are in
> contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
> X86AsmBackend::writeNopData(). The two-byte NOP 0x66, 0x90 works on any
> x86 CPU.

I'll try that, thanks.
On amd64 at least, I think we'd have to use two NOPs: a single-byte NOP that can be overwritten when the probe is enabled, and then a four-byte NOP.
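
To make the one-byte-plus-four-byte layout concrete, here is a minimal userland sketch that patches a byte buffer standing in for kernel text. It is not the implementation from the patch: struct probe_site and the site_*() helpers are hypothetical names invented for illustration. It relies only on standard x86 encodings (0x90 for the one-byte NOP, 0xcc for int3, 0f 1f 40 00 for a four-byte NOP) and on the fact that the PC saved by a breakpoint trap points one byte past the int3.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SITE_LEN	5	/* size of an x86 "call rel32" instruction */
#define OP_NOP1		0x90	/* one-byte NOP */
#define OP_INT3		0xcc	/* one-byte breakpoint */

/* Four-byte NOP: nopl 0x0(%rax). */
static const uint8_t nop4[4] = { 0x0f, 0x1f, 0x40, 0x00 };

/* Hypothetical record of one probe site; not the kernel's real layout. */
struct probe_site {
	uint8_t	*addr;	/* first byte of the patched call */
	int	 id;	/* illustrative probe identifier */
};

/* Boot-time step: replace the 5-byte call with NOP1 + NOP4. */
static void
site_init(struct probe_site *s)
{
	s->addr[0] = OP_NOP1;
	memcpy(&s->addr[1], nop4, sizeof(nop4));
}

/* Enabling or disabling a probe only ever rewrites the first byte. */
static void
site_enable(struct probe_site *s)
{
	s->addr[0] = OP_INT3;
}

static void
site_disable(struct probe_site *s)
{
	s->addr[0] = OP_NOP1;
}

/* Map a trap PC back to its site: the int3 sits at pc - 1. */
static const struct probe_site *
site_lookup(const struct probe_site *sites, size_t n, const uint8_t *pc)
{
	for (size_t i = 0; i < n; i++)
		if (sites[i].addr == pc - 1)
			return (&sites[i]);
	return (NULL);
}
```

Because only the first byte changes when a probe is toggled, the trailing four-byte NOP is never left half-written. A real kernel would also need to make the text writable and synchronize with other CPUs, which this sketch ignores.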