Date:      Sat, 18 Jul 2015 17:36:24 -0700
From:      Mark Johnston <markj@freebsd.org>
To:        Bruce Evans <brde@optusnet.com.au>
Cc:        src-committers@freebsd.org, svn-src-all@freebsd.org, svn-src-head@freebsd.org
Subject:   Re: svn commit: r285664 - in head/sys: kern sys
Message-ID:  <20150719003624.GB2808@raichu>
In-Reply-To: <20150718185937.A1301@besplex.bde.org>
References:  <201507180057.t6I0vVhS076519@repo.freebsd.org> <20150718185937.A1301@besplex.bde.org>

On Sat, Jul 18, 2015 at 08:55:07PM +1000, Bruce Evans wrote:
> On Sat, 18 Jul 2015, Mark Johnston wrote:
> 
> > Log:
> >  Pass the lock object to lockstat_nsecs() and return immediately if
> >  LO_NOPROFILE is set. Some timecounter handlers acquire a spin mutex, and
> >  we don't want to recurse if lockstat probes are enabled.
> 
> It is an error to call timecounter code from a low-level place like the
> mutex implementation.  This workaround depends on all locks in timecounter
> handlers being LO_NOPROFILE, and that breaks profiling of these locks.
> The breakage is largest if the locks are used for more than timecounters.
> E.g., the one for i386's obsolescent i8254 timer has to be used for all
> accesses to the i8254.  This lock is configured as MTX_SPIN | MTX_NOPROFILE
> and is perhaps the main one fixed by this commit.

I noticed that lock_profile (which predates lockstat a bit) does the
exact same thing to avoid recursion. Specifically,
lock_profile_obtain_lock_{success,failed} return immediately if
LO_NOPROFILE is set on the target lock. As you pointed out,
this change breaks profiling of timecounter locks, but the only
timecounter implementation in the tree that currently acquires a lock
during a timer read is i8254.
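
For concreteness, the shape of the change is roughly the following.
This is a sketch rather than the committed diff; the real function
lives in sys/sys/lockstat.h and the nanosecond conversion there may
differ:

static inline uint64_t
lockstat_nsecs(struct lock_object *lo)
{
        struct bintime bt;
        uint64_t ns;

        /* Opted-out locks (e.g. the i8254 spin mutex) are not timed. */
        if (lo->lo_flags & LO_NOPROFILE)
                return (0);

        binuptime(&bt);         /* may itself take a timecounter lock */
        ns = bt.sec * (uint64_t)1000000000;
        ns += ((uint64_t)1000000000 * (uint32_t)(bt.frac >> 32)) >> 32;
        return (ns);
}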

The other two locks that set MTX_NOPROFILE are the witness lock and the
i386 icu lock. Lock order checking can usually be done without taking
the witness lock, so contention would be unusual, and it would strike
me as strange to profile locking with witness enabled anyway. :)
I'm not sure why i386's icu_lock has profiling disabled; this was done
in r166001, but the commit log doesn't explain the reason.

It might also be worth noting that lockstat can still be used to observe
acquisitions of LO_NOPROFILE locks - the change prevents measurement of
the time spent contending for such locks.
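
Concretely, the pattern in the contested-acquisition paths is roughly
the following (paraphrased with simplified locals; see kern_mutex.c for
the real code, and note that the surrounding probe macros are elided
here):

        uint64_t spin_time = 0;

        /* Returns 0 immediately for LO_NOPROFILE locks after r285664. */
        spin_time -= lockstat_nsecs(&m->lock_object);
        /* ... spin until the lock is acquired ... */
        spin_time += lockstat_nsecs(&m->lock_object);
        /* The acquire probe still fires; only spin_time is lost. */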

> 
> KTR handles this problem badly in a different way.  It abuses
> get_cyclecount().  The abuse is parametrized by defining KTR_TIME as
> get_cyclecount() and using KTR_TIME.  The definition of KTR_TIME is
> ifdefed but not in a usable way.  This is of course undocumented, and
> KTR_TIME is not a normal kernel option so it is hard to define in
> config files.
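
For readers following along, the parametrization in question is an
#ifndef default in the KTR code.  Sketched from memory, so the details
may be off; see sys/kern/kern_ktr.c:

/*
 * The default timestamp source is get_cyclecount().  Overriding it
 * requires defining KTR_TIME before this point, which a normal
 * "options" line in a kernel config cannot easily do; hence "ifdefed
 * but not in a usable way".
 */
#ifndef KTR_TIME
#define KTR_TIME        get_cyclecount()
#endif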
> 
> KTR needs to be about as careful about this as lockstat since it is
> called from low level mutex code for debugging.  I think there is nothing
> like LO_NOPROFILE to turn it off.  Of course you can turn off the mutex
> code's calls to it using MTX_QUIET or MTX_NOWITNESS, but this breaks much
> more than lock profiling.  (I rarely use any lock debugging, but when
> I did I found it useful to use it without WITNESS_SKIPSPIN and with
> MTX_QUIET and MTX_NOWITNESS turned off for the locks being debugged.
> Some console drivers use spinlocks with MTX_QUIET or MTX_NOWITNESS to
> avoid various problems, and this breaks witnessing of them even for
> non-console uses.  Unbreaking their witnessing uncovered further bugs.)
> 
> Abusing get_cyclecount() in KTR doesn't even fix the problem with
> timecounters, since some implementations of get_cyclecount() (some
> arm, some i386) use the timecounter.
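
To make that concrete: the pre-ARMv6 arm version was, as best I can
reconstruct it (the real code was in sys/arm/include/cpu.h), a shifted
binuptime(), which is also where the 256-second wrap mentioned in the
list below comes from:

static __inline uint64_t
get_cyclecount(void)
{
        struct bintime bt;

        /* Only 8 bits of bt.sec survive the shift: wraps every 256 s. */
        binuptime(&bt);
        return ((uint64_t)bt.sec << 56 | bt.frac >> 8);
}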
> 
> Other bugs with using get_cyclecount():
> - even if it returns cycles counted by a clock, the frequency of this
>    clock is unknown/unobtainable, so there is no way to convert to
>    human-readable units
> - some arches return a shifted binuptime() that wraps very quickly
>    (256 seconds for arm with ARM_ARCH < 6).  Such counters cannot be
>    monotonic.
> - some arches (arm again?) used to use binuptime() with swizzling more
>    complicated than shifting.  Their counters were further from being
>    monotonic.
> - arm is now the only arch that uses binuptime() at a low level.
>    arm64 uses a stub that returns 1.  ARM_ARCH < 6 does the above.  i386
>    returns cpu_ticks().  This is the correct method.  It uses the
>    hardware counter used by cpu_ticks() if possible, else the raw value
>    of the current hardware timecounter, adjusted to avoid wrap.
>      (This is not actually a correct method, since its implementation is
>      no good.  The wrap adjustment is done with no locking.  Also,
>      switching the timecounter hardware breaks the get_cyclecount()
>      use even more than the thread resource usage use.  The latter
>      has incomplete recalibration.)
> - KTR of course has no support for converting to human-readable units.
>    ktrdump prints raw timestamps and their differences in %16ju format.
>    kern_ktr.c seems to print them only in a ddb command.  This prints
>    the raw value in the much worse format %10.10lld.  This uses the
>    long long abomination and has a sign error (ktr_timestamp has type
>    uint64_t).
> - ktr's "timestamps" are thus at best a cookie that increases strictly
>    monotonically
> - get_cyclecount() is always rdtsc() on modern x86.  rdtsc() is not
>    a serializing instruction.  I recently learned how unserial it can
>    be -- hundreds of cycles on freefall.  The hundred-cycle case probably
>    doesn't happen much in KTR's use.  It happens in the test program
>    when the program spins waiting for something and executes rdtsc()
>    after that, or so it thinks.  The CPU actually executes rdtsc()
>    speculatively hundreds of cycles earlier while spinning.  Back to
>    back rdtsc()s seem to be executed in order on freefall, and the
>    speculation seems to extend to only the first one.  So there is
>    a good chance that get_cyclecount() called on the same CPU gives
>    a strictly increasing value.  But across CPUs, the hundred-cycle
>    differences are the usual case if there is no synchronization of
>    the threads (even when the TSCs on the different CPUs are perfectly
>    synchronized at the hardware level).
> - get_cyclecount() doesn't bother with synchronization if it uses the
>    TSC (or equivalent).  This is a feature.  It is supposed to be
>    efficient even if this requires it to return garbage.  But this makes
>    it unusable for almost everything.  Certainly for timestamps.
> - get_cyclecount()'s documentation has rotted to match its misuse.  It
>    is now documented as returning a value that is monotonically
>    increasing inside each CPU.  But it never obviously did that on x86
>    (rdtsc isn't a serializing instruction, and the implementation doesn't
>    even have a compiler barrier), and it doesn't do that on arm < 6
>    (wrap occurs after 256 seconds).  Its documentation does a good job
>    of describing its non-monotonicity across CPUs.  cpu_ticks() also
>    doesn't require any synchronization across CPUs.  This is less clear
>    since it has null documentation.
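
To make the rdtsc ordering point above concrete: the usual way to get
an ordered TSC read on x86 is to pair rdtsc with a fence (or to use
rdtscp).  A sketch of the general technique, not of anything in the
tree:

#include <stdint.h>
#include <x86intrin.h>

/*
 * A plain __rdtsc() may execute speculatively, hundreds of cycles
 * before neighbouring code.  The lfence forces preceding instructions
 * to complete first, giving an ordered (but slower) read.
 */
static inline uint64_t
rdtsc_ordered(void)
{
        _mm_lfence();
        return (__rdtsc());
}

The cost of that fence is exactly what get_cyclecount() is designed to
avoid: it is cheap precisely because it is allowed to return unordered
values.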
> 
>    KTR shouldn't try to synchronize threads to get timestamps that are
>    monotonic across CPUs relative to some shared (non-relativistic) clock,
>    although this would be useful for debugging synchronization, since
>    this would be slow (100-200 cycles per sync on freefall) and would
>    break debugging of synchronization bugs by adding synchronization.
> 
>    Timecounters give timestamps that are almost comparable across CPUs,
>    even when they use the TSC timecounter.  They do this by reading a
>    shared hardware timecounter, perhaps even with a lock that serializes
>    everything, or by reading something like a shared TSC that is not
>    fully serialized.  For the TSC, low-level code gives serialization
>    for each CPU.  The TSCs are hopefully synced at the hardware level,
>    so that when you read them at almost the same physical time on separate
>    CPUs you get almost the same value.  The physical time might be affected
>    relativistically, but CPUs are not yet quite fast enough for that to
>    be a large effect.  The largest effect is probably from pipeline
>    delays.
> 
> Bruce
