From: John Baldwin <jhb@freebsd.org>
To: freebsd-hackers@freebsd.org
Cc: Linda Messerschmidt
Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1
Date: Fri, 11 Sep 2009 11:02:14 -0400
Message-Id: <200909111102.14503.jhb@freebsd.org>
In-Reply-To: <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com>

On Thursday 10 September 2009 9:34:30 pm Linda Messerschmidt wrote:
> Just to follow up, I've been doing some testing with masking for
> KTR_LOCK rather than KTR_SCHED.
>
> I'm having trouble with this because I have the KTR buffer size set to
> 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for
> even a full second of tracing; the sample I'm working with now is just
> under 0.9s. It's an average of one entry every 2001 TSC ticks. That
> *seems* like a lot of locking activity, but some of the lock points
> are only a couple of lines apart, so maybe it's just incredibly
> verbose.
>
> Since it's so much data and I'm still working on a way to correlate it
> (lockgraph.py?), all I've got so far is a list of what trace points
> are coming up the most:
>
> 51927 src/sys/kern/kern_lock.c:215 (_lockmgr UNLOCK mtx_unlock() when flags & LK_INTERLOCK)
> 48033 src/sys/kern/vfs_subr.c:2284 (vdropl UNLOCK)
> 41548 src/sys/kern/vfs_subr.c:2187 (vput VI_LOCK)
> 29359 src/sys/kern/vfs_subr.c:2067 (vget VI_LOCK)
> 29358 src/sys/kern/vfs_subr.c:2079 (vget VI_UNLOCK)
> 23799 src/sys/nfsclient/nfs_subs.c:755 (nfs_getattrcache mtx_lock)
> 23460 src/sys/nfsclient/nfs_vnops.c:645 (nfs_getattr mtx_unlock)
> 23460 src/sys/nfsclient/nfs_vnops.c:642 (nfs_getattr mtx_lock)
> 23460 src/sys/nfsclient/nfs_subs.c:815 (nfs_getattrcache mtx_unlock)
> 23138 src/sys/kern/vfs_cache.c:345 (cache_lookup CACHE_LOCK)
>
> Unfortunately, it kind of sounds like I'm on my way to answering "why
> is this system slow?" even though it really isn't slow. (And I rush
> to point out that the Apache process in question doesn't at any point
> in its life touch NFS, though some of the other ones on the machine
> do.)
>
> In order to be the cause of my Apache problem, all this goobering
> around with NFS would have to be relatively infrequent but so intense
> that it shoves everything else out of the way. I'm skeptical, but I'm
> sure one of you guys can offer a more informed opinion.
>
> The only other thing I can think of is maybe all this is running me
> out of something I need (vnodes?) so everybody else blocks until it
> finishes and lets go of whatever finite resource it's using up? But
> that doesn't make a ton of sense either, because why would a lack of
> vnodes cause stalls in accept() or select() in unrelated processes?
>
> Not sure if I'm going in the right direction here or not.

Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in
mtx_init() if MTX_SPIN is set) and use the schedgraph.py from the latest
RELENG_7. It knows how to parse KTR_LOCK events and draws event "bars"
for locks showing when they are held. A more recent schedgraph.py might
also fix the bugs you were seeing with the idle threads appearing too
long (esp. at the start and end of graphs).

-- 
John Baldwin
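
For reference, a minimal sketch of the mtx_init() tweak suggested above,
assuming the 7.x sys/kern/kern_mutex.c layout where mtx_init() translates
its opts argument into lock_object flags (the surrounding lines are
paraphrased rather than copied, so treat this as an illustration, not a
tested patch):

	void
	mtx_init(struct mtx *m, const char *name, const char *type, int opts)
	{
		int flags;

		/* ... lock class selection and sanity checks unchanged ... */

		flags = 0;
		if (opts & MTX_QUIET)
			flags |= LO_QUIET;
		if (opts & MTX_SPIN)		/* added: quiet every spin mutex */
			flags |= LO_QUIET;

		/* ... remaining opts -> flags translation and lock_init() call ... */
	}

Because LO_QUIET is applied when the lock is initialized, every spin mutex
drops out of the KTR_LOCK stream at the source, while sleep mutexes,
rwlocks, and lockmgr locks keep tracing.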
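
The reason this can be filtered per lock is the gate in sys/sys/lock.h:
each KTR_LOCK trace point is wrapped in a test along these lines
(paraphrased from memory), so a lock whose lock_object carries LO_QUIET
never emits entries:

	/*
	 * Check consulted by the LOCK_LOG_LOCK()/LOCK_LOG_UNLOCK() macros
	 * before they emit a CTR record into the KTR_LOCK buffer.
	 */
	#define	LOCK_LOG_TEST(lo, flags)				\
		(((flags) & LOP_QUIET) == 0 && ((lo)->lo_flags & LO_QUIET) == 0)

That per-lock gate is also why the buffer fills so quickly with KTR_LOCK
alone: every mutex, rwlock, and lockmgr operation that passes the test
consumes one of the 1048576 entries, which at the quoted rate of roughly
one entry every 2001 TSC ticks covers only the observed 0.9 s of history.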