From: John Baldwin <jhb@freebsd.org>
To: freebsd-hackers@freebsd.org
Cc: Linda Messerschmidt
Subject: Re: Intermittent system hangs on 7.2-RELEASE-p1
Date: Fri, 11 Sep 2009 11:02:14 -0400
Message-Id: <200909111102.14503.jhb@freebsd.org>
In-Reply-To: <237c27100909101834g49438707l96fa58df5f717945@mail.gmail.com>

On Thursday 10 September 2009 9:34:30 pm Linda Messerschmidt wrote:
> Just to follow up, I've been doing some testing with masking for
> KTR_LOCK rather than KTR_SCHED.
>
> I'm having trouble with this because I have the KTR buffer size set to
> 1048576 entries, and with only KTR_LOCK enabled, this isn't enough for
> even a full second of tracing; the sample I'm working with now is just
> under 0.9s. It's an average of one entry every 2001 TSC ticks. That
> *seems* like a lot of locking activity, but some of the lock points
> are only a couple of lines apart, so maybe it's just incredibly
> verbose.
>
> Since it's so much data and I'm still working on a way to correlate it
> (lockgraph.py?), all I've got so far is a list of what trace points
> are coming up the most:
>
> 51927 src/sys/kern/kern_lock.c:215 (_lockmgr UNLOCK mtx_unlock() when flags & LK_INTERLOCK)
> 48033 src/sys/kern/vfs_subr.c:2284 (vdropl UNLOCK)
> 41548 src/sys/kern/vfs_subr.c:2187 (vput VI_LOCK)
> 29359 src/sys/kern/vfs_subr.c:2067 (vget VI_LOCK)
> 29358 src/sys/kern/vfs_subr.c:2079 (vget VI_UNLOCK)
> 23799 src/sys/nfsclient/nfs_subs.c:755 (nfs_getattrcache mtx_lock)
> 23460 src/sys/nfsclient/nfs_vnops.c:645 (nfs_getattr mtx_unlock)
> 23460 src/sys/nfsclient/nfs_vnops.c:642 (nfs_getattr mtx_lock)
> 23460 src/sys/nfsclient/nfs_subs.c:815 (nfs_getattrcache mtx_unlock)
> 23138 src/sys/kern/vfs_cache.c:345 (cache_lookup CACHE_LOCK)
>
> Unfortunately, it kind of sounds like I'm on my way to answering "why
> is this system slow?" even though it really isn't slow. (And I rush
> to point out that the Apache process in question doesn't at any point
> in its life touch NFS, though some of the other ones on the machine
> do.)
>
> In order to be the cause of my Apache problem, all this goobering
> around with NFS would have to be relatively infrequent but so intense
> that it shoves everything else out of the way. I'm skeptical, but I'm
> sure one of you guys can offer a more informed opinion.
>
> The only other thing I can think of is maybe all this is running me
> out of something I need (vnodes?) so everybody else blocks until it
> finishes and lets go of whatever finite resource it's using up? But
> that doesn't make a ton of sense either, because why would a lack of
> vnodes cause stalls in accept() or select() in unrelated processes?
>
> Not sure if I'm going in the right direction here or not.

Try turning off KTR_LOCK for spin mutexes (just force LO_QUIET on in
mtx_init() if MTX_SPIN is set) and use the schedgraph.py from the latest
RELENG_7. It knows how to parse KTR_LOCK events and draws event "bars"
for locks showing when they are held. A more recent schedgraph.py might
also fix the bugs you were seeing with the idle threads appearing too
long (esp. at the start and end of graphs).

-- 
John Baldwin
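
For reference, a minimal sketch of the mtx_init() tweak suggested above,
assuming the 7.x sys/kern/kern_mutex.c layout where mtx_init() translates
its opts argument into lock_object flags (the surrounding lines are
paraphrased rather than copied, so treat this as an illustration, not a
tested patch):

	void
	mtx_init(struct mtx *m, const char *name, const char *type, int opts)
	{
		int flags;

		/* ... lock class selection and sanity checks unchanged ... */

		flags = 0;
		if (opts & MTX_QUIET)
			flags |= LO_QUIET;
		if (opts & MTX_SPIN)		/* added: quiet every spin mutex */
			flags |= LO_QUIET;

		/* ... remaining opts -> flags translation and lock_init() call ... */
	}

Because LO_QUIET is applied when the lock is initialized, every spin mutex
drops out of the KTR_LOCK stream at the source, while sleep mutexes,
rwlocks, and lockmgr locks keep tracing.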
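
The reason this can be filtered per lock is the gate in sys/sys/lock.h:
each KTR_LOCK trace point is wrapped in a test along these lines
(paraphrased from memory), so a lock whose lock_object carries LO_QUIET
never emits entries:

	/*
	 * Check consulted by the LOCK_LOG_LOCK()/LOCK_LOG_UNLOCK() macros
	 * before they emit a CTR record into the KTR_LOCK buffer.
	 */
	#define	LOCK_LOG_TEST(lo, flags)				\
		(((flags) & LOP_QUIET) == 0 && ((lo)->lo_flags & LO_QUIET) == 0)

That per-lock gate is also why the buffer fills so quickly with KTR_LOCK
alone: every mutex, rwlock, and lockmgr operation that passes the test
consumes one of the 1048576 entries, which at the quoted rate of roughly
one entry every 2001 TSC ticks covers only the observed 0.9 s of history.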