From owner-freebsd-stable@FreeBSD.ORG  Fri May 14 14:16:49 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id 0F5C91065673;
	Fri, 14 May 2010 14:16:49 +0000 (UTC) (envelope-from TERRY@tmk.com)
Received: from server.tmk.com (server.tmk.com [204.141.35.63])
	by mx1.freebsd.org (Postfix) with ESMTP id D4CB58FC08;
	Fri, 14 May 2010 14:16:48 +0000 (UTC)
Received: from tmk.com by tmk.com (PMDF V6.4 #37010)
	id <01NN3OVWEKKW006UN1@tmk.com>; Fri, 14 May 2010 10:16:46 -0400 (EDT)
Date: Fri, 14 May 2010 09:56:47 -0400 (EDT)
From: Terry Kennedy <TERRY@tmk.com>
In-reply-to: "Your message dated Fri, 14 May 2010 06:29:25 -0700"
	<06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
To: Matthew Fleming <matthew.fleming@isilon.com>
Message-id: <01NN3PQCOFHE006UN1@tmk.com>
MIME-version: 1.0
Content-type: TEXT/PLAIN; charset=iso-8859-1
References: <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org, John Baldwin <jhb@FreeBSD.org>
Subject: RE: Crash dump problem - sleeping thread owns a non-sleepable lock
 during crash dump write
X-BeenThere: freebsd-stable@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Production branch of FreeBSD source code <freebsd-stable.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>, 
	<mailto:freebsd-stable-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-stable>
List-Post: <mailto:freebsd-stable@freebsd.org>
List-Help: <mailto:freebsd-stable-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-stable>,
	<mailto:freebsd-stable-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 14 May 2010 14:16:49 -0000

> > Hmm.  You could try changing the code to not do a nested panic in that
> > case.  You would update subr_turnstile.c to just return if panicstr is
> > not NULL rather than calling panic.  However, there is still a good
> > chance you will end up deadlocking in that case.  I have another patch I
> > can send you next week that prevents blocking on mutexes during a panic
> > which may also help.
>
> It would be instructive to know exactly why we were in turnstile(9), but
> it's likely due to mtx contention.
>
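
  For readers following along, the "just return if panicstr is not NULL"
idea can be modelled in userspace. This is only a sketch under stated
assumptions, not the actual subr_turnstile.c code or the patch mentioned
above: every model_* name is hypothetical, and panicstr here is a plain
variable standing in for the kernel global.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel's panicstr: non-NULL once a panic has started. */
static const char *panicstr = NULL;

/* Counts panics raised while another panic was already in progress. */
static int model_nested_panics = 0;

/* Hypothetical, minimal view of a lock owner. */
struct model_thread {
    int sleeping;          /* 1 if the owner is asleep */
    const char *name;
};

/* Model of panic(): records the first message, counts any nested call. */
static void model_panic(const char *msg)
{
    if (panicstr != NULL)
        model_nested_panics++;
    else
        panicstr = msg;
    fprintf(stderr, "panic: %s\n", msg);
}

/*
 * Model of the owner check made before blocking on a contested lock.
 * Returns 1 if the owner is runnable, 0 if the check tripped.
 */
static int model_check_owner(const struct model_thread *owner)
{
    if (!owner->sleeping)
        return 1;
    if (panicstr != NULL)
        return 0;          /* already panicking: return, do not panic again */
    model_panic("sleeping thread owns a non-sleepable lock");
    return 0;
}
```

  Calling model_check_owner() twice on a sleeping owner records one panic
and no nested one. As noted in the quoted text, the caller can still
deadlock afterwards, since the lock is still held.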

> AIX has some code at the beginning of all the locking operations to avoid 
> taking locks if we were running code out of kdb, though getting that worked 
> out was slightly tricky with our variant of mtx_assert(9).  I seem to recall
> there was also some "lockbusting" code that forcibly reset all owned locks 
> to have no owner, at least in some paths.

> Given that the system is single-cpu and should be single-threaded when 
> dumping, this seems to me to be something worth working through to get 
> more reliable dumps.  Except for mtx_assert(9) I can't think of a reason 
> to take locks once we start dumping or when in the debugger.
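
  The two techniques described above (skipping lock operations while in
the debugger or dumping, and "lockbusting") could look roughly like the
following userspace model. It is a sketch only: the model_* names, the
two flags and the struct layout are all assumptions made for
illustration, and none of it is taken from the AIX or FreeBSD sources.

```c
#include <stddef.h>

/* Hypothetical flags: set when running from the debugger or while dumping. */
static int model_kdb_active = 0;
static int model_dumping = 0;

/* Hypothetical mutex: owner is NULL when the lock is free. */
struct model_mtx {
    const void *owner;
    const char *name;
};

/* Locking is bypassed once the system is effectively single-threaded. */
static int model_skip_locking(void)
{
    return model_kdb_active || model_dumping;
}

/* Returns 1 if the caller may proceed, 0 if it would have to block. */
static int model_mtx_lock(struct model_mtx *m, const void *self)
{
    if (model_skip_locking())
        return 1;          /* proceed without touching lock state */
    if (m->owner != NULL)
        return 0;          /* contested: a real kernel would block here */
    m->owner = self;
    return 1;
}

static void model_mtx_unlock(struct model_mtx *m, const void *self)
{
    if (model_skip_locking())
        return;            /* nothing was taken, so nothing to release */
    if (m->owner == self)
        m->owner = NULL;
}

/* "Lockbusting": forcibly clear the owner of every held lock. */
static size_t model_lockbust(struct model_mtx *locks, size_t n)
{
    size_t busted = 0;

    for (size_t i = 0; i < n; i++) {
        if (locks[i].owner != NULL) {
            locks[i].owner = NULL;
            busted++;
        }
    }
    return busted;
}
```

  The bypass leaves lock state untouched, which keeps it available for
inspection in the dump, while lockbusting discards that state. Which of
the two is preferable depends on how much post-mortem detail is wanted.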

  As an aside, this is a quad-core, single-package CPU (an X3363). On both
this box and a similar one with an X5470, console messages continue to
print after "the system has been halted - press any key to reboot". In
particular, the shutdown makes a bunch of the "behind the scenes"
management devices, like the virtual keyboard and monitor, appear.
Plugging or unplugging USB devices goes through the whole process of
detecting them and making their services available.

  I know the other CPUs are considered to still be running (hence the
"halting other CPUs" when you press a key to reboot), but this is the
first time I've seen device detection, attachment, etc. show up on the
console after a shutdown.

  Is this behavior expected, or is it as surprising to you as it was to
me? The systems are Dell PowerEdge R300s running 8-STABLE amd64.

> As an aside, with terribly corrupted locks I've seen double panics when the 
> attempt to print the lock name faulted in strlen(9) called for printf(9), 
> due to a bad lockname pointer.  We have been able to get enough info off 
> these crashes to debug them, but it's useful to remember that the system 
> may be in a very unstable state depending on why it panics.

  True. In these crashes, the system is running essentially nothing except
the one application (which, unfortunately, I don't have the source code
for). The second crash happened right after booting the system, logging in,
and starting the application. It left a footprint identical to that of the
first crash (other than the 0x10 byte offset due to a recompiled kernel),
where the system had been up for 13+ hours.

  So, in this case I don't think a bunch of corruption piled up and
triggered the fault. Instead, it took just the one simple operation and
right away - splat!

  As I mentioned in the original posting, I'd be glad to give a developer
complete access to the system via the remote console (Dell DRAC 5 web
interface) and to the underlying FreeBSD if it'll help pin down the
problem.

  Another thing I could try (it would take a couple of days until I could
get someone to the site) would be to run this using a bge port instead of
the bce one. That might help pin it down to either something in the
bce-specific code path, or somewhere else in the stack.

	Thanks,
        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA