Date: Fri, 14 May 2010 06:29:25 -0700
Message-ID: <06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
References: <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>
From: "Matthew Fleming" <matthew.fleming@isilon.com>
To: "John Baldwin" <jhb@FreeBSD.org>,
	"Terry Kennedy" <TERRY@tmk.com>
Cc: freebsd-stable@freebsd.org
Subject: RE: Crash dump problem - sleeping thread owns a non-sleepable lock
	during crash dump write

> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm.  You could try changing the code to not do a nested panic in that
> case.  You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic.  However, there is still a good
> chance you will end up deadlocking in that case.  I have another patch I
> can send you next week that prevents blocking on mutexes during a panic
> which may also help.

It would be instructive to know exactly why we were in turnstile(9), but
it's likely due to mtx contention.
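
To make the suggestion quoted above concrete, here is a rough userland
sketch of "just return if panicstr is not NULL rather than calling panic".
It is not the actual subr_turnstile.c code: toy_thread, toy_panic(),
toy_check_owner() and nested_panics are hypothetical names made up for
illustration, and only panicstr is borrowed from the discussion.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the kernel's panicstr: non-NULL once a panic is under way. */
static const char *panicstr = NULL;

/* Hypothetical counter so the sketch can show that no nested panic occurs. */
static int nested_panics = 0;

/* Hypothetical, much simplified view of a lock owner. */
struct toy_thread {
    int tid;
    int pid;
    int sleeping;   /* non-zero if the owner is asleep */
};

/* Toy panic: records the first message and counts any nested panic. */
static void toy_panic(const char *msg)
{
    if (panicstr != NULL)
        nested_panics++;
    else
        panicstr = msg;
}

/*
 * Returns 1 if the caller may carry on with the lock owner, 0 if it gave up.
 * When a panic is already in progress we return quietly instead of
 * panicking a second time in the middle of the dump.
 */
static int toy_check_owner(const struct toy_thread *owner)
{
    if (!owner->sleeping)
        return 1;
    if (panicstr != NULL)
        return 0;   /* already panicking: do not nest */
    printf("Sleeping thread (tid %d, pid %d) owns a non-sleepable lock\n",
        owner->tid, owner->pid);
    toy_panic("sleeping thread");
    return 0;
}
```

Returning early only avoids the second panic. The caller still has to cope
with a lock it cannot get, which is where the deadlock John mentions could
come from.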

AIX has some code at the beginning of all the locking operations to avoid
taking locks when running code out of kdb, though getting that worked out
was slightly tricky with our variant of mtx_assert(9).  I seem to recall
there was also some "lockbusting" code that forcibly reset all owned locks
to have no owner, at least in some paths.
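
As a rough illustration of both ideas (skipping lock operations while in
the debugger or dumping, and "lockbusting"), here is a small userland
sketch.  None of it is AIX or FreeBSD code: toy_mtx, the
dumping_or_in_debugger flag and all of the toy_* functions are hypothetical
names, and the toy lock ignores contention entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy mutex: owner is a thread id, 0 means unowned. */
struct toy_mtx {
    const char *name;
    int owner;
};

/* Hypothetical global flag: set once we enter the debugger or start a dump. */
static int dumping_or_in_debugger = 0;

/* Lock operations become no-ops once the flag is set. */
static void toy_mtx_lock(struct toy_mtx *m, int tid)
{
    if (dumping_or_in_debugger)
        return;
    assert(m->owner == 0);  /* toy only: no contention handling */
    m->owner = tid;
}

static void toy_mtx_unlock(struct toy_mtx *m, int tid)
{
    if (dumping_or_in_debugger)
        return;
    assert(m->owner == tid);
    m->owner = 0;
}

/*
 * The ownership check has to be relaxed as well, otherwise every
 * assertion fires on locks that were "taken" as no-ops.
 */
static int toy_mtx_owned(const struct toy_mtx *m, int tid)
{
    if (dumping_or_in_debugger)
        return 1;
    return m->owner == tid;
}

/* "Lockbusting": forcibly reset every lock in the table to unowned. */
static void toy_bust_locks(struct toy_mtx *locks, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        locks[i].owner = 0;
}
```

The relaxed toy_mtx_owned() is the part that corresponds to the
mtx_assert(9) trickiness: once lock and unlock are no-ops, any ownership
assertion has to know about the flag too.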

Given that the system is single-cpu and should be single-threaded when
dumping, this seems to me to be something worth working through to get
more reliable dumps.  Except for mtx_assert(9), I can't think of a reason
to take locks once we start dumping or when in the debugger.

As an aside, with terribly corrupted locks I've seen double panics when
the attempt to print the lock name faulted in strlen(9), called from
printf(9), due to a bad lockname pointer.  We have been able to get enough
info off these crashes to debug them, but it's useful to remember that the
system may be in a very unstable state depending on why it panics.

Thanks,
matthew