From: "Matthew Fleming"
To: "John Baldwin", "Terry Kennedy"
Cc: freebsd-stable@freebsd.org
Date: Fri, 14 May 2010 06:29:25 -0700
Subject: RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Message-ID: <06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
References: <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>

> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm.
> You could try changing the code to not do a nested panic in that
> case. You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic. However, there is still a good
> chance you will end up deadlocking in that case. I have another patch I
> can send you next week that prevents blocking on mutexes during a panic
> which may also help.

It would be instructive to know exactly why we were in turnstile(9), but
it's likely due to mtx contention.

AIX has some code at the beginning of all the locking operations to
avoid taking locks if we were running code out of kdb, though getting
that worked out was slightly tricky with our variant of mtx_assert(9).
I seem to recall there was also some "lockbusting" code that forcibly
reset all owned locks to have no owner, at least in some paths.

Given that the system is single-cpu and should be single-threaded when
dumping, this seems to me to be something worth working through to get
more reliable dumps. Except for mtx_assert(9) I can't think of a reason
to take locks once we start dumping or when in the debugger.

As an aside, with terribly corrupted locks I've seen double panics when
the attempt to print the lock name faulted in strlen(9) called for
printf(9), due to a bad lockname pointer. We have been able to get
enough info off these crashes to debug them, but it's useful to remember
that the system may be in a very unstable state depending on why it
panics.

Thanks,
matthew
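
P.S. To make the idea concrete, here is a minimal userspace sketch of the
two techniques mentioned above: skipping lock acquisition once a panic is
in progress, and "lockbusting". It is only a model under stated
assumptions. Every name in it (model_mtx, model_panicstr, model_mtx_lock,
model_mtx_unlock, model_bust_locks) is hypothetical, and it is neither the
FreeBSD mutex(9) code nor the AIX implementation. It also assumes the
system is already single-threaded, as it should be while dumping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */

#define MODEL_NO_OWNER (-1)

struct model_mtx {
	int owner;	/* thread id of the owner, or MODEL_NO_OWNER */
};

/* Stand-in for the kernel's panicstr: non-NULL once a panic has begun. */
const char *model_panicstr = NULL;

enum model_lock_result {
	MODEL_LOCK_ACQUIRED,	/* lock was free and now belongs to the caller */
	MODEL_LOCK_SKIPPED,	/* panic in progress: proceed without locking */
	MODEL_LOCK_CONTENDED	/* owned by someone else: real code would block */
};

/*
 * Check for a panic before touching the lock at all, so the dump path
 * never reaches the contention (turnstile) code.
 */
enum model_lock_result
model_mtx_lock(struct model_mtx *m, int tid)
{
	if (model_panicstr != NULL)
		return (MODEL_LOCK_SKIPPED);
	if (m->owner != MODEL_NO_OWNER)
		return (MODEL_LOCK_CONTENDED);
	m->owner = tid;
	return (MODEL_LOCK_ACQUIRED);
}

/* During a panic the ownership state is left untouched. */
void
model_mtx_unlock(struct model_mtx *m, int tid)
{
	if (model_panicstr != NULL)
		return;
	assert(m->owner == tid);
	m->owner = MODEL_NO_OWNER;
}

/*
 * "Lockbusting": forcibly reset every owned lock in the array to have
 * no owner. Returns the number of locks that were reset.
 */
size_t
model_bust_locks(struct model_mtx *locks, size_t n)
{
	size_t busted = 0;

	for (size_t i = 0; i < n; i++) {
		if (locks[i].owner != MODEL_NO_OWNER) {
			locks[i].owner = MODEL_NO_OWNER;
			busted++;
		}
	}
	return (busted);
}
```

In this model, a caller on the dump path treats MODEL_LOCK_SKIPPED the
same as MODEL_LOCK_ACQUIRED and carries on. That is only safe because
nothing else is running. Any assertion that the lock is owned would need
the same panic check, which is roughly where the mtx_assert(9) trouble
came from.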