From owner-freebsd-stable@FreeBSD.ORG  Fri May 14 03:37:37 2010
Return-Path: <owner-freebsd-stable@FreeBSD.ORG>
Delivered-To: freebsd-stable@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id A0EAE1065670
	for <freebsd-stable@freebsd.org>; Fri, 14 May 2010 03:37:37 +0000 (UTC)
	(envelope-from TERRY@tmk.com)
Received: from server.tmk.com (server.tmk.com [204.141.35.63])
	by mx1.freebsd.org (Postfix) with ESMTP id 7C7378FC08
	for <freebsd-stable@freebsd.org>; Fri, 14 May 2010 03:37:37 +0000 (UTC)
Received: from tmk.com by tmk.com (PMDF V6.4 #37010)
	id <01NN3295HL34006UN1@tmk.com> for freebsd-stable@freebsd.org; Thu,
	13 May 2010 23:08:25 -0400 (EDT)
Date: Thu, 13 May 2010 23:04:41 -0400 (EDT)
From: Terry Kennedy <TERRY@tmk.com>
To: freebsd-stable@freebsd.org
Message-id: <01NN32EOXMYC006UN1@tmk.com>
MIME-version: 1.0
Content-type: TEXT/PLAIN; CHARSET=us-ascii
Subject: Crash dump problem - sleeping thread owns a non-sleepable lock
 during crash dump write

  I'm reposting this over here at the suggestion of the Forums moderator.
The original post is at http://forums.freebsd.org/showthread.php?t=14163

Got an interesting crash just now (well, as interesting as a crash on a 
soon-to-be production system can be).

This is 8-STABLE/amd64, last cvsup'd early in the morning of May 9th.

The system didn't complete the crash dump, so it needed a manual reset to get 
it going again.

The crash was a "page fault while in kernel mode" with the current process 
being the interrupt service routine for the bce0 GigE. Things progressed 
reasonably until partway through the dump, when the system locked up with a 
"Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the 
same PID as reported in the main crash.

Screen capture at http://www.tmk.com/transient/crash-20100513002317.png
Complete dmesg, etc. available on request.

As I mentioned above, the system needed a hard reset to get going again. 
savecore doesn't think there's a usable dump, so I don't think there's any
more info to gather.

I just cvsup'd the box and built a new kernel, in case the previous cvsup was 
in between related commits, or to see if anything changed since. I still have 
the old kernel around in case any useful info can be gathered from it.

So, a couple of questions:

1) Is anything known to be funky with the bce driver?

2) Should the part of the system that caused the panic be able to lock up the 
crash dump process? Obviously, if the disk driver causes a panic, all bets are 
off when trying to use it to write the dump, but this crash seems to have been 
from a network driver. Shouldn't a double panic just give up on the dump and 
try a reboot?

3) Is there any way to rig the system to obtain more info if this happens 
again? Right now I'm using an embedded remote console server, but I could 
switch the system to a serial port if enabling the kernel debugger might help. 
But I think that the sleeping thread bit would happen even at the debugger 
prompt, wouldn't it? 

I just booted the new kernel and tried this again, and got another crash. The 
message is identical to the first, except that the instruction pointer changed 
by 0x10 (presumably due to code differences between the old and new kernels) 
and it got 6MB further writing the crash dump.

Since it seems I can reproduce this at will, I'll be glad to either perform 
additional information-gathering or give a developer access to the box for 
testing purposes.

Is it possible to correlate the source line in the kernel with the instruction 
pointer in the panic? 
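
For illustration only, the general idea behind that kind of correlation can 
be sketched as a lookup of the nearest preceding symbol. The symbol names and 
addresses below are hypothetical and are not taken from this system; a real 
lookup would use the symbol table of the exact kernel that panicked, and a 
debugger reading that kernel's debug information can go further and report 
the source line rather than just function+offset.

```python
import bisect

# Hypothetical symbol table: (start address, name) pairs, as might be parsed
# from nm-style output for a kernel image. These addresses and names are
# made up for illustration and do not come from the crashed system.
SYMBOLS = [
    (0xffffffff80200000, "example_func_a"),
    (0xffffffff80200140, "example_func_b"),
    (0xffffffff802003c0, "example_func_c"),
]


def resolve(ip, symbols):
    """Return (name, offset) for the nearest symbol at or below ip, or None."""
    table = sorted(symbols)
    starts = [addr for addr, _ in table]
    # Index of the last symbol whose start address is <= ip.
    idx = bisect.bisect_right(starts, ip) - 1
    if idx < 0:
        return None
    addr, name = table[idx]
    return name, ip - addr


def describe(ip, symbols):
    """Format ip in the usual name+0xoffset style."""
    hit = resolve(ip, symbols)
    if hit is None:
        return "0x%x: no symbol" % ip
    return "0x%x: %s+0x%x" % (ip, hit[0], hit[1])


if __name__ == "__main__":
    print(describe(0xffffffff80200150, SYMBOLS))
```

Note that this sketch has no notion of symbol sizes, so any address above the 
last symbol is attributed to that symbol. It also shows why the reported 
instruction pointer can shift between two kernel builds while the 
function+offset stays recognizably similar.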

        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA