Date: Fri, 14 May 2010 09:56:47 -0400 (EDT)
From: Terry Kennedy
In-reply-to: "Your message dated Fri, 14 May 2010 06:29:25 -0700"
 <06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
To: Matthew Fleming
Message-id: <01NN3PQCOFHE006UN1@tmk.com>
References: <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org, John Baldwin
Subject: RE: Crash dump problem - sleeping thread owns a non-sleepable lock
 during crash dump write

> > Hmm. You could try changing the code to not do a nested panic in that
> > case. You would update subr_turnstile.c to just return if panicstr is
> > not NULL rather than calling panic. However, there is still a good
> > chance you will end up deadlocking in that case. I have another patch I
> > can send you next week that prevents blocking on mutexes during a panic
> > which may also help.
>
> It would be instructive to know exactly why we were in turnstile(9) but
> it's likely due to mtx contention.
>
> AIX has some code at the beginning of all the locking operations to avoid
> taking locks if we were running code out of kdb, though getting that worked
> out was slightly tricky with our variant of mtx_assert(9). I seem to recall
> there was also some "lockbusting" code that forcibly reset all owned locks
> to have no owner, at least in some paths.
>
> Given that the system is single-cpu and should be single-threaded when
> dumping, this seems to me to be something worth working through to get
> more reliable dumps. Except for mtx_assert(9) I can't think of a reason
> to take locks once we start dumping or when in the debugger.

  As an aside, this is a quad-core, single-package CPU (an X3363). On both
this box and a similar one with an X5470, console messages continue to
print out after "the system has been halted - press any key to reboot". In
particular, the shutdown makes a bunch of the "behind the scenes"
management stuff like the virtual keyboard and monitor appear. Plugging or
unplugging USB devices will go through the whole deal of detecting them and
making their service available.

  I know the other CPUs are considered to still be running (hence the
"halting other CPUs" when you press a key to reboot), but this is the
first time I've seen device detection, attachment, etc. show up on the
console after a shutdown. Is this behavior to be expected, or is it as
unexpected as it was to me? Systems are Dell PowerEdge R300s, 8-STABLE
amd64.

> As an aside, with terribly corrupted locks I've seen double panics when the
> attempt to print the lock name faulted in strlen(9) called for printf(9),
> due to a bad lockname pointer. We have been able to get enough info off
> these crashes to debug them, but it's useful to remember that the system
> may be in a very unstable state depending on why it panics.

  True. In these crashes, the system is doing essentially nothing except
the one application (which, unfortunately, I don't have the source code
for).
The second crash happened right after booting the system, logging in,
and firing off the application. It left a footprint identical to the first
one (other than the 0x10 byte offset due to a recompiled kernel), where the
system had been up for 13+ hours. So, in this case I don't think there was
a bunch of corruption piling up which triggered the fault, but instead the
one simple operation and right away - splat!

  As I mentioned in the original posting, I'd be glad to give a developer
complete access to the system via the remote console (Dell DRAC 5 web
interface) and to the underlying FreeBSD if it'll help pin down the
problem.

  Another thing I could try (it would take a couple of days until I could
get someone to the site) would be to use a bge port instead of the bce
one. That might help pin it down to either something in the bce-specific
code path, or somewhere else in the stack.

        Thanks,
        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA