FreeBSD Mail Archives

Date:      Fri, 14 May 2010 06:29:25 -0700
From:      "Matthew Fleming" <matthew.fleming@isilon.com>
To:        "John Baldwin" <jhb@FreeBSD.org>, "Terry Kennedy" <TERRY@tmk.com>
Cc:        freebsd-stable@freebsd.org
Subject:   RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Message-ID:  <06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
References:  <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>


> > The crash was a "page fault while in kernel mode" with the current process 
> > being the interrupt service routine for the bce0 GigE. Things progressed 
> > reasonably until partway through the dump, when the system locked up with a 
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the 
> > same PID as reported in the main crash.
> 
> Hmm.  You could try changing the code to not do a nested panic in that 
> case.  You would update subr_turnstile.c to just return if panicstr is 
> not NULL rather than calling panic.  However, there is still a good 
> chance you will end up deadlocking in that case.  I have another patch I 
> can send you next week that prevents blocking on mutexes during a panic 
> which may also help.

It would be instructive to know exactly why we were in turnstile(9), but it's likely due to mtx contention.
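
For reference, the change quoted above (return instead of calling panic when panicstr is already set) could look roughly like the userland sketch below. Only the name panicstr comes from this thread; sketch_panic(), sketch_sleeping_owner_check() and nested_panics are made-up stand-ins, and this is not the actual subr_turnstile.c code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Userland stand-ins; only the name panicstr is taken from the thread. */
const char *panicstr = NULL;  /* non-NULL once a panic has started */
int nested_panics = 0;        /* panics raised while already panicking */

/* Hypothetical stand-in for panic(): keep the first message, count repeats. */
void
sketch_panic(const char *msg)
{
    assert(msg != NULL);
    if (panicstr != NULL) {
        nested_panics++;
        return;
    }
    panicstr = msg;
    fprintf(stderr, "panic: %s\n", msg);
}

/*
 * Hypothetical version of the "sleeping thread owns a non-sleepable
 * lock" check. If a panic is already in progress it just returns,
 * so the dump is not interrupted by a nested panic.
 */
void
sketch_sleeping_owner_check(bool owner_is_sleeping)
{
    if (!owner_is_sleeping)
        return;
    if (panicstr != NULL)
        return;  /* already panicking: do not panic again */
    sketch_panic("sleeping thread owns a non-sleepable lock");
}
```

As noted in the quoted text, returning here only avoids the second panic; the caller may still deadlock waiting for the lock.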

AIX has some code at the beginning of all the locking operations to avoid taking locks if we were running code out of kdb, though getting that worked out was slightly tricky with our variant of mtx_assert(9).  I seem to recall there was also some "lockbusting" code that forcibly reset all owned locks to have no owner, at least in some paths.
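
To make that concrete, here is a minimal userland sketch of both ideas: a check at the top of the lock operations that bypasses the lock while in the debugger, and a "lockbusting" pass that resets owners. All names (struct sketch_lock, sketch_in_debugger, sketch_bust_locks() and so on) are invented for illustration and are not the AIX or FreeBSD interfaces. It assumes a single thread, which matches the dump case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NO_OWNER (-1)

/* Hypothetical lock record: a name and the owning thread id. */
struct sketch_lock {
    const char *name;
    int owner;  /* thread id, or SKETCH_NO_OWNER */
};

/* Hypothetical flag: set while running out of the debugger or dumping. */
bool sketch_in_debugger = false;

/* Returns true if the caller may proceed into the protected code. */
bool
sketch_lock_acquire(struct sketch_lock *lk, int tid)
{
    assert(lk != NULL);
    if (sketch_in_debugger)
        return (true);   /* bypass: leave the lock state untouched */
    if (lk->owner != SKETCH_NO_OWNER)
        return (false);  /* contended; a real kernel would block or spin */
    lk->owner = tid;
    return (true);
}

void
sketch_lock_release(struct sketch_lock *lk, int tid)
{
    assert(lk != NULL);
    if (sketch_in_debugger)
        return;          /* nothing was taken, so nothing to release */
    if (lk->owner == tid)
        lk->owner = SKETCH_NO_OWNER;
}

/* "Lockbusting": forcibly clear every owner; returns how many were reset. */
size_t
sketch_bust_locks(struct sketch_lock *locks, size_t n)
{
    size_t busted = 0;

    assert(locks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (locks[i].owner != SKETCH_NO_OWNER) {
            locks[i].owner = SKETCH_NO_OWNER;
            busted++;
        }
    }
    return (busted);
}
```

The bypass leaves the recorded owner alone, which is why an ownership assertion in the style of mtx_assert(9) would need its own exemption in this mode.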

Given that the system is single-cpu and should be single-threaded when dumping, this seems to me to be something worth working through to get more reliable dumps.  Except for mtx_assert(9) I can't think of a reason to take locks once we start dumping or when in the debugger.

As an aside, with terribly corrupted locks I've seen double panics when the attempt to print the lock name faulted in strlen(9) called for printf(9), due to a bad lockname pointer.  We have been able to get enough info off these crashes to debug them, but it's useful to remember that the system may be in a very unstable state depending on why it panics.
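
One way to sidestep that particular double panic, sketched in userland C with a made-up helper name (sketch_describe_lock() is not an existing kernel function): once we are panicking, print the address of the name rather than dereferencing it, and bound the length otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: describe a lock in buf. While panicking the
 * name pointer is not trusted, so its address is printed and the
 * string is never read. Otherwise at most 32 bytes of it are printed.
 * Returns the snprintf() result.
 */
int
sketch_describe_lock(char *buf, size_t buflen, const char *name,
    bool panicking)
{
    assert(buf != NULL && buflen > 0);
    if (panicking || name == NULL)
        return (snprintf(buf, buflen, "lock name @ %p",
            (const void *)name));
    return (snprintf(buf, buflen, "lock %.32s", name));
}
```

This loses the readable name in the panic message, but the address is usually enough to find the lock in the dump afterwards.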

Thanks,
matthew

Want to link to this message? Use this
URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?06D5F9F6F655AD4C92E28B662F7F853E021D4D5D>
