Date: Fri, 14 May 2010 06:29:25 -0700
From: "Matthew Fleming" <matthew.fleming@isilon.com>
To: "John Baldwin" <jhb@FreeBSD.org>, "Terry Kennedy" <TERRY@tmk.com>
Cc: freebsd-stable@freebsd.org
Subject: RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Message-ID: <06D5F9F6F655AD4C92E28B662F7F853E021D4D5D@seaxch09.desktop.isilon.com>
References: <01NN32EOXMYC006UN1@tmk.com> <4BED3912.9080509@FreeBSD.org>

> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm. You could try changing the code to not do a nested panic in that
> case. You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic. However, there is still a good
> chance you will end up deadlocking in that case. I have another patch I
> can send you next week that prevents blocking on mutexes during a panic
> which may also help.

It would be instructive to know exactly why we were in turnstile(9), but
it's likely due to mtx contention.

AIX has some code at the beginning of all the locking operations to
avoid taking locks if we were running code out of kdb, though getting
that worked out was slightly tricky with our variant of mtx_assert(9).
I seem to recall there was also some "lockbusting" code that forcibly
reset all owned locks to have no owner, at least in some paths.

Given that the system is single-cpu and should be single-threaded when
dumping, this seems to me to be something worth working through to get
more reliable dumps. Except for mtx_assert(9), I can't think of a reason
to take locks once we start dumping or when in the debugger.

As an aside, with terribly corrupted locks I've seen double panics when
the attempt to print the lock name faulted in strlen(9) called for
printf(9), due to a bad lock name pointer. We have been able to get
enough info off these crashes to debug them, but it's useful to remember
that the system may be in a very unstable state depending on why it
panics.

Thanks,
matthew
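
The two ideas discussed above (skipping lock operations once a panic or debugger session has started, and "lockbusting" owned locks) can be illustrated with a small user-space sketch. This is not FreeBSD kernel code and not the patch mentioned in the thread: every name here (toy_mtx, toy_panicstr, toy_in_debugger, toy_bust_locks) is a hypothetical stand-in, and the "would block" case is modelled as a simple false return rather than a real turnstile.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel state; not the real FreeBSD symbols. */
static const char *toy_panicstr = NULL;  /* non-NULL once a panic has begun */
static bool toy_in_debugger = false;     /* true while running debugger code */

#define TOY_NO_OWNER (-1)

struct toy_mtx {
    const char *name;
    int owner;  /* owning thread id, or TOY_NO_OWNER */
};

/* Assumption: the dump/debugger path is single-threaded, so locking is moot. */
static bool toy_skip_locking(void)
{
    return toy_panicstr != NULL || toy_in_debugger;
}

/*
 * Returns true if the caller may proceed (lock taken, or locking skipped),
 * false where a real kernel would have to block on the lock.
 */
static bool toy_mtx_lock(struct toy_mtx *m, int tid)
{
    if (toy_skip_locking())
        return true;            /* leave the lock state untouched */
    if (m->owner != TOY_NO_OWNER)
        return false;           /* contended: would block */
    m->owner = tid;
    return true;
}

static void toy_mtx_unlock(struct toy_mtx *m, int tid)
{
    if (toy_skip_locking())
        return;                 /* nothing was taken, nothing to release */
    assert(m->owner == tid);    /* toy ownership check */
    m->owner = TOY_NO_OWNER;
}

/* "Lockbusting": forcibly mark every lock in the array as unowned. */
static void toy_bust_locks(struct toy_mtx *locks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        locks[i].owner = TOY_NO_OWNER;
}
```

In this sketch, a lock held by the thread that faulted stays recorded as owned after toy_panicstr is set, but later toy_mtx_lock() calls no longer fail on it. Note that the ownership assertion in toy_mtx_unlock() is also bypassed in that mode, which hints at why an mtx_assert(9)-style check needs separate care.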