Date: Fri, 14 May 2010 07:59:40 -0400 (EDT)
From: Terry Kennedy <TERRY@tmk.com>
To: John Baldwin <jhb@FreeBSD.org>
Cc: freebsd-stable@FreeBSD.org
Subject: Re: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Message-ID: <01NN3LDWWAQ6006QOF@tmk.com>
In-Reply-To: "Your message dated Fri, 14 May 2010 07:50:42 -0400" <4BED3912.9080509@FreeBSD.org>
References: <01NN32EOXMYC006UN1@tmk.com>
> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm. You could try changing the code to not do a nested panic in that
> case. You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic. However, there is still a good
> chance you will end up deadlocking in that case. I have another patch I
> can send you next week that prevents blocking on mutexes during a panic
> which may also help.

  Ok, I'll be glad to try that.

> > 3) Is there any way to rig the system to obtain more info if this happens
> > again? Right now I'm using an embedded remote console server, but I could
> > switch the system to a serial port if enabling the kernel debugger might help.
> > But I think that the sleeping thread bit would happen even at the debugger
> > prompt, wouldn't it?
>
> Include DDB and enable the 'trace_on_panic' sysctl knob perhaps.

  Hmmm. Do you think it will get very far before the sleeping thread business
locks it up?

> > Is it possible to correlate the source line in the kernel with the instruction
> > pointer in the panic?
>
> If you are booted into the same kernel with the same modules loaded, you
> can probably run 'kgdb' as root and do 'l *<instruction pointer>'.

  I did that and discovered that the 0x20: prefix is probably unwanted:

(kgdb) l *0x20:0xffffffff801e3c06
A syntax error in expression, near `:0xffffffff801e3c06'.
(kgdb) l *0xffffffff801e3c06
0xffffffff801e3c06 is in bce_start_locked (/usr/src/sys/dev/bce/if_bce.c:6996).
6991                    }
6992
6993                    count++;
6994
6995                    /* Send a copy of the frame to any BPF listeners. */
6996                    ETHER_BPF_MTAP(ifp, m_head);
6997            }
6998
6999            /* Exit if no packets were dequeued. */
7000            if (count == 0) {
(kgdb)

  This kernel does have BPF compiled in, but I don't think it was in use at
the time.

  Any further suggestions on what to look at? (Remember, this system is in
another state from me and all I have is remote access to the framebuffer - I'd
have to go there and set up a serial console to be able to talk to the debugger
if it crashes.)

        Thanks,
        Terry Kennedy             http://www.tmk.com
        terry@tmk.com             New York, NY USA
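
  P.S. On the kgdb point above: the panic message prints the instruction
pointer as selector:offset (0x20:0xffffffff801e3c06 here), and kgdb only takes
the offset part. A minimal sketch of stripping the selector before pasting the
address into kgdb might look like the following. The function name
kgdb_list_command is hypothetical (it is not part of kgdb or FreeBSD), and the
sketch assumes the address is always written in hex with a 0x prefix.

```python
import re


def kgdb_list_command(panic_ip: str) -> str:
    """Build a kgdb 'l *addr' command from a panic instruction pointer.

    Hypothetical helper: accepts either 'selector:offset' (as printed in
    the panic message) or a bare offset, and returns the kgdb command text.
    """
    # Keep only what follows the last colon; kgdb rejects the selector part.
    offset = panic_ip.strip().rsplit(":", 1)[-1]
    # Assumption: the offset is hex with a 0x prefix, as in the panic output.
    if not re.fullmatch(r"0x[0-9a-fA-F]+", offset):
        raise ValueError(f"not a hex address: {panic_ip!r}")
    return f"l *{offset}"


if __name__ == "__main__":
    print(kgdb_list_command("0x20:0xffffffff801e3c06"))
```

  The resulting line is what was typed at the second (kgdb) prompt in the
session above.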