Date: Fri, 12 Aug 2011 15:59:21 -0400
From: Andrew Boyer <aboyer@averesystems.com>
To: Andriy Gapon <avg@FreeBSD.org>, Hans Petter Selasky <hselasky@c2i.net>
Cc: Vishal.Shah@netapp.com, freebsd-stable@freebsd.org, Steven Hartland <killing@multiplay.co.uk>, Eugene Grosbein <egrosbein@rdtc.ru>, Jeremiah Lott <jlott@averesystems.com>
Subject: USB/coredump hangs in 8 and 9
Message-ID: <DA1FD6FD-2E57-4EC4-899D-2C1CBB769456@averesystems.com>
Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on freebsd-stable)
Re: System hang in USB umass module while processing panic (originally on freebsd-usb)

Hello Andriy and Hans,

Sorry for tying in so many discussions on this topic, but I think I have an explanation for the problems we have been reporting* with hanging coredumps on multicore systems on 8.2-RELEASE, and it has implications for Andriy's proposed scheduler patch** and for USB.

In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs when the kernel panics, but many parts of the locking code get disabled (grep for 'panicstr'). The 'bufwrite: buffer is not busy???' panic is caused by the syncer encountering an error. If that happens while the syncer is running on the dumping CPU, everything hangs. If it is running on a different CPU, it will be blocked and hidden by the panic_cpu spinlock in panic(), and the dump continues, polling every attached keyboard for a Ctrl-C.

But the new 8.X USB stack relies on multithreading. (The new stack is the variable that broke coredumps for us in the 7.1->8.2 transition, I think.) SVN 224223 fixes a hang that would happen when dumpsys() polls the USB keyboard (IPMI KVM, in our case). That helps, but it only gets as far as usb_process(), where it hangs in a loop around a cv_wait() call. This is easy to reproduce by adding code to the watchdog to break into the debugger if panicstr is set.

I am experimenting with Andriy's patch** to stop the scheduler, and it seems to be most of the way there: it stops the CPUs and disables the rest of the locking. There are a few places that still reference panicstr, but that's minor. These are the changes I made to the patch:

* Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is true, so that we don't hang up in USB.
  ukbd_yield() locks up in DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up trying to drop mutexes.
* Changed the call to spinlock_enter() back to critical_enter(), so that interrupts stay enabled and the hardclock still functions.
* Added code at the beginning of panic() to switch to CPU 0, so that we're able to service the hardclock interrupts and so that watchdog panics get through.

This has worked 100% of the time for me so far, although anyone using a USB keyboard or dump device would still be out of luck.

Thoughts? It seems like stopping all of the other CPUs is the right thing to do on a panic (what are they doing otherwise?). Are the USB issues fixable? If Andriy's patch gets committed, it might just involve short-circuiting all of the locking in the polling path, but I haven't gotten that far yet. I bet dumping to NFS will have the same problem.

Thanks,
Andrew

* - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff

--------------------------------------------------
Andrew Boyer		aboyer@averesystems.com