Date: Thu, 17 Nov 2011 01:07:38 +0200 From: Alexander Motin <mav@FreeBSD.org> To: Andriy Gapon <avg@FreeBSD.org> Cc: freebsd-current@FreeBSD.org, Konstantin Belousov <kib@FreeBSD.org> Subject: Re: Stop scheduler on panic Message-ID: <4EC4423A.3020904@FreeBSD.org> In-Reply-To: <4EC43764.1020202@FreeBSD.org> References: <20111113083215.GV50300@deviant.kiev.zoral.com.ua> <20111116202714.5ee4bd53@fabiankeil.de> <4EC43764.1020202@FreeBSD.org>
next in thread | previous in thread | raw e-mail | index | archive | help
On 17.11.2011 00:21, Andriy Gapon wrote: > on 16/11/2011 21:27 Fabian Keil said the following: >> Kostik Belousov<kostikbel@gmail.com> wrote: >> >>> I was tricked into finishing the work by Andrey Gapon, who developed >>> the patch to reliably stop other processors on panic. The patch >>> greatly improves the chances of getting dump on panic on SMP host. >> >> I tested the patch trying to get a dump (from the debugger) for >> kern/162036, which currently results in the double fault reported in: >> http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html >> >> It didn't help, but also didn't make anything worse. >> >> Fabian > > The mi_switch recursion looks very familiar to me: > mi_switch() at mi_switch+0x270 > critical_exit() at critical_exit+0x9b > spinlock_exit() at spinlock_exit+0x17 > mi_switch() at mi_switch+0x275 > critical_exit() at critical_exit+0x9b > spinlock_exit() at spinlock_exit+0x17 > [several pages of the previous three lines skipped] > mi_switch() at mi_switch+0x275 > critical_exit() at critical_exit+0x9b > spinlock_exit() at spinlock_exit+0x17 > intr_even_schedule_thread() at intr_event_schedule_thread+0xbb > ahci_end_transaction() at ahci_end_transaction+0x398 > ahci_ch_intr() at ahci_ch_intr+0x2b5 > ahcipoll() at ahcipoll+0x15 > xpt_polled_action() at xpt_polled_action+0xf7 > > In fact I once discussed with jhb this recursion triggered from a different > place. To quote myself: > <avg> spinlock_exit -> critical_exit -> mi_switch -> kdb_switch -> > thread_unlock -> spinlock_exit -> critical_exit -> mi_switch -> ... > <avg> in the kdb context > <avg> this issue seems to be triggered by td_owepreempt being true at the time > kdb is entered > <avg> and there of course has to be an initial spinlock_exit call somewhere > <avg> in my case it's because of usb keyboard > <avg> I wonder if it would make sense to clear td_owepreempt right before > calling kdb_switch in mi_switch > <avg> instead of in sched_switch() > <avg> clearing td_owepreempt seems like a scheduler-independent operation to me > <avg> or is it better to just skip locking in usb when kdb_active is set > <avg> ? > > The workaround described above should work in this case. > Another possibility is to pessimize mtx_unlock_spin() implementations to check > SCHEDULER_STOPPED() and to bypass any further actions in that case. But that > would add unnecessary overhead to the sunny day code paths. > > Going further up the stack one can come up with the following proposals: > - check SCHEDULER_STOPPED() swi_sched() and return early > - do not call swi_sched() from xpt_done() if we somehow know that we are in a > polling mode There is no flag in CAM now to indicate polling mode, but if needed, it should not be difficult to add one and not call swi_sched(). -- Alexander Motin
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4EC4423A.3020904>