Date:      Thu, 17 Nov 2011 01:07:38 +0200
From:      Alexander Motin <mav@FreeBSD.org>
To:        Andriy Gapon <avg@FreeBSD.org>
Cc:        freebsd-current@FreeBSD.org, Konstantin Belousov <kib@FreeBSD.org>
Subject:   Re: Stop scheduler on panic
Message-ID:  <4EC4423A.3020904@FreeBSD.org>
In-Reply-To: <4EC43764.1020202@FreeBSD.org>
References:  <20111113083215.GV50300@deviant.kiev.zoral.com.ua> <20111116202714.5ee4bd53@fabiankeil.de> <4EC43764.1020202@FreeBSD.org>

On 17.11.2011 00:21, Andriy Gapon wrote:
> on 16/11/2011 21:27 Fabian Keil said the following:
>> Kostik Belousov<kostikbel@gmail.com>  wrote:
>>
>>> I was tricked into finishing the work by Andrey Gapon, who developed
>>> the patch to reliably stop other processors on panic.  The patch
>>> greatly improves the chances of getting dump on panic on SMP host.
>>
>> I tested the patch trying to get a dump (from the debugger) for
>> kern/162036, which currently results in the double fault reported in:
>> http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html
>>
>> It didn't help, but also didn't make anything worse.
>>
>> Fabian
>
> The mi_switch recursion looks very familiar to me:
> mi_switch() at mi_switch+0x270
> critical_exit() at critical_exit+0x9b
> spinlock_exit() at spinlock_exit+0x17
> mi_switch() at mi_switch+0x275
> critical_exit() at critical_exit+0x9b
> spinlock_exit() at spinlock_exit+0x17
> [several pages of the previous three lines skipped]
> mi_switch() at mi_switch+0x275
> critical_exit() at critical_exit+0x9b
> spinlock_exit() at spinlock_exit+0x17
> intr_event_schedule_thread() at intr_event_schedule_thread+0xbb
> ahci_end_transaction() at ahci_end_transaction+0x398
> ahci_ch_intr() at ahci_ch_intr+0x2b5
> ahcipoll() at ahcipoll+0x15
> xpt_polled_action() at xpt_polled_action+0xf7
>
> In fact I once discussed with jhb this recursion triggered from a different
> place.  To quote myself:
> <avg>    spinlock_exit ->  critical_exit ->  mi_switch ->  kdb_switch ->
> thread_unlock ->  spinlock_exit ->  critical_exit ->  mi_switch ->  ...
> <avg>    in the kdb context
> <avg>    this issue seems to be triggered by td_owepreempt being true at the time
> kdb is entered
> <avg>    and there of course has to be an initial spinlock_exit call somewhere
> <avg>    in my case it's because of usb keyboard
> <avg>    I wonder if it would make sense to clear td_owepreempt right before
> calling kdb_switch in mi_switch
> <avg>    instead of in sched_switch()
> <avg>    clearing td_owepreempt seems like a scheduler-independent operation to me
> <avg>    or is it better to just skip locking in usb when kdb_active is set
> <avg>    ?
>
> The workaround described above should work in this case.
> Another possibility is to pessimize mtx_unlock_spin() implementations to check
> SCHEDULER_STOPPED() and to bypass any further actions in that case.  But that
> would add unnecessary overhead to the sunny day code paths.
>
> Going further up the stack one can come up with the following proposals:
> - check SCHEDULER_STOPPED() in swi_sched() and return early
> - do not call swi_sched() from xpt_done() if we somehow know that we are in a
> polling mode

CAM currently has no flag to indicate polling mode, but if needed, it
should not be difficult to add one and skip the swi_sched() call.
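
To illustrate the two proposals from the end of the quoted list, here is a
rough userland sketch, not kernel code. Everything in it is an assumption
made for illustration: the struct, the `polling` field, the `_sketch`
function names and the `scheduler_stopped` variable are hypothetical
stand-ins for a CAM polling flag, xpt_done(), swi_sched() and
SCHEDULER_STOPPED().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the state that SCHEDULER_STOPPED() reports. */
static bool scheduler_stopped = false;

/* Hypothetical per-SIM state; CAM has no such polling flag today. */
struct sim_sketch {
	bool polling;        /* would be set while a polled action runs */
	int  queued;         /* completions placed on the done queue */
	int  swi_scheduled;  /* times the software interrupt was scheduled */
};

/* Model of swi_sched(): return early once the scheduler is stopped. */
static void
swi_sched_sketch(struct sim_sketch *sim)
{
	if (scheduler_stopped)
		return;
	sim->swi_scheduled++;
}

/*
 * Model of xpt_done(): always queue the completion, but skip the
 * software interrupt in polling mode, where the poller is assumed
 * to drain the done queue itself.
 */
static void
xpt_done_sketch(struct sim_sketch *sim)
{
	assert(sim != NULL);
	sim->queued++;
	if (sim->polling)
		return;
	swi_sched_sketch(sim);
}
```

In this model either check alone keeps the completion path away from the
scheduler: the polling flag stops xpt_done_sketch() from reaching
swi_sched_sketch() at all, while the stopped-scheduler check covers callers
that know nothing about polling.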

-- 
Alexander Motin


