From owner-freebsd-current@FreeBSD.ORG Thu Nov 17 08:41:03 2011 Return-Path: Delivered-To: freebsd-current@freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34]) by hub.freebsd.org (Postfix) with ESMTP id C457D106564A; Thu, 17 Nov 2011 08:41:03 +0000 (UTC) (envelope-from mavbsd@gmail.com) Received: from mail-bw0-f54.google.com (mail-bw0-f54.google.com [209.85.214.54]) by mx1.freebsd.org (Postfix) with ESMTP id 27EEA8FC16; Thu, 17 Nov 2011 08:41:02 +0000 (UTC) Received: by bkbzs8 with SMTP id zs8so2271366bkb.13 for ; Thu, 17 Nov 2011 00:41:01 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=sender:message-id:date:from:user-agent:mime-version:to:cc:subject :references:in-reply-to:content-type:content-transfer-encoding; bh=PPTq1XWz/3W4k/WOeHXZ7HvC7+rMNCWNX2wDqDEfEVs=; b=IiyX4jb+r3J63pFKHU0w9Vsv9OvbUW2+3BGX3vqygqbI8spRHL8UeFI7eh2MZgcbj5 z+HDKyx04W6lFunjddq/n2uXNhCSpbCBcgWvPi1+bBQwcGbBEVWxkAyWwwfC1VotfX8C 7OcIbmMXbEO0WQJ+MfvE4jfYwHmargeuYRcNE= Received: by 10.204.136.211 with SMTP id s19mr25748615bkt.28.1321519261272; Thu, 17 Nov 2011 00:41:01 -0800 (PST) Received: from mavbook2.mavhome.dp.ua (pc.mavhome.dp.ua. [212.86.226.226]) by mx.google.com with ESMTPS id y10sm1456420fal.10.2011.11.17.00.40.59 (version=TLSv1/SSLv3 cipher=OTHER); Thu, 17 Nov 2011 00:41:00 -0800 (PST) Sender: Alexander Motin Message-ID: <4EC4C89A.2040208@FreeBSD.org> Date: Thu, 17 Nov 2011 10:40:58 +0200 From: Alexander Motin User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:7.0.1) Gecko/20111003 Thunderbird/7.0.1 MIME-Version: 1.0 To: Kostik Belousov References: <20111113083215.GV50300@deviant.kiev.zoral.com.ua> <20111116202714.5ee4bd53@fabiankeil.de> <4EC43764.1020202@FreeBSD.org> <4EC4423A.3020904@FreeBSD.org> <20111117081533.GP50300@deviant.kiev.zoral.com.ua> In-Reply-To: <20111117081533.GP50300@deviant.kiev.zoral.com.ua> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Cc: freebsd-current@freebsd.org, Andriy Gapon Subject: Re: Stop scheduler on panic X-BeenThere: freebsd-current@freebsd.org X-Mailman-Version: 2.1.5 Precedence: list List-Id: Discussions about the use of FreeBSD-current List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 17 Nov 2011 08:41:03 -0000 On 11/17/11 10:15, Kostik Belousov wrote: > On Thu, Nov 17, 2011 at 01:07:38AM +0200, Alexander Motin wrote: >> On 17.11.2011 00:21, Andriy Gapon wrote: >>> on 16/11/2011 21:27 Fabian Keil said the following: >>>> Kostik Belousov wrote: >>>> >>>>> I was tricked into finishing the work by Andrey Gapon, who developed >>>>> the patch to reliably stop other processors on panic. The patch >>>>> greatly improves the chances of getting dump on panic on SMP host. >>>> >>>> I tested the patch trying to get a dump (from the debugger) for >>>> kern/162036, which currently results in the double fault reported in: >>>> http://lists.freebsd.org/pipermail/freebsd-current/2011-September/027766.html >>>> >>>> It didn't help, but also didn't make anything worse. >>>> >>>> Fabian >>> >>> The mi_switch recursion looks very familiar to me: >>> mi_switch() at mi_switch+0x270 >>> critical_exit() at critical_exit+0x9b >>> spinlock_exit() at spinlock_exit+0x17 >>> mi_switch() at mi_switch+0x275 >>> critical_exit() at critical_exit+0x9b >>> spinlock_exit() at spinlock_exit+0x17 >>> [several pages of the previous three lines skipped] >>> mi_switch() at mi_switch+0x275 >>> critical_exit() at critical_exit+0x9b >>> spinlock_exit() at spinlock_exit+0x17 >>> intr_even_schedule_thread() at intr_event_schedule_thread+0xbb >>> ahci_end_transaction() at ahci_end_transaction+0x398 >>> ahci_ch_intr() at ahci_ch_intr+0x2b5 >>> ahcipoll() at ahcipoll+0x15 >>> xpt_polled_action() at xpt_polled_action+0xf7 >>> >>> In fact I once discussed with jhb this recursion triggered from a different >>> place. To quote myself: >>> spinlock_exit -> critical_exit -> mi_switch -> kdb_switch -> >>> thread_unlock -> spinlock_exit -> critical_exit -> mi_switch -> ... >>> in the kdb context >>> this issue seems to be triggered by td_owepreempt being true at >>> the time >>> kdb is entered >>> and there of course has to be an initial spinlock_exit call >>> somewhere >>> in my case it's because of usb keyboard >>> I wonder if it would make sense to clear td_owepreempt right >>> before >>> calling kdb_switch in mi_switch >>> instead of in sched_switch() >>> clearing td_owepreempt seems like a scheduler-independent >>> operation to me >>> or is it better to just skip locking in usb when kdb_active is set >>> ? >>> >>> The workaround described above should work in this case. >>> Another possibility is to pessimize mtx_unlock_spin() implementations to >>> check >>> SCHEDULER_STOPPED() and to bypass any further actions in that case. But >>> that >>> would add unnecessary overhead to the sunny day code paths. >>> >>> Going further up the stack one can come up with the following proposals: >>> - check SCHEDULER_STOPPED() swi_sched() and return early >>> - do not call swi_sched() from xpt_done() if we somehow know that we are >>> in a >>> polling mode >> >> There is no flag in CAM now to indicate polling mode, but if needed, it >> should not be difficult to add one and not call swi_sched(). > > I have the following change for eons on my test boxes. Without it, > I simply cannot get _any_ dump. > > diff --git a/sys/cam/cam_xpt.c b/sys/cam/cam_xpt.c > index 10b89c7..a38e42f 100644 > --- a/sys/cam/cam_xpt.c > +++ b/sys/cam/cam_xpt.c > @@ -4230,7 +4230,7 @@ xpt_done(union ccb *done_ccb) > TAILQ_INSERT_TAIL(&cam_simq, sim, links); > mtx_unlock(&cam_simq_lock); > sim->flags |= CAM_SIM_ON_DONEQ; > - if (first) > + if (first && panicstr == NULL) > swi_sched(cambio_ih, 0); > } > } That should be OK for kernel dumping. I was thinking about CAM abusing polling not only for dumping. But looking on cases where it does it now, may be it is better to rewrite them instead. -- Alexander Motin