From owner-freebsd-current@FreeBSD.ORG  Wed Jul 29 14:08:17 2009
Return-Path: <owner-freebsd-current@FreeBSD.ORG>
Delivered-To: freebsd-current@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
	by hub.freebsd.org (Postfix) with ESMTP id D1EDD106566C;
	Wed, 29 Jul 2009 14:08:17 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from cyrus.watson.org (cyrus.watson.org [65.122.17.42])
	by mx1.freebsd.org (Postfix) with ESMTP id 925FA8FC18;
	Wed, 29 Jul 2009 14:08:17 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (66.111.2.69.static.nyinternet.net
	[66.111.2.69])
	by cyrus.watson.org (Postfix) with ESMTPSA id 2F23846B1A;
	Wed, 29 Jul 2009 10:08:17 -0400 (EDT)
Received: from jhbbsd.hudson-trading.com (unknown [209.249.190.8])
	by bigwig.baldwin.cx (Postfix) with ESMTPA id 6DEF68A0A4;
	Wed, 29 Jul 2009 10:08:16 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: freebsd-current@freebsd.org
Date: Wed, 29 Jul 2009 09:50:42 -0400
User-Agent: KMail/1.9.7
References: <746CE32B-BCF8-460A-982D-25341554E8FD@lassitu.de>
	<226F1AFF-45D8-4E4C-BE7F-D2EDC35EC8F6@lassitu.de>
	<3bbf2fe10907281943m2392a9f9w7c69303e6c3b91d0@mail.gmail.com>
In-Reply-To: <3bbf2fe10907281943m2392a9f9w7c69303e6c3b91d0@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200907290950.43842.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.0.1
	(bigwig.baldwin.cx); Wed, 29 Jul 2009 10:08:16 -0400 (EDT)
X-Virus-Scanned: clamav-milter 0.95.1 at bigwig.baldwin.cx
X-Virus-Status: Clean
X-Spam-Status: No, score=-2.5 required=4.2 tests=AWL,BAYES_00,RDNS_NONE
	autolearn=no version=3.2.5
X-Spam-Checker-Version: SpamAssassin 3.2.5 (2008-06-10) on bigwig.baldwin.cx
Cc: Stefan Bethke <stb@lassitu.de>,
	Giovanni Trematerra <giovanni.trematerra@gmail.com>,
	Dan Naumov <dan.naumov@gmail.com>, Attilio Rao <attilio@freebsd.org>,
	barbara <barbara.xxx1975@libero.it>, "Bjoern A. Zeeb" <bz@freebsd.org>,
	Robert Watson <rwatson@freebsd.org>, "C. C. Tang" <hiyorin@gmail.com>
Subject: Re: spinlock held too long on reboot
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
	<freebsd-current.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>, 
	<mailto:freebsd-current-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-current>
List-Post: <mailto:freebsd-current@freebsd.org>
List-Help: <mailto:freebsd-current-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-current>,
	<mailto:freebsd-current-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Wed, 29 Jul 2009 14:08:18 -0000

On Tuesday 28 July 2009 10:43:36 pm Attilio Rao wrote:
> 2009/5/23 Stefan Bethke <stb@lassitu.de>:
> > I wrote:
> >
> >> Syncing disks, vnodes remaining...0 done
> >> All buffers synced.
> >> GEOM_MIRROR: Device diesel_root: provider mirror/diesel_root destroyed.
> >> Uptime: 6m32s
> >> GEOM_MIRROR: Device diesel_root destroyed.
> >> Rebooting...
> >> cpu_reset: Stopping other CPUs
> >> spin lock 0xffffffff8078c900 (sched lock 1) held by 0xffffff00014d4ab0
> >> (tid 100002) too long
> >> panic: spin lock held too long
> >> cpuid = 0
> >> KDB: enter: panic
> >> [thread pid 77 tid 100090 ]
> >> Stopped at      kdb_enter+0x3d: movq    $0,0x48bbd0(%rip)
> >> db> bt
> >> Tracing pid 77 tid 100090 td 0xffffff000457bab0
> >> kdb_enter() at kdb_enter+0x3d
> >> panic() at panic+0x17b
> >> _mtx_lock_spin_failed() at _mtx_lock_spin_failed+0x39
> >> _mtx_lock_spin() at _mtx_lock_spin+0x9e
> >> _mtx_lock_spin_flags() at _mtx_lock_spin_flags+0x72
> >> sched_balance_group() at sched_balance_group+0xc5
> >> sched_balance_group() at sched_balance_group+0x1f8
> >> sched_balance() at sched_balance+0xa2
> >> sched_clock() at sched_clock+0xf6
> >> statclock() at statclock+0xbd
> >> lapic_handle_timer() at lapic_handle_timer+0x197
> >> Xtimerint() at Xtimerint+0x8c
> >> --- interrupt, rip = 0xffffffff80541cc4, rsp = 0xffffff80771dba90, rbp =
> >> 0xffffff80771dbab0 ---
> >> DELAY() at DELAY+0x64
> >> cpu_reset() at cpu_reset+0xdd
> >> boot() at boot+0x2e6
> >> reboot() at reboot+0x42
> >> syscall() at syscall+0x1a5
> >> Xfast_syscall() at Xfast_syscall+0xd0
> >> --- syscall (55, FreeBSD ELF64, reboot), rip = 0x800788eec, rsp =
> >> 0x7fffffffeca8, rbp = 0 ---
> >
> >
> > I've only seen this once.  If I should encounter it again, is there
> > something you'd like me to look at?
> 
> [ Sorry, trying to add anyone who already reported such a problem, even
> if I know many of you experienced it on -STABLE. ]
> 
> Could you try this patch against -CURRENT:
> http://www.freebsd.org/~attilio/stop_nmi.diff
> 
> This patch basically does 2 things:
> 1) It removes the STOP_NMI option and adds the infrastructure for
> using NMI on KDB invocation and normal stop IPIs on standard CPU
> shutdown.
> To accomplish that with a better design than what STOP_NMI offers
> now, 2 new functions are introduced:
> * ipi_hstop_selected(), which sends, if the architecture offers such
> an option, a "forced" IPI through a privileged channel (NMI on amd64
> and ia32) in order to stop the CPUs passed in the mask.  Note that
> on architectures other than amd64 and ia32, ipi_hstop_selected()
> defaults to ipi_selected(..., STOP_IPI), but maintainers who want to
> override that can simply implement something harder

Why not just add a new IPI_STOP_HARD that maps to IPI_STOP on most archs and 
does the NMI logic on x86?  This avoids adding a new API 
(ipi_hstop_selected()) and instead just adds a new logical IPI.
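
A rough userland sketch of that idea, not the actual patch: the function
ipi_selected_sketch(), the struct, the enum and the numeric IPI values below
are all made up for illustration, and a runtime flag stands in for what would
really be a per-architecture compile-time decision.

```c
/* Hypothetical IPI identifiers; the values are placeholders only. */
#define	IPI_STOP	1
#define	IPI_STOP_HARD	2

/* How the stop request ends up being delivered (illustrative). */
enum delivery { DELIVER_VECTOR, DELIVER_NMI };

/* Record of what would have been sent, so the mapping can be inspected. */
struct sent_ipi {
	unsigned int	cpus;	/* mask of target CPUs */
	int		ipi;	/* logical IPI actually used */
	enum delivery	how;	/* normal vector or NMI */
};

/*
 * Stand-in for ipi_selected().  arch_has_nmi models "this is x86";
 * callers keep using the one existing entry point and only pick a
 * different logical IPI.
 */
struct sent_ipi
ipi_selected_sketch(unsigned int cpus, int ipi, int arch_has_nmi)
{
	struct sent_ipi s = { cpus, ipi, DELIVER_VECTOR };

	if (ipi == IPI_STOP_HARD) {
		if (arch_has_nmi)
			s.how = DELIVER_NMI;	/* x86: force the stop via NMI */
		else
			s.ipi = IPI_STOP;	/* elsewhere: plain stop IPI */
	}
	return (s);
}
```

With something along these lines, the KDB path would ask for IPI_STOP_HARD
and the normal shutdown path for IPI_STOP, both through the same call.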

-- 
John Baldwin