From owner-freebsd-arch@freebsd.org  Fri Aug 21 15:41:34 2015
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to
 stop CPUs?
To: Adrian Chadd <adrian@freebsd.org>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
 <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
 <55D74193.4020008@FreeBSD.org>
 <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG+DdZ3fZdvFnan06g@mail.gmail.com>
Cc: Ryan Stone <rysto32@gmail.com>,
 freebsd-current <freebsd-current@freebsd.org>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 Scott Long <scottl@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
From: Eric van Gyzen <vangyzen@FreeBSD.org>
Message-ID: <55D746AB.6040001@FreeBSD.org>
Date: Fri, 21 Aug 2015 10:41:31 -0500
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:38.0) Gecko/20100101
 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG+DdZ3fZdvFnan06g@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Spinning is probably the only safe option in NMI context, since the NMI could have arrived at literally any time in any context (e.g. while holding a spin lock, or with interrupts disabled).  :-/
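
To make the race concrete, here is a minimal userspace sketch in C11 of one way CPUs that are already inside the NMI handler could park themselves and spin, rather than wait for a second NMI that never arrives. It is only a model under stated assumptions, not the FreeBSD code and not the patch linked in the quoted thread: NCPU and every name in it (nmi_enter, nmi_all_others_parked, nmi_release, nmi_spin_until_released) are hypothetical, and plain atomics stand in for whatever the kernel would really use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NCPU 4                      /* hypothetical CPU count for this model */

enum nmi_role { NMI_OWNER, NMI_PARKED };

static atomic_int  nmi_owner = -1;  /* CPU that won the race; -1 means none */
static atomic_uint parked_mask;     /* bit n is set once CPU n has parked */
static atomic_bool released;        /* owner sets this to let the others go */

/*
 * Called by every CPU on NMI entry.  Exactly one caller becomes the owner.
 * The losers cannot be stopped by a second NMI, so they park themselves
 * by recording their own bit instead of waiting to be told.
 */
static enum nmi_role nmi_enter(int cpu)
{
    int expected = -1;

    assert(cpu >= 0 && cpu < NCPU);
    if (atomic_compare_exchange_strong(&nmi_owner, &expected, cpu))
        return NMI_OWNER;
    atomic_fetch_or(&parked_mask, 1u << cpu);
    return NMI_PARKED;
}

/* Owner side: true once every CPU except the owner has parked. */
static bool nmi_all_others_parked(int owner)
{
    unsigned want = ((1u << NCPU) - 1) & ~(1u << owner);

    return atomic_load(&parked_mask) == want;
}

/* Owner side: release the parked CPUs once the debugger or panic is done. */
static void nmi_release(void)
{
    atomic_store(&released, true);
}

/* Parked side: busy-wait only; no sleeping and no lock acquisition. */
static void nmi_spin_until_released(void)
{
    while (!atomic_load(&released))
        ;   /* a real kernel would put a CPU pause hint here */
}
```

In this model the owner polls nmi_all_others_parked() instead of sending stop NMIs, and the parked CPUs touch nothing but atomics, which is why it would be safe regardless of what the interrupted code was holding. The cost is exactly the one Adrian raises: every parked CPU burns cycles until nmi_release().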

Eric

On 08/21/2015 10:25, Adrian Chadd wrote:
> Ah, cool. I'll give it a whirl.
> 
> I'm a little worried about having all of the other cores spinning in
> this case (mostly thermal; the machines get VERY LOUD when the CPUs
> are spinning..)
> 
> 
> -a
> 
> 
> On 21 August 2015 at 08:19, Eric van Gyzen <vangyzen@freebsd.org> wrote:
>> I mentioned this to Adrian, but I'll mention it here for everyone else's benefit.
>>
>> Ryan is exactly right.  There was a thread a while ago, with a proposed patch from Kostik:
>>
>> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>>
>> As I recall, Scott Long also ran into this a few months ago.
>>
>> It happens for any NMI:  entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>>
>> Eric
>>
>> On 08/21/2015 09:23, Ryan Stone wrote:
>>> I have seen similar behaviour before.  The problem is that every CPU
>>> receives an NMI concurrently.  As I recall, one of them gets some kind of
>>> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
>>> because they are already in an NMI handler, they don't get the second NMI
>>> and don't stop properly.
>>>
>>> The case that I saw actually had to do with a panic triggered by an NMI,
>>> not entering the debugger, but I believe that both cases use
>>> stop_cpus_hard() under the hood and have a similar issue.
>>>
>>> (I also recall seeing the exact situation that you describe while
>>> originally developing SR-IOV on an alpha version of the Fortville hardware
>>> and firmware with a very buggy SR-IOV implementation.  I've never seen it
>>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>>>
>>>
>>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>>>
>>>> Hi!
>>>>
>>>> This has started happening on -HEAD recently. No, I don't have any
>>>> more details yet than "recently."
>>>>
>>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>>> have any ideas?
>>>>
>>>>
>>>> -adrian
>>
>