Date:      Fri, 21 Aug 2015 16:31:13 +0100
From:      Scott Long <scottl@netflix.com>
To:        Eric van Gyzen <vangyzen@FreeBSD.org>
Cc:        Ryan Stone <rysto32@gmail.com>, Adrian Chadd <adrian@freebsd.org>, freebsd-current <freebsd-current@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
Subject:   Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
Message-ID:  <E45CB08A-AC34-45FB-967E-FD467F1AF2A8@netflix.com>
In-Reply-To: <55D74193.4020008@FreeBSD.org>
References:  <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com> <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com> <55D74193.4020008@FreeBSD.org>


I might have a fix for this. I’ll check the Netflix repo and see whether it’s ready to go upstream to FreeBSD.

Scott

> On Aug 21, 2015, at 4:19 PM, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
> 
> I mentioned this to Adrian, but I'll mention it here for everyone else's benefit.
> 
> Ryan is exactly right.  There was a thread a while ago, with a proposed patch from Kostik:
> 
> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
> 
> As I recall, Scott Long also ran into this a few months ago.
> 
> It happens for any NMI:  entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
> 
> Eric
> 
> On 08/21/2015 09:23, Ryan Stone wrote:
>> I have seen similar behaviour before.  The problem is that every CPU
>> receives an NMI concurrently.  As I recall, one of them gets some kind of
>> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
>> because they are already in an NMI handler, they don't get the second NMI
>> and don't stop properly.
>> 
>> The case that I saw actually had to do with a panic triggered by an NMI,
>> not entering the debugger, but I believe that both cases use
>> stop_cpus_hard() under the hood and have a similar issue.
>> 
>> (I also recall seeing the exact situation that you describe while
>> originally developing SR-IOV on an alpha version of the Fortville hardware
>> and firmware with a very buggy SR-IOV implementation.  I've never seen it
>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>> 
>> 
>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>> 
>>> Hi!
>>> 
>>> This has started happening on -HEAD recently. No, I don't have any
>>> more details yet than "recently."
>>> 
>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>> have any ideas?
>>> 
>>> 
>>> -adrian
> 
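
The failure mode Ryan describes in the quoted text can be modelled in a few lines of C. This is only a toy, single-threaded sketch of the reasoning, not FreeBSD code: `struct cpu`, `send_stop_nmi()`, `stop_others()`, `simulate()` and `NCPU` are all hypothetical names invented for illustration, and the model assumes that CPU 0 is the one that wins the "pseudo-spinlock". It does not reproduce what stop_cpus_hard() actually does.

```c
#include <stdbool.h>
#include <stdio.h>

#define NCPU 4	/* assumed CPU count for this model */

/* Hypothetical per-CPU state; not FreeBSD's real structures. */
struct cpu {
	bool in_nmi;	/* true while the CPU is inside an NMI handler */
	bool stopped;	/* true once the CPU has taken the stop NMI */
};

/*
 * Model delivery of a "stop" NMI.  Further NMIs are held off while a
 * CPU is already running an NMI handler, so such a CPU never sees the
 * stop request.
 */
static bool
send_stop_nmi(struct cpu *target)
{
	if (target->in_nmi)
		return (false);
	target->in_nmi = true;
	target->stopped = true;
	return (true);
}

/* The winning CPU tries to stop all others; returns how many did not stop. */
static int
stop_others(struct cpu cpus[], int winner)
{
	int failed = 0;

	for (int i = 0; i < NCPU; i++) {
		if (i == winner)
			continue;
		if (!send_stop_nmi(&cpus[i])) {
			printf("failed to stop cpu %d\n", i);
			failed++;
		}
	}
	return (failed);
}

/*
 * broadcast == true:  every CPU received the original NMI concurrently.
 * broadcast == false: only CPU 0 received it.
 * CPU 0 is assumed to win the lock in both cases.
 */
static int
simulate(bool broadcast)
{
	struct cpu cpus[NCPU] = {0};

	for (int i = 0; i < NCPU; i++)
		cpus[i].in_nmi = broadcast || i == 0;
	return (stop_others(cpus, 0));
}
```

In this model, `simulate(true)` reports a failure for every CPU other than the winner, which matches the slew of "failed to stop cpu" messages, while `simulate(false)` stops all of the other CPUs cleanly.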




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?E45CB08A-AC34-45FB-967E-FD467F1AF2A8>