Date: Fri, 21 Aug 2015 10:41:31 -0500
From: Eric van Gyzen <vangyzen@FreeBSD.org>
To: Adrian Chadd <adrian@freebsd.org>
Cc: Ryan Stone <rysto32@gmail.com>, freebsd-current <freebsd-current@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>, Scott Long <scottl@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
Message-ID: <55D746AB.6040001@FreeBSD.org>
In-Reply-To: <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG%2BDdZ3fZdvFnan06g@mail.gmail.com>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
 <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
 <55D74193.4020008@FreeBSD.org>
 <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG%2BDdZ3fZdvFnan06g@mail.gmail.com>

Spinning is probably the only safe option in NMI context, since the NMI
could have arrived at literally any time in any context (e.g. holding a
spin lock, interrupts disabled).  :-/

Eric

On 08/21/2015 10:25, Adrian Chadd wrote:
> Ah, cool. I'll give it a whirl.
>
> I'm a little worried about having all of the other cores spinning in
> this case (mostly thermal; the machines get VERY LOUD when the CPUs
> are spinning..)
>
>
> -a
>
>
> On 21 August 2015 at 08:19, Eric van Gyzen <vangyzen@freebsd.org> wrote:
>> I mentioned this to Adrian, but I'll mention here for everyone else's benefit.
>>
>> Ryan is exactly right. There was a thread a while ago, with a proposed patch from Kostik:
>>
>> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>>
>> As I recall, Scott Long also ran into this a few months ago.
>>
>> It happens for any NMI: entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>>
>> Eric
>>
>> On 08/21/2015 09:23, Ryan Stone wrote:
>>> I have seen similar behaviour before. The problem is that every CPU
>>> receives an NMI concurrently. As I recall, one of them gets some kind of
>>> pseudo-spinlock and tries to stop the other CPUs with an NMI. However,
>>> because they are already in an NMI handler, they don't get the second NMI
>>> and don't stop properly.
>>>
>>> The case that I saw actually had to do with a panic triggered by an NMI,
>>> not entering the debugger, but I believe that both cases use
>>> stop_cpus_hard() under the hood and have a similar issue.
>>>
>>> (I also recall seeing the exact situation that you describe while
>>> originally developing SR-IOV on an alpha version of the Fortville hardware
>>> and firmware with a very buggy SR-IOV implementation. I've never seen it
>>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>>>
>>>
>>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>>>
>>>> Hi!
>>>>
>>>> This has started happening on -HEAD recently. No, I don't have any
>>>> more details yet than "recently."
>>>>
>>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>>> have any ideas?
>>>>
>>>>
>>>> -adrian
>>
>
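
For readers following the thread, the failure Ryan describes above can be sketched as a toy model. This is only an illustration of the reasoning, not FreeBSD kernel code: the names Cpu, deliver_nmi, stop_others and simulate are hypothetical, and the model assumes (per Ryan's description) that a CPU already running an NMI handler does not take a second NMI. It is written in Python purely so it can be run anywhere.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Toy CPU: tracks only whether it is inside an NMI handler or stopped."""
    ident: int
    in_nmi: bool = False
    stopped: bool = False

    def deliver_nmi(self) -> bool:
        """Return True if the handler is entered, False if already in one."""
        if self.in_nmi:
            # Assumption from the thread: the second NMI is not taken.
            return False
        self.in_nmi = True
        return True


def stop_others(cpus, winner):
    """Hypothetical stand-in for a stop-by-NMI request from the winning CPU.

    Returns the ids of the CPUs that could not be stopped.
    """
    failed = []
    for cpu in cpus:
        if cpu.ident == winner:
            continue
        if cpu.deliver_nmi():
            cpu.stopped = True  # the handler ran and honoured the stop request
        else:
            failed.append(cpu.ident)  # "failed to stop cpu"
    return failed


def simulate(ncpus, broadcast):
    """Raise the first NMI on every CPU (broadcast) or only on CPU 0,
    then let CPU 0 try to stop the rest."""
    cpus = [Cpu(i) for i in range(ncpus)]
    for cpu in cpus:
        if broadcast or cpu.ident == 0:
            cpu.deliver_nmi()
    return stop_others(cpus, winner=0)


if __name__ == "__main__":
    print(simulate(4, broadcast=True))   # every other CPU fails to stop
    print(simulate(4, broadcast=False))  # no failures
```

With a broadcast NMI every other CPU is already in its handler, so each stop request fails. When only one CPU takes the original NMI, the others can still be stopped. The model says nothing about how a real fix should look; it only restates the race in executable form.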