Date: Fri, 21 Aug 2015 16:31:13 +0100
From: Scott Long <scottl@netflix.com>
To: Eric van Gyzen <vangyzen@FreeBSD.org>
Cc: Ryan Stone <rysto32@gmail.com>, Adrian Chadd <adrian@freebsd.org>, freebsd-current <freebsd-current@freebsd.org>, "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
Message-ID: <E45CB08A-AC34-45FB-967E-FD467F1AF2A8@netflix.com>
In-Reply-To: <55D74193.4020008@FreeBSD.org>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com> <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com> <55D74193.4020008@FreeBSD.org>

I might have a fix for this, I’ll check the netflix repo and see if it’s something that is ready to go upstream to freebsd.

Scott

> On Aug 21, 2015, at 4:19 PM, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>
> I mentioned this to Adrian, but I'll mention here for everyone else's benefit.
>
> Ryan is exactly right. There was a thread a while ago, with a proposed patch from Kostik:
>
> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>
> As I recall, Scott Long also ran into this a few months ago.
>
> It happens for any NMI: entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>
> Eric
>
> On 08/21/2015 09:23, Ryan Stone wrote:
>> I have seen similar behaviour before. The problem is that every CPU
>> receives an NMI concurrently. As I recall, one of them gets some kind of
>> pseudo-spinlock and tries to stop the other CPUs with an NMI. However,
>> because they are already in an NMI handler, they don't get the second NMI
>> and don't stop properly.
>>
>> The case that I saw actually had to do with a panic triggered by an NMI,
>> not entering the debugger, but I believe that both cases use
>> stop_cpus_hard() under the hood and have a similar issue.
>>
>> (I also recall seeing the exact situation that you describe while
>> originally developing SR-IOV on an alpha version of the Fortville hardware
>> and firmware with a very buggy SR-IOV implementation. I've never seen it
>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>>
>>
>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>>
>>> Hi!
>>>
>>> This has started happening on -HEAD recently. No, I don't have any
>>> more details yet than "recently."
>>>
>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>> have any ideas?
>>>
>>>
>>> -adrian