From owner-freebsd-arch@freebsd.org  Thu Aug 20 22:15:09 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1E75E9BE5C6;
 Thu, 20 Aug 2015 22:15:09 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-ig0-x22a.google.com (mail-ig0-x22a.google.com
 [IPv6:2607:f8b0:4001:c05::22a])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id E2779BB3;
 Thu, 20 Aug 2015 22:15:08 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: by igxp17 with SMTP id p17so1709222igx.1;
 Thu, 20 Aug 2015 15:15:08 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:date:message-id:subject:from:to:content-type;
 bh=Il7S7eqt94dWYuTGW/z/8qoN8ZQszbjZaY1Np8xirhM=;
 b=B9sqlhRTtHhFWrHx6kLmvpKCXWolVf1T8h7slnmBSRVtRkTnpzjVz12ZaPqtnZlYPn
 YeNuiI+RvAGGN2mdQIRr2r4K/rYHi1YeRYe0DeLBw7O1xf2ClTbpV+jxqhgeS7xvc8P8
 v4M3R5OUDYwET1IhfiPLMzIroQnlHFoBTIXR+myYL/Tfwl7GrV6oHxyWQohbRUc/jCVb
 LtKwGXWaScRAG4iniVwK++FfX+ZO/oB6uEgBEYwV1/1euBYiBJAbXgwO+M4hUqzvEmQv
 ZpoanVGeZb3S2sAU3pBwRRVHtfCEsyPl4Ef7M6al6CebIN3k/dc2ck5x6JpRnZMIAA9y
 61hA==
MIME-Version: 1.0
X-Received: by 10.50.28.70 with SMTP id z6mr275073igg.61.1440108908297; Thu,
 20 Aug 2015 15:15:08 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.36.38.133 with HTTP; Thu, 20 Aug 2015 15:15:08 -0700 (PDT)
Date: Thu, 20 Aug 2015 15:15:08 -0700
X-Google-Sender-Auth: zKga-ms6LMw5DejYcOr_EiFXwIo
Message-ID: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
Subject: freebsd-head: suddenly NMI panics lead to ddb being unable to stop
 CPUs?
From: Adrian Chadd <adrian@freebsd.org>
To: "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 freebsd-current <freebsd-current@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 20 Aug 2015 22:15:09 -0000

Hi!

This has started happening on -HEAD recently. No, I don't have any
more details yet than "recently."

Whenever I get an NMI panic (and getting an NMI is a separate issue,
sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
have any ideas?


-adrian

From owner-freebsd-arch@freebsd.org  Fri Aug 21 14:23:37 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 464CF9BFB3C;
 Fri, 21 Aug 2015 14:23:37 +0000 (UTC)
 (envelope-from rysto32@gmail.com)
Received: from mail-ig0-x22c.google.com (mail-ig0-x22c.google.com
 [IPv6:2607:f8b0:4001:c05::22c])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id 121E71F6;
 Fri, 21 Aug 2015 14:23:37 +0000 (UTC)
 (envelope-from rysto32@gmail.com)
Received: by igcse8 with SMTP id se8so1064326igc.1;
 Fri, 21 Aug 2015 07:23:36 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:in-reply-to:references:date:message-id:subject:from:to
 :cc:content-type;
 bh=8XpdXy5a+XfqrUSNmA5faFNfkPrxj2fATsWUZ3hRh6c=;
 b=JYaH0kDUSuAZCaGfhXL6s7u+fBelpNf9Yc4Pcba1IF2e3YEmYbvUCn+hL/LiVbWTL0
 LWqpJoIsbeYl/5kGRtm/sF76NTA2IQMpMaPa2bATWO50lPzn2JXmjIfhSn/+r8jK6fu/
 lWqAsUN6CGDWK1uM/5kA1/oydup6GhDoYtS7cPECcddhPlX3SxUSMmuH6bVSBX52jvNr
 Q0vBm0TdSWfcrzykHi/RTefvj5sve48UFe4OjFhxsCoTwDv1UUebkE8Q2vby/gEw94yT
 C2czBUqNlG5So5QCBzGbOG4fdB/0fMcgDiHSE1V097noMbtf5VkurU9UbrjmM3jYOWqB
 zpoQ==
MIME-Version: 1.0
X-Received: by 10.50.124.4 with SMTP id me4mr3174071igb.34.1440167016204; Fri,
 21 Aug 2015 07:23:36 -0700 (PDT)
Received: by 10.107.169.94 with HTTP; Fri, 21 Aug 2015 07:23:36 -0700 (PDT)
In-Reply-To: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
Date: Fri, 21 Aug 2015 10:23:36 -0400
Message-ID: <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to
 stop CPUs?
From: Ryan Stone <rysto32@gmail.com>
To: Adrian Chadd <adrian@freebsd.org>
Cc: "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 freebsd-current <freebsd-current@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-Content-Filtered-By: Mailman/MimeDel 2.1.20
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 21 Aug 2015 14:23:37 -0000

I have seen similar behaviour before.  The problem is that every CPU
receives an NMI concurrently.  As I recall, one of them gets some kind of
pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
because they are already in an NMI handler, they don't get the second NMI
and don't stop properly.

The case that I saw actually had to do with a panic triggered by an NMI,
not entering the debugger, but I believe that both cases use
stop_cpus_hard() under the hood and have a similar issue.
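
A toy model of the scenario described above may make the failure easier to see. It is a sketch under stated assumptions, not FreeBSD code: the function name `stop_other_cpus` and its parameters are hypothetical, and the only rule it encodes is the one stated above, that a CPU already inside an NMI handler does not take the second (stop) NMI.

```python
def stop_other_cpus(num_cpus, initiator, cpus_in_nmi):
    """Toy model: return (stopped, failed) CPU lists for one stop attempt.

    cpus_in_nmi is the set of CPUs already running an NMI handler; in this
    model such a CPU does not take the stop NMI, as described above.
    """
    stopped, failed = [], []
    for cpu in range(num_cpus):
        if cpu == initiator:
            continue  # the initiator does not stop itself
        if cpu in cpus_in_nmi:
            failed.append(cpu)   # would be reported as "failed to stop cpu"
        else:
            stopped.append(cpu)
    return stopped, failed


# Only the initiator is in an NMI handler: every other CPU can be stopped.
print(stop_other_cpus(4, 0, {0}))
# NMI delivered to every CPU at once: no other CPU can be stopped.
print(stop_other_cpus(4, 0, {0, 1, 2, 3}))
```

The second call corresponds to the case in this thread, where every CPU other than the initiator ends up in the failed list.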

(I also recall seeing the exact situation that you describe while
originally developing SR-IOV on an alpha version of the Fortville hardware
and firmware with a very buggy SR-IOV implementation.  I've never seen it
on ixgbe before, although I haven't used SR-IOV there very much at all)


On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:

> Hi!
>
> This has started happening on -HEAD recently. No, I don't have any
> more details yet than "recently."
>
> Whenever I get an NMI panic (and getting an NMI is a separate issue,
> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
> have any ideas?
>
>
> -adrian
>

From owner-freebsd-arch@freebsd.org  Fri Aug 21 15:19:52 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id F25659BF562;
 Fri, 21 Aug 2015 15:19:52 +0000 (UTC)
 (envelope-from vangyzen@FreeBSD.org)
Received: from smtp.vangyzen.net (hotblack.vangyzen.net [199.48.133.146])
 by mx1.freebsd.org (Postfix) with ESMTP id D6EF21F7D;
 Fri, 21 Aug 2015 15:19:52 +0000 (UTC)
 (envelope-from vangyzen@FreeBSD.org)
Received: from marvin.beer.town (unknown [76.164.8.130])
 by smtp.vangyzen.net (Postfix) with ESMTPSA id F1B9556486;
 Fri, 21 Aug 2015 10:19:48 -0500 (CDT)
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to
 stop CPUs?
To: Ryan Stone <rysto32@gmail.com>, Adrian Chadd <adrian@freebsd.org>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
 <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
Cc: freebsd-current <freebsd-current@freebsd.org>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 Scott Long <scottl@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
From: Eric van Gyzen <vangyzen@FreeBSD.org>
X-Enigmail-Draft-Status: N1110
Message-ID: <55D74193.4020008@FreeBSD.org>
Date: Fri, 21 Aug 2015 10:19:47 -0500
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:38.0) Gecko/20100101
 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 21 Aug 2015 15:19:53 -0000

I mentioned this to Adrian, but I'll mention it here for everyone else's benefit.

Ryan is exactly right.  There was a thread a while ago, with a proposed patch from Kostik:

https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html

As I recall, Scott Long also ran into this a few months ago.

It happens for any NMI:  entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.

Eric

On 08/21/2015 09:23, Ryan Stone wrote:
> I have seen similar behaviour before.  The problem is that every CPU
> receives an NMI concurrently.  As I recall, one of them gets some kind of
> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
> because they are already in an NMI handler, they don't get the second NMI
> and don't stop properly.
>
> The case that I saw actually had to do with a panic triggered by an NMI,
> not entering the debugger, but I believe that both cases use
> stop_cpus_hard() under the hood and have a similar issue.
>
> (I also recall seeing the exact situation that you describe while
> originally developing SR-IOV on an alpha version of the Fortville hardware
> and firmware with a very buggy SR-IOV implementation.  I've never seen it
> on ixgbe before, although I haven't used SR-IOV there very much at all)
>
>
> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>
>> Hi!
>>
>> This has started happening on -HEAD recently. No, I don't have any
>> more details yet than "recently."
>>
>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>> have any ideas?
>>
>>
>> -adrian


From owner-freebsd-arch@freebsd.org  Fri Aug 21 15:25:28 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2BCD39BF7F8;
 Fri, 21 Aug 2015 15:25:28 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: from mail-ig0-x22b.google.com (mail-ig0-x22b.google.com
 [IPv6:2607:f8b0:4001:c05::22b])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id EA661D96;
 Fri, 21 Aug 2015 15:25:27 +0000 (UTC)
 (envelope-from adrian.chadd@gmail.com)
Received: by igcse8 with SMTP id se8so2556526igc.1;
 Fri, 21 Aug 2015 08:25:27 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:date:message-id:subject
 :from:to:cc:content-type;
 bh=X4uZtVPkb9LhtlE7ioIifO+YWVaMEHNjk+VElTGPoNA=;
 b=DUj+A+Ts9d6yzbvPjeK0WH6U2MnhoQZCgDwQnIepKzmK2wjhZoUjVB7qWfY4eMZkuO
 it9oemnPI5SkaJukI3kzvQehKh4dmunsJw4KROi2HA2+FiEW34OqkRMzyfM949OrSvYR
 tXYklNtq8I8ZamFASkElAdsP9UWfYgf+kygn2BnmpYY8dRR/vpbg5RXl+18F1HXHV25N
 R5HSb4NXPWU04SdH1KRNGtP0lU7izDqgqb/E0i+OLY1VJ8F3Y/fDJTrtZefQPMOrskfJ
 HDAXG2YP45KaZ9lPevhMJk6GSHoQvovVHsDuH5k3B1PhooNZHlfRsRjdx2G/fsbr80Me
 KfWA==
MIME-Version: 1.0
X-Received: by 10.50.128.169 with SMTP id np9mr3223564igb.37.1440170727275;
 Fri, 21 Aug 2015 08:25:27 -0700 (PDT)
Sender: adrian.chadd@gmail.com
Received: by 10.36.38.133 with HTTP; Fri, 21 Aug 2015 08:25:27 -0700 (PDT)
In-Reply-To: <55D74193.4020008@FreeBSD.org>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
 <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
 <55D74193.4020008@FreeBSD.org>
Date: Fri, 21 Aug 2015 08:25:27 -0700
X-Google-Sender-Auth: NlJDAIsUGthreT7rUOt8WiaZWQU
Message-ID: <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG+DdZ3fZdvFnan06g@mail.gmail.com>
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to
 stop CPUs?
From: Adrian Chadd <adrian@freebsd.org>
To: Eric van Gyzen <vangyzen@freebsd.org>
Cc: Ryan Stone <rysto32@gmail.com>,
 freebsd-current <freebsd-current@freebsd.org>, 
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 Scott Long <scottl@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
Content-Type: text/plain; charset=UTF-8
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 21 Aug 2015 15:25:28 -0000

Ah, cool. I'll give it a whirl.

I'm a little worried about having all of the other cores spinning in
this case (mostly thermal; the machines get VERY LOUD when the CPUs
are spinning..)


-a


On 21 August 2015 at 08:19, Eric van Gyzen <vangyzen@freebsd.org> wrote:
> I mentioned this to Adrian, but I'll mention here for everyone else's benefit.
>
> Ryan is exactly right.  There was a thread a while ago, with a proposed patch from Kostik:
>
> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>
> As I recall, Scott Long also ran into this a few months ago.
>
> It happens for any NMI:  entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>
> Eric
>
> On 08/21/2015 09:23, Ryan Stone wrote:
>> I have seen similar behaviour before.  The problem is that every CPU
>> receives an NMI concurrently.  As I recall, one of them gets some kind of
>> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
>> because they are already in an NMI handler, they don't get the second NMI
>> and don't stop properly.
>>
>> The case that I saw actually had to do with a panic triggered by an NMI,
>> not entering the debugger, but I believe that both cases use
>> stop_cpus_hard() under the hood and have a similar issue.
>>
>> (I also recall seeing the exact situation that you describe while
>> originally developing SR-IOV on an alpha version of the Fortville hardware
>> and firmware with a very buggy SR-IOV implementation.  I've never seen it
>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>>
>>
>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>>
>>> Hi!
>>>
>>> This has started happening on -HEAD recently. No, I don't have any
>>> more details yet than "recently."
>>>
>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>> have any ideas?
>>>
>>>
>>> -adrian
>

From owner-freebsd-arch@freebsd.org  Fri Aug 21 15:31:41 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2804D9BFA22
 for <freebsd-arch@mailman.ysv.freebsd.org>;
 Fri, 21 Aug 2015 15:31:41 +0000 (UTC)
 (envelope-from scottl@netflix.com)
Received: from mail-qg0-x22f.google.com (mail-qg0-x22f.google.com
 [IPv6:2607:f8b0:400d:c04::22f])
 (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
 (Client CN "smtp.gmail.com",
 Issuer "Google Internet Authority G2" (verified OK))
 by mx1.freebsd.org (Postfix) with ESMTPS id CC8CF15DE
 for <freebsd-arch@freebsd.org>; Fri, 21 Aug 2015 15:31:40 +0000 (UTC)
 (envelope-from scottl@netflix.com)
Received: by qgeb6 with SMTP id b6so48896785qge.3
 for <freebsd-arch@freebsd.org>; Fri, 21 Aug 2015 08:31:39 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=netflix.com; s=google;
 h=content-type:mime-version:subject:from:in-reply-to:date:cc
 :content-transfer-encoding:message-id:references:to;
 bh=uGrVrWiUW14/gt6oq/sH1XS8e3oBwPAgU/YbwFOmAwc=;
 b=XnE77acbCDIRDdsX0p49NR1y3rjwlvQZ4F+HgBuioz3tNalGbPDpqy+eO77+bWGQDH
 FibPJJpRl6fb2S+rWfUcu2ZnvKs4U9MSq4jKxbYxTj3GGopB2zl52i7+p91NApLmG8jj
 vvUeSKbcVorcCCzuSeVTQQUfjdd2JQT55whrg=
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20130820;
 h=x-gm-message-state:content-type:mime-version:subject:from
 :in-reply-to:date:cc:content-transfer-encoding:message-id:references
 :to; bh=uGrVrWiUW14/gt6oq/sH1XS8e3oBwPAgU/YbwFOmAwc=;
 b=OVa8EFxedzHU59T0RHfDgDq0z9qIgnkPuggSIsB3vzQY2nBFXNuEPkLX70PuYccJY9
 YdfX15Z+XEUtpIlaJ3rAkrbf1kGttbkgEXMrVX0Di0tMlkhnCbiuu+0aaXzr0Cyp7A0n
 AIGGtISH4PSzzCmmwPmoKgTbofOA37tuMJ0+lML7qwt+8vtYva+n9VvtvbbU5h+oiyUB
 3mAKykQM37z5op/9hz5d+2xj8kkX9yZsuNocVTJVTXkRELxiWRVvObG/3TSGplQc3Ha2
 m0AzG16/4kh7NfmfhbaNHLYKTtMXDgxPcWxjYbj0uZm/YjhWYtjS87Sm0Nykyspwivtu
 zIpA==
X-Gm-Message-State: ALoCoQkWCutQFGAcLgVrohbwF3ajmp0DxuYHw7G5erL6WzkeQ8YJr4aum0mkyAkCKRf0zCY/K9cK
X-Received: by 10.140.232.20 with SMTP id d20mr19542800qhc.72.1440171099308;
 Fri, 21 Aug 2015 08:31:39 -0700 (PDT)
Received: from [172.19.248.72] ([64.88.227.134])
 by smtp.gmail.com with ESMTPSA id 36sm4555659qgp.8.2015.08.21.08.31.27
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Fri, 21 Aug 2015 08:31:38 -0700 (PDT)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 8.2 \(2098\))
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to
 stop CPUs?
From: Scott Long <scottl@netflix.com>
In-Reply-To: <55D74193.4020008@FreeBSD.org>
Date: Fri, 21 Aug 2015 16:31:13 +0100
Cc: Ryan Stone <rysto32@gmail.com>, Adrian Chadd <adrian@freebsd.org>,
 freebsd-current <freebsd-current@freebsd.org>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 Konstantin Belousov <kib@freebsd.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <E45CB08A-AC34-45FB-967E-FD467F1AF2A8@netflix.com>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
 <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
 <55D74193.4020008@FreeBSD.org>
To: Eric van Gyzen <vangyzen@FreeBSD.org>
X-Mailer: Apple Mail (2.2098)
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 21 Aug 2015 15:31:41 -0000

I might have a fix for this, I’ll check the netflix repo and see if it’s something that is ready to go upstream to freebsd.

Scott

> On Aug 21, 2015, at 4:19 PM, Eric van Gyzen <vangyzen@FreeBSD.org> wrote:
>
> I mentioned this to Adrian, but I'll mention here for everyone else's benefit.
>
> Ryan is exactly right.  There was a thread a while ago, with a proposed patch from Kostik:
>
> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>
> As I recall, Scott Long also ran into this a few months ago.
>
> It happens for any NMI:  entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>
> Eric
>
> On 08/21/2015 09:23, Ryan Stone wrote:
>> I have seen similar behaviour before.  The problem is that every CPU
>> receives an NMI concurrently.  As I recall, one of them gets some kind of
>> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
>> because they are already in an NMI handler, they don't get the second NMI
>> and don't stop properly.
>>
>> The case that I saw actually had to do with a panic triggered by an NMI,
>> not entering the debugger, but I believe that both cases use
>> stop_cpus_hard() under the hood and have a similar issue.
>>
>> (I also recall seeing the exact situation that you describe while
>> originally developing SR-IOV on an alpha version of the Fortville hardware
>> and firmware with a very buggy SR-IOV implementation.  I've never seen it
>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>>
>>
>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>>
>>> Hi!
>>>
>>> This has started happening on -HEAD recently. No, I don't have any
>>> more details yet than "recently."
>>>
>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>> have any ideas?
>>>
>>>
>>> -adrian
>


From owner-freebsd-arch@freebsd.org  Fri Aug 21 15:41:34 2015
Return-Path: <owner-freebsd-arch@freebsd.org>
Delivered-To: freebsd-arch@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org
 [IPv6:2001:1900:2254:206a::19:1])
 by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9BA019BFD63;
 Fri, 21 Aug 2015 15:41:34 +0000 (UTC)
 (envelope-from vangyzen@FreeBSD.org)
Received: from smtp.vangyzen.net (hotblack.vangyzen.net
 [IPv6:2607:fc50:1000:7400:216:3eff:fe72:314f])
 by mx1.freebsd.org (Postfix) with ESMTP id 7E7FB8;
 Fri, 21 Aug 2015 15:41:34 +0000 (UTC)
 (envelope-from vangyzen@FreeBSD.org)
Received: from marvin.beer.town (unknown [76.164.8.130])
 by smtp.vangyzen.net (Postfix) with ESMTPSA id 08DC756486;
 Fri, 21 Aug 2015 10:41:32 -0500 (CDT)
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to
 stop CPUs?
To: Adrian Chadd <adrian@freebsd.org>
References: <CAJ-VmomvqULP--v47qKJisQkf8VQNvxEhXK=HXEtv9MuLz4D1g@mail.gmail.com>
 <CAFMmRNw6tWMQ-pfXzSpEM7kRgKafB9KnK-oUhWw2_E-P91drLw@mail.gmail.com>
 <55D74193.4020008@FreeBSD.org>
 <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG+DdZ3fZdvFnan06g@mail.gmail.com>
Cc: Ryan Stone <rysto32@gmail.com>,
 freebsd-current <freebsd-current@freebsd.org>,
 "freebsd-arch@freebsd.org" <freebsd-arch@freebsd.org>,
 Scott Long <scottl@freebsd.org>, Konstantin Belousov <kib@freebsd.org>
From: Eric van Gyzen <vangyzen@FreeBSD.org>
X-Enigmail-Draft-Status: N1110
Message-ID: <55D746AB.6040001@FreeBSD.org>
Date: Fri, 21 Aug 2015 10:41:31 -0500
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:38.0) Gecko/20100101
 Thunderbird/38.2.0
MIME-Version: 1.0
In-Reply-To: <CAJ-Vmon6xXBSMPWgNhg-RZKLuuMDP1hvXG+DdZ3fZdvFnan06g@mail.gmail.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-arch@freebsd.org
X-Mailman-Version: 2.1.20
Precedence: list
List-Id: Discussion related to FreeBSD architecture <freebsd-arch.freebsd.org>
List-Unsubscribe: <https://lists.freebsd.org/mailman/options/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-arch/>
List-Post: <mailto:freebsd-arch@freebsd.org>
List-Help: <mailto:freebsd-arch-request@freebsd.org?subject=help>
List-Subscribe: <https://lists.freebsd.org/mailman/listinfo/freebsd-arch>,
 <mailto:freebsd-arch-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Fri, 21 Aug 2015 15:41:34 -0000

Spinning is probably the only safe option in NMI context, since the NMI could have arrived at literally any time in any context (e.g. holding a spin lock, interrupts disabled).  :-/
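
A toy illustration of the spin-lock case mentioned above, as a sketch only: `threading.Lock` stands in for a kernel spin lock, and `handler_could_take` and `held_lock` are hypothetical names. The point is that the interrupted code may already hold the lock, and it cannot run again to release it until the handler returns, so a handler that waited on that lock would never make progress.

```python
import threading


def handler_could_take(lock):
    """Return True if a handler could take `lock` without waiting."""
    acquired = lock.acquire(blocking=False)
    if acquired:
        lock.release()
    return acquired


# Hypothetical stand-in for a kernel spin lock held by the interrupted code.
held_lock = threading.Lock()
held_lock.acquire()

# The interrupted code cannot run (and so cannot release the lock) until the
# handler returns, so a handler that waited for it would never make progress.
print(handler_could_take(held_lock))
```

Under that assumption the handler can only poll a flag that some other CPU will set, which is the spinning discussed above.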

Eric

On 08/21/2015 10:25, Adrian Chadd wrote:
> Ah, cool. I'll give it a whirl.
> 
> I'm a little worried about having all of the other cores spinning in
> this case (mostly thermal; the machines get VERY LOUD when the CPUs
> are spinning..)
> 
> 
> -a
> 
> 
> On 21 August 2015 at 08:19, Eric van Gyzen <vangyzen@freebsd.org> wrote:
>> I mentioned this to Adrian, but I'll mention here for everyone else's benefit.
>>
>> Ryan is exactly right.  There was a thread a while ago, with a proposed patch from Kostik:
>>
>> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>>
>> As I recall, Scott Long also ran into this a few months ago.
>>
>> It happens for any NMI:  entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>>
>> Eric
>>
>> On 08/21/2015 09:23, Ryan Stone wrote:
>>> I have seen similar behaviour before.  The problem is that every CPU
>>> receives an NMI concurrently.  As I recall, one of them gets some kind of
>>> pseudo-spinlock and tries to stop the other CPUs with an NMI.  However,
>>> because they are already in an NMI handler, they don't get the second NMI
>>> and don't stop properly.
>>>
>>> The case that I saw actually had to do with a panic triggered by an NMI,
>>> not entering the debugger, but I believe that both cases use
>>> stop_cpus_hard() under the hood and have a similar issue.
>>>
>>> (I also recall seeing the exact situation that you describe while
>>> originally developing SR-IOV on an alpha version of the Fortville hardware
>>> and firmware with a very buggy SR-IOV implementation.  I've never seen it
>>> on ixgbe before, although I haven't used SR-IOV there very much at all)
>>>
>>>
>>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd <adrian@freebsd.org> wrote:
>>>
>>>> Hi!
>>>>
>>>> This has started happening on -HEAD recently. No, I don't have any
>>>> more details yet than "recently."
>>>>
>>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>>> have any ideas?
>>>>
>>>>
>>>> -adrian
>>
>