From owner-freebsd-arch@freebsd.org Thu Aug 20 22:15:09 2015
Date: Thu, 20 Aug 2015 15:15:08 -0700
Subject: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
From: Adrian Chadd
To: "freebsd-arch@freebsd.org", freebsd-current
List-Id: Discussion related to FreeBSD architecture

Hi!

This has started happening on -HEAD recently. No, I don't have any
more details yet than "recently."

Whenever I get an NMI panic (and getting an NMI is a separate issue,
sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
have any ideas?


-adrian

From owner-freebsd-arch@freebsd.org Fri Aug 21 14:23:37 2015
Date: Fri, 21 Aug 2015 10:23:36 -0400
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
From: Ryan Stone
To: Adrian Chadd
Cc: "freebsd-arch@freebsd.org", freebsd-current

I have seen similar behaviour before. The problem is that every CPU
receives an NMI concurrently. As I recall, one of them gets some kind of
pseudo-spinlock and tries to stop the other CPUs with an NMI. However,
because they are already in an NMI handler, they don't get the second NMI
and don't stop properly.

The case that I saw actually had to do with a panic triggered by an NMI,
not entering the debugger, but I believe that both cases use
stop_cpus_hard() under the hood and have a similar issue.

(I also recall seeing the exact situation that you describe while
originally developing SR-IOV on an alpha version of the Fortville hardware
and firmware with a very buggy SR-IOV implementation. I've never seen it
on ixgbe before, although I haven't used SR-IOV there very much at all.)

From owner-freebsd-arch@freebsd.org Fri Aug 21 15:19:52 2015
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
To: Ryan Stone, Adrian Chadd
Cc: freebsd-current, "freebsd-arch@freebsd.org", Scott Long, Konstantin Belousov
From: Eric van Gyzen
Message-ID: <55D74193.4020008@FreeBSD.org>
Date: Fri, 21 Aug 2015 10:19:47 -0500

I mentioned this to Adrian, but I'll mention it here for everyone else's benefit.

Ryan is exactly right. There was a thread a while ago, with a proposed patch from Kostik:

https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html

As I recall, Scott Long also ran into this a few months ago.

It happens for any NMI: entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.

Eric

From owner-freebsd-arch@freebsd.org Fri Aug 21 15:25:28 2015
In-Reply-To: <55D74193.4020008@FreeBSD.org>
References: <55D74193.4020008@FreeBSD.org>
Date: Fri, 21 Aug 2015 08:25:27 -0700
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
From: Adrian Chadd
To: Eric van Gyzen
Cc: Ryan Stone, freebsd-current, "freebsd-arch@freebsd.org", Scott Long, Konstantin Belousov

Ah, cool. I'll give it a whirl.

I'm a little worried about having all of the other cores spinning in
this case (mostly thermal; the machines get VERY LOUD when the CPUs
are spinning..)


-a

From owner-freebsd-arch@freebsd.org Fri Aug 21 15:31:41 2015
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
From: Scott Long
In-Reply-To: <55D74193.4020008@FreeBSD.org>
Date: Fri, 21 Aug 2015 16:31:13 +0100
Cc: Ryan Stone, Adrian Chadd, freebsd-current, "freebsd-arch@freebsd.org", Konstantin Belousov
References: <55D74193.4020008@FreeBSD.org>
To: Eric van Gyzen

I might have a fix for this. I’ll check the Netflix repo and see if it’s something that is ready to go upstream to FreeBSD.

Scott

From owner-freebsd-arch@freebsd.org Fri Aug 21 15:41:34 2015
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
To: Adrian Chadd
References: <55D74193.4020008@FreeBSD.org>
Cc: Ryan Stone, freebsd-current, "freebsd-arch@freebsd.org", Scott Long, Konstantin Belousov
From: Eric van Gyzen
Message-ID: <55D746AB.6040001@FreeBSD.org>
Date: Fri, 21 Aug 2015 10:41:31 -0500

Spinning is probably the only safe option in NMI context, since the NMI could have arrived at literally any time in any context (e.g. holding a spin lock, interrupts disabled). :-/

Eric
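
The failure Ryan Stone describes in this thread, and the spin-in-place behaviour that Adrian and Eric discuss, can be illustrated with a small toy model. This is only a sketch in Python, not FreeBSD code and not Kostik's proposed patch. It assumes, following Ryan's description, that a CPU already inside its NMI handler does not see a second NMI, and that the CPU which wins the race then checks which of the others reported themselves stopped. The names `Cpu`, `deliver_stop_nmi`, `broadcast_nmi`, and `check_stop_request` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    in_nmi: bool = False    # currently executing the NMI handler
    stopped: bool = False   # has parked itself for the debugger


def deliver_stop_nmi(cpu: Cpu) -> None:
    """Model a stop request sent as an NMI (assumed behaviour, see above)."""
    if cpu.in_nmi:
        return              # already in an NMI handler: the request is not seen
    cpu.in_nmi = True
    cpu.stopped = True      # a CPU that does take the stop NMI parks itself


def broadcast_nmi(cpus, check_stop_request):
    """Every CPU takes an NMI at once; cpus[0] wins the race for the debugger.

    Returns the ids of the CPUs that the winner failed to stop.
    """
    for cpu in cpus:
        cpu.in_nmi = True

    losers = cpus[1:]
    stop_requested = {c.cpu_id for c in losers}  # CPUs the winner wants stopped

    for cpu in losers:
        deliver_stop_nmi(cpu)  # no effect here: every loser is already in NMI

    if check_stop_request:
        # Hypothetical remedy: each loser, still inside its own NMI handler,
        # notices that it is named in the stop mask and spins in place.
        for cpu in losers:
            if cpu.cpu_id in stop_requested:
                cpu.stopped = True

    return [c.cpu_id for c in losers if not c.stopped]


if __name__ == "__main__":
    print(broadcast_nmi([Cpu(i) for i in range(4)], check_stop_request=False))
    print(broadcast_nmi([Cpu(i) for i in range(4)], check_stop_request=True))
```

With four modelled CPUs, the first call returns `[1, 2, 3]` (no other CPU could be stopped, which corresponds to the "failed to stop cpu" symptom) and the second returns `[]`. The model says nothing about the thermal cost Adrian raises; it only shows why a second NMI alone cannot stop CPUs that are already handling one.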