From: Scott Long
Date: Fri, 21 Aug 2015 16:31:13 +0100
To: Eric van Gyzen
Cc: Ryan Stone, Adrian Chadd, freebsd-current, "freebsd-arch@freebsd.org", Konstantin Belousov
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
In-Reply-To: <55D74193.4020008@FreeBSD.org>
List-Id: Discussions about the use of FreeBSD-current

I might have a fix for this. I’ll check the Netflix repo and see if
it’s something that is ready to go upstream to FreeBSD.

Scott

> On Aug 21, 2015, at 4:19 PM, Eric van Gyzen wrote:
>
> I mentioned this to Adrian, but I'll mention here for everyone else's
> benefit.
>
> Ryan is exactly right. There was a thread a while ago, with a
> proposed patch from Kostik:
>
> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>
> As I recall, Scott Long also ran into this a few months ago.
>
> It happens for any NMI: entering the debugger, a PCI Parity or System
> Error, a hardware watchdog timeout, and probably other sources I'm not
> remembering.
>
> Eric
>
> On 08/21/2015 09:23, Ryan Stone wrote:
>> I have seen similar behaviour before.
>> The problem is that every CPU receives an NMI concurrently. As I
>> recall, one of them gets some kind of pseudo-spinlock and tries to
>> stop the other CPUs with an NMI. However, because they are already
>> in an NMI handler, they don't get the second NMI and don't stop
>> properly.
>>
>> The case that I saw actually had to do with a panic triggered by an
>> NMI, not entering the debugger, but I believe that both cases use
>> stop_cpus_hard() under the hood and have a similar issue.
>>
>> (I also recall seeing the exact situation that you describe while
>> originally developing SR-IOV on an alpha version of the Fortville
>> hardware and firmware with a very buggy SR-IOV implementation. I've
>> never seen it on ixgbe before, although I haven't used SR-IOV there
>> very much at all)
>>
>>
>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd wrote:
>>
>>> Hi!
>>>
>>> This has started happening on -HEAD recently. No, I don't have any
>>> more details yet than "recently."
>>>
>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>> have any ideas?
>>>
>>>
>>> -adrian
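
The failure mode Ryan describes in the quoted text can be illustrated
with a toy model. This is only a sketch and not FreeBSD code: the Cpu
class, deliver_nmi(), stop_others() and simulate() are hypothetical
names made up for the illustration, and the model simply assumes that
a CPU already inside an NMI handler does not take a second NMI. It
ignores the pseudo-spinlock and just lets CPU 0 be the one that tries
to stop the rest.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    ident: int
    in_nmi: bool = False   # True while the CPU is inside its NMI handler
    stopped: bool = False  # True once the CPU has honoured a stop request

    def deliver_nmi(self) -> bool:
        """Return True if the NMI is taken, False if it is not."""
        if self.in_nmi:
            # Modelling assumption: a CPU that is already handling an NMI
            # does not take another one, so the new request is lost here.
            return False
        self.in_nmi = True
        return True


def stop_others(cpus, initiator):
    """Hypothetical stand-in for an NMI-based stop request.

    Returns the IDs of the CPUs that failed to stop.
    """
    failed = []
    for cpu in cpus:
        if cpu.ident == initiator:
            continue
        if cpu.deliver_nmi():
            cpu.stopped = True
        else:
            failed.append(cpu.ident)
    return failed


def simulate(ncpus, broadcast_nmi):
    cpus = [Cpu(i) for i in range(ncpus)]
    if broadcast_nmi:
        # Every CPU receives the original NMI at the same time.
        for cpu in cpus:
            cpu.deliver_nmi()
    else:
        # Only CPU 0 receives the original NMI.
        cpus[0].deliver_nmi()
    # CPU 0 is taken to have won the race and now tries to stop the others.
    return stop_others(cpus, initiator=0)


if __name__ == "__main__":
    print("broadcast NMI, failed to stop:", simulate(4, broadcast_nmi=True))
    print("single-CPU NMI, failed to stop:", simulate(4, broadcast_nmi=False))
```

In this model the broadcast case leaves every other CPU unstopped, which
matches the slew of "failed to stop cpu" messages in the original
report, while an NMI delivered to a single CPU stops the rest cleanly.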