Date: Fri, 21 Aug 2015 08:25:27 -0700
Subject: Re: freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
From: Adrian Chadd
To: Eric van Gyzen
Cc: Ryan Stone, freebsd-current, "freebsd-arch@freebsd.org", Scott Long, Konstantin Belousov

Ah, cool. I'll give it a whirl.

I'm a little worried about having all of the other cores spinning in
this case (mostly thermal; the machines get VERY LOUD when the CPUs
are spinning..)

-a

On 21 August 2015 at 08:19, Eric van Gyzen wrote:
> I mentioned this to Adrian, but I'll mention here for everyone else's benefit.
>
> Ryan is exactly right. There was a thread a while ago, with a proposed patch from Kostik:
>
> https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html
>
> As I recall, Scott Long also ran into this a few months ago.
>
> It happens for any NMI: entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.
>
> Eric
>
> On 08/21/2015 09:23, Ryan Stone wrote:
>> I have seen similar behaviour before. The problem is that every CPU
>> receives an NMI concurrently. As I recall, one of them gets some kind of
>> pseudo-spinlock and tries to stop the other CPUs with an NMI. However,
>> because they are already in an NMI handler, they don't get the second NMI
>> and don't stop properly.
>>
>> The case that I saw actually had to do with a panic triggered by an NMI,
>> not entering the debugger, but I believe that both cases use
>> stop_cpus_hard() under the hood and have a similar issue.
>>
>> (I also recall seeing the exact situation that you describe while
>> originally developing SR-IOV on an alpha version of the Fortville hardware
>> and firmware with a very buggy SR-IOV implementation.
>> I've never seen it on ixgbe before, although I haven't used SR-IOV there
>> very much at all)
>>
>> On Thu, Aug 20, 2015 at 6:15 PM, Adrian Chadd wrote:
>>
>>> Hi!
>>>
>>> This has started happening on -HEAD recently. No, I don't have any
>>> more details yet than "recently."
>>>
>>> Whenever I get an NMI panic (and getting an NMI is a separate issue,
>>> sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
>>> enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
>>> have any ideas?
>>>
>>> -adrian