From: Attilio Rao
Reply-To: attilio@FreeBSD.org
To: Ryan Stone
Cc: "freebsd-hackers@freebsd.org"
Date: Fri, 16 Nov 2012 00:16:49 +0000
Subject: Re: stop_cpus_hard when multiple CPUs are panicking from an NMI
List-Id: Technical Discussions relating to FreeBSD

On Thu, Nov 15, 2012 at 11:47 PM, Ryan Stone wrote:
> On Thu, Nov 15, 2012 at 6:41 PM, Attilio Rao wrote:
>>
>> On Thu, Nov 15, 2012 at 10:58 PM, Ryan Stone wrote:
>> > At work we have some custom watchdog hardware that sends an NMI upon
>> > expiry. We've modified the kernel to panic when it receives the
>> > watchdog NMI. I've been trying the "stop_scheduler_on_panic" mode,
>> > and I've discovered that when my watchdog expires, the system gets
>> > completely wedged. After some digging, I've discovered that I have
>> > multiple CPUs getting the watchdog NMI and trying to panic
>> > concurrently. One of the CPUs wins, and the rest spin forever in
>> > this code:
>>
>> Quick question: can you control the way your watchdog sends the NMI?
>> Like only to BSP rather than broadcast, etc.
>> This is tied to the very unique situation that you cannot really
>> deliver the (second) NMI.
>>
>> Attilio
>>
>>
>> --
>> Peace can only be achieved by understanding - A. Einstein
>
>
> I don't believe that I can, but I can check. In any case I can imagine
> other places where this could be an issue. hwpmc works with NMIs, right?
> So an hwpmc bug could trigger the same kind of issues if two CPUs that
> concurrently called pmc_intr both tripped over the same bug.

Frankly, I think that what you were trying to do is more or less the
right approach, modulo a clean interface.
I don't understand why the "spinlock" wants to spin forever, as it can
never recover. Stopping the CPUs that get into the "spinlock" is
perfectly fine.
There are only 2 things to consider:
1) I think we need a new KPI for that, a function in
$arch/include/cpu.h that takes care of stopping a CPU in an MI way, for
example cpu_self_stop(). This needs to be implemented for all the
architectures, but that can be done easily because it will basically be
what cpustop_handler() and similar functions do.
2) The "fake spinlock" path will call that function.
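
To make the two points above concrete, here is a rough userland model of
the idea, not kernel code. Only the name cpu_self_stop() comes from the
proposal; everything prefixed with model_ or MODEL_ is hypothetical and
exists only for this sketch. In a real kernel cpu_self_stop() would
presumably disable interrupts, halt in a loop and never return; here it
just records that the CPU was parked so the logic can be exercised.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_MAXCPU	4	/* hypothetical: size of the toy machine */
#define MODEL_NOCPU	(-1)	/* hypothetical: nobody has panicked yet */

/* CPU that owns the panic, or MODEL_NOCPU. */
static atomic_int model_panic_cpu = MODEL_NOCPU;

/* Models the per-CPU "stopped" state. */
static atomic_bool model_cpu_stopped[MODEL_MAXCPU];

/*
 * Model of the proposed cpu_self_stop(): park the calling CPU for good.
 * A kernel version would not return; this one only records the state.
 */
static void
cpu_self_stop(int cpu)
{

	assert(cpu >= 0 && cpu < MODEL_MAXCPU);
	atomic_store(&model_cpu_stopped[cpu], true);
}

/*
 * Model of the "fake spinlock" at panic entry.  Returns true if 'cpu'
 * owns the panic and should run the panic code.  Returns false if some
 * other CPU got there first, in which case 'cpu' has been stopped
 * instead of being left to spin forever.
 */
static bool
model_panic_enter(int cpu)
{
	int expected = MODEL_NOCPU;

	if (atomic_compare_exchange_strong(&model_panic_cpu, &expected, cpu))
		return (true);		/* first CPU in: owns the panic */
	if (expected == cpu)
		return (true);		/* recursive panic on the owner */
	cpu_self_stop(cpu);		/* loser: stop rather than spin */
	return (false);
}
```

With this model, if CPU 2 enters first and CPU 0 enters afterwards,
CPU 2 keeps going and CPU 0 ends up marked as stopped, which is the
behaviour being proposed for the CPUs that lose the race.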
The only thing to debate IMHO is whether we want to make that
conditional on stop_scheduler_on_panic or not. To be honest, stopping
the CPU seems the best approach to me in any case, but I'm open to
hearing what you think.

Comments?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein