Date: Fri, 16 Nov 2012 00:16:49 +0000
From: Attilio Rao <attilio@freebsd.org>
To: Ryan Stone <rysto32@gmail.com>
Cc: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: Re: stop_cpus_hard when multiple CPUs are panicking from an NMI
Message-ID: <CAJ-FndDZFHr4jo196sw3prYG9zQKK-LxnmY0hsXHXe5v_a%2BSFw@mail.gmail.com>
In-Reply-To: <CAFMmRNx3Q_F02CnqHhYKF=HLMu=hhMVP2PhJscAydAFcQKU52w@mail.gmail.com>
References: <CAFMmRNwb_rxYXHGtXgtcyVUJnFDx5PSeMmA_crBbeV_rtzL9Cg@mail.gmail.com>
 <CAJ-FndBQwO0syGpG9mSYF4tAEO8wu6vv7QKbvzQY-9uo_ZJWhA@mail.gmail.com>
 <CAFMmRNx3Q_F02CnqHhYKF=HLMu=hhMVP2PhJscAydAFcQKU52w@mail.gmail.com>

On Thu, Nov 15, 2012 at 11:47 PM, Ryan Stone <rysto32@gmail.com> wrote:
> On Thu, Nov 15, 2012 at 6:41 PM, Attilio Rao <attilio@freebsd.org> wrote:
>>
>> On Thu, Nov 15, 2012 at 10:58 PM, Ryan Stone <rysto32@gmail.com> wrote:
>> > At work we have some custom watchdog hardware that sends an NMI upon
>> > expiry. We've modified the kernel to panic when it receives the
>> > watchdog NMI. I've been trying the "stop_scheduler_on_panic" mode,
>> > and I've discovered that when my watchdog expires, the system gets
>> > completely wedged. After some digging, I've discovered that I have
>> > multiple CPUs getting the watchdog NMI and trying to panic
>> > concurrently. One of the CPUs wins, and the rest spin forever in
>> > this code:
>>
>> Quick question: can you control the way your watchdog sends the NMI?
>> Like only to BSP rather than broadcast, etc.
>> This is tied to the very unique situation that you cannot really
>> deliver the (second) NMI.
>>
>> Attilio
>>
>>
>> --
>> Peace can only be achieved by understanding - A. Einstein
>
>
> I don't believe that I can, but I can check. In any case I can imagine
> other places where this could be an issue. hwpmc works with NMIs, right?
> So an hwpmc bug could trigger the same kind of issues if two CPUs that
> concurrently called pmc_intr both tripped over the same bug.

Frankly, I think that what you were trying to do is more or less the
right approach, modulo a clean interface. I don't understand why the
"spinlock" wants to spin forever when it can never recover. Stopping
the CPUs that end up in the "spinlock" is perfectly fine. There are
only 2 things to consider:

1) I think we need a new KPI for that: a function in
$arch/include/cpu.h that takes care of stopping a CPU in an MI way,
for example cpu_self_stop(). This needs to be implemented for all the
architectures, but that can be done easily because it will basically
be what cpustop_handler() and similar functions do.
2) The "fake spinlock" path will call that function. The only thing to
debate, IMHO, is whether we want to make that conditional on
stop_scheduler_on_panic or not. To be honest, stopping the CPU seems
the best approach to me in any case, but I'm open to hearing what you
think.

Comments?

Attilio


--
Peace can only be achieved by understanding - A. Einstein
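
To make points 1) and 2) concrete, here is a rough user-space model of the idea, not kernel code. Threads stand in for CPUs taking the watchdog NMI, and everything in it is an assumption for illustration: panic_enter(), nmi_handler(), simulate_watchdog_nmi(), NCPU and the stopped_cpus counter are made-up names, and cpu_self_stop() here only records that a CPU stopped, whereas a real one would park the CPU and never return.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU   4      /* number of simulated CPUs (arbitrary) */
#define NOCPU  (-1)   /* sentinel: nobody owns the panic yet */

static atomic_int panic_cpu = NOCPU;  /* which simulated CPU owns the panic */
static atomic_int stopped_cpus;       /* how many losers stopped themselves */

/*
 * Hypothetical stand-in for the proposed cpu_self_stop().  A real
 * implementation would park the CPU and never return; this model
 * only records that the CPU stopped.
 */
static void
cpu_self_stop(void)
{
	atomic_fetch_add(&stopped_cpus, 1);
}

/* Returns true if 'cpu' won the race and should carry on with the panic. */
static bool
panic_enter(int cpu)
{
	int expected = NOCPU;

	if (atomic_compare_exchange_strong(&panic_cpu, &expected, cpu))
		return (true);
	/* Lost the race: stop instead of spinning on panic_cpu forever. */
	cpu_self_stop();
	return (false);
}

/* Thread body: one simulated CPU receiving the watchdog NMI. */
static void *
nmi_handler(void *arg)
{
	panic_enter((int)(intptr_t)arg);
	return (NULL);
}

/* Deliver the "NMI" to every simulated CPU; return how many stopped. */
static int
simulate_watchdog_nmi(void)
{
	pthread_t td[NCPU];
	int i, rc;

	atomic_store(&panic_cpu, NOCPU);
	atomic_store(&stopped_cpus, 0);
	for (i = 0; i < NCPU; i++) {
		rc = pthread_create(&td[i], NULL, nmi_handler,
		    (void *)(intptr_t)i);
		assert(rc == 0);
		(void)rc;
	}
	for (i = 0; i < NCPU; i++)
		pthread_join(td[i], NULL);
	return (atomic_load(&stopped_cpus));
}
```

Calling simulate_watchdog_nmi() should leave exactly one simulated CPU as the panic owner and report the other NCPU - 1 as stopped. Whether the losing path should depend on stop_scheduler_on_panic is the open question from 2); the model stops unconditionally.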