Date: Thu, 15 Nov 2012 17:58:25 -0500
From: Ryan Stone <rysto32@gmail.com>
To: "freebsd-hackers@freebsd.org" <freebsd-hackers@freebsd.org>
Subject: stop_cpus_hard when multiple CPUs are panicking from an NMI
Message-ID: <CAFMmRNwb_rxYXHGtXgtcyVUJnFDx5PSeMmA_crBbeV_rtzL9Cg@mail.gmail.com>
At work we have some custom watchdog hardware that sends an NMI upon
expiry. We've modified the kernel to panic when it receives the
watchdog NMI. I've been trying the "stop_scheduler_on_panic" mode, and
I've discovered that when my watchdog expires, the system gets
completely wedged. After some digging, I've discovered that I have
multiple CPUs getting the watchdog NMI and trying to panic
concurrently. One of the CPUs wins, and the rest spin forever in this
code:

	/*
	 * We don't want multiple CPU's to panic at the same time, so we
	 * use panic_cpu as a simple spinlock.  We have to keep checking
	 * panic_cpu if we are spinning in case the panic on the first
	 * CPU is canceled.
	 */
	if (panic_cpu != PCPU_GET(cpuid))
		while (atomic_cmpset_int(&panic_cpu, NOCPU,
		    PCPU_GET(cpuid)) == 0)
			while (panic_cpu != NOCPU)
				; /* nothing */

The system wedges when stop_cpus_hard() is called, which sends NMIs to
all of the other CPUs and waits for them to acknowledge that they are
stopped before returning. However, an NMI will not be delivered to a
CPU that is already handling an NMI, so the other CPUs that got a
watchdog NMI and are spinning will never go into the NMI handler and
acknowledge that they are stopped.

I've been able to work around this with the following hideous hack:

--- kern_shutdown.c	2012-08-17 10:25:02.000000000 -0400
+++ kern_shutdown.c	2012-11-15 17:04:10.000000000 -0500
@@ -658,11 +658,15 @@
 	 * panic_cpu if we are spinning in case the panic on the first
 	 * CPU is canceled.
 	 */
-	if (panic_cpu != PCPU_GET(cpuid))
+	if (panic_cpu != PCPU_GET(cpuid)) {
 		while (atomic_cmpset_int(&panic_cpu, NOCPU,
-		    PCPU_GET(cpuid)) == 0)
+		    PCPU_GET(cpuid)) == 0) {
+			atomic_set_int(&stopped_cpus, PCPU_GET(cpumask));
 			while (panic_cpu != NOCPU)
 				; /* nothing */
+		}
+		atomic_clear_int(&stopped_cpus, PCPU_GET(cpumask));
+	}
 
 	if (stop_scheduler_on_panic) {
 		if (panicstr == NULL && !kdb_active)

But I'm hoping that somebody has some ideas on a better way to fix
this kind of problem.
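To make the failure mode concrete, here is a small user-space toy model of
the acknowledgement wait. It is only a sketch, not kernel code: the type
cpumask_model_t and the functions stop_wait_completes() and
model_examples() are hypothetical names invented for illustration, and the
model assumes that every CPU not already inside an NMI handler takes the
stop NMI and acknowledges it, while a CPU spinning inside the watchdog NMI
handler never does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model, not kernel code: one bit per CPU. */
typedef uint32_t cpumask_model_t;

/*
 * Returns true if the panicking CPU's wait for stop acknowledgements
 * would finish.
 *
 * all_cpus    - every CPU in the modelled system
 * winner      - bit of the CPU that won the panic_cpu spinlock
 * in_nmi      - CPUs already inside the watchdog NMI handler
 * self_marked - true if spinning CPUs set their own bit in the stopped
 *               mask, as the workaround in the diff above does
 */
bool
stop_wait_completes(cpumask_model_t all_cpus, cpumask_model_t winner,
    cpumask_model_t in_nmi, bool self_marked)
{
	cpumask_model_t others = all_cpus & ~winner;
	/* Losers of the panic race: in an NMI handler, spinning on panic_cpu. */
	cpumask_model_t spinners = in_nmi & others;
	/* Assumption: CPUs not in an NMI handler take the stop NMI and ack. */
	cpumask_model_t stopped = others & ~spinners;

	if (self_marked)
		stopped |= spinners;

	/* The sender waits until every other CPU is marked stopped. */
	return (stopped == others);
}

/* Three scenarios on a modelled 4-CPU system (mask 0xF), CPU 0 panicking. */
void
model_examples(void)
{
	/* Only CPU 0 got the watchdog NMI: everyone else acknowledges. */
	assert(stop_wait_completes(0xF, 0x1, 0x1, false));
	/* CPU 1 also got the watchdog NMI and spins: the wait never ends. */
	assert(!stop_wait_completes(0xF, 0x1, 0x3, false));
	/* Same race, but the spinner marks itself stopped: the wait ends. */
	assert(stop_wait_completes(0xF, 0x1, 0x3, true));
}
```

Calling model_examples() from a test program exercises the three cases. The
model deliberately ignores timing and the later atomic_clear_int() in the
diff; it only shows why the stopped mask can never fill in when a spinner
cannot acknowledge.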