Date: Fri, 16 Nov 2012 15:18:21 +0200 From: Andriy Gapon <avg@FreeBSD.org> To: attilio@FreeBSD.org Cc: freebsd-hackers@FreeBSD.org, Ryan Stone <rysto32@gmail.com> Subject: Re: stop_cpus_hard when multiple CPUs are panicking from an NMI Message-ID: <50A63D1D.9090500@FreeBSD.org> In-Reply-To: <CAJ-FndAB%2B7KRAE91L9634eXgzqgrizwtwCBC7AAg%2B0EX89TEBQ@mail.gmail.com> References: <CAFMmRNwb_rxYXHGtXgtcyVUJnFDx5PSeMmA_crBbeV_rtzL9Cg@mail.gmail.com> <50A5F12C.1050902@FreeBSD.org> <CAJ-FndAB%2B7KRAE91L9634eXgzqgrizwtwCBC7AAg%2B0EX89TEBQ@mail.gmail.com>
next in thread | previous in thread | raw e-mail | index | archive | help
on 16/11/2012 14:30 Attilio Rao said the following: > On Fri, Nov 16, 2012 at 7:54 AM, Andriy Gapon <avg@freebsd.org> wrote: >> on 16/11/2012 00:58 Ryan Stone said the following: >>> At work we have some custom watchdog hardware that sends an NMI upon >>> expiry. We've modified the kernel to panic when it receives the watchdog >>> NMI. I've been trying the "stop_scheduler_on_panic" mode, and I've >>> discovered that when my watchdog expires, the system gets completely >>> wedged. After some digging, I've discovered is that I have multiple CPUs >>> getting the watchdog NMI and trying to panic concurrently. One of the CPUs >>> wins, and the rest spin forever in this code: >>> >>> /* >>> * We don't want multiple CPU's to panic at the same time, so we >>> * use panic_cpu as a simple spinlock. We have to keep checking >>> * panic_cpu if we are spinning in case the panic on the first >>> * CPU is canceled. >>> */ >>> if (panic_cpu != PCPU_GET(cpuid)) >>> while (atomic_cmpset_int(&panic_cpu, NOCPU, >>> PCPU_GET(cpuid)) == 0) >>> while (panic_cpu != NOCPU) >>> ; /* nothing */ >>> >>> The system wedges when stop_cpus_hard() is called, which sends NMIs to all >>> of the other CPUs and waits for them to acknowledge that they are stopped >>> before returning. However the CPU will not deliver an NMI to a CPU that is >>> already handling an NMI, so the other CPUs that got a watchdog NMI and are >>> spinning will never go into the NMI handler and acknowledge that they are >>> stopped. >> >> I thought about this issue and fixed (in my tree) in a different way: >> http://people.freebsd.org/~avg/cpu_stop-race.diff >> >> The need for spinlock_enter in the patch in not entirely clear. >> The main idea is that a CPU which calls cpu_stop and loses a race should >> voluntary enter cpustop_handler. >> I am also not sure about MI-cleanness of this patch. > > It is similar to what I propose but with some differences: > - It is not clean from MI perspective OK. > - I don't think we need to treact it specially, I would just > unconditionally stop all the CPUs entering in the "spinlock zone", > making the patch simpler. I couldn't understand this part. > So I guess you are really fine with the proposal I made? I definitely agree with with the MI cpustop_handler part. I couldn't understand the rest of the proposal. -- Andriy Gapon
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?50A63D1D.9090500>