Skip site navigation (1)Skip section navigation (2)
Date:      Fri, 16 Nov 2012 15:18:21 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        attilio@FreeBSD.org
Cc:        freebsd-hackers@FreeBSD.org, Ryan Stone <rysto32@gmail.com>
Subject:   Re: stop_cpus_hard when multiple CPUs are panicking from an NMI
Message-ID:  <50A63D1D.9090500@FreeBSD.org>
In-Reply-To: <CAJ-FndAB%2B7KRAE91L9634eXgzqgrizwtwCBC7AAg%2B0EX89TEBQ@mail.gmail.com>
References:  <CAFMmRNwb_rxYXHGtXgtcyVUJnFDx5PSeMmA_crBbeV_rtzL9Cg@mail.gmail.com> <50A5F12C.1050902@FreeBSD.org> <CAJ-FndAB%2B7KRAE91L9634eXgzqgrizwtwCBC7AAg%2B0EX89TEBQ@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
on 16/11/2012 14:30 Attilio Rao said the following:
> On Fri, Nov 16, 2012 at 7:54 AM, Andriy Gapon <avg@freebsd.org> wrote:
>> on 16/11/2012 00:58 Ryan Stone said the following:
>>> At work we have some custom watchdog hardware that sends an NMI upon
>>> expiry.  We've modified the kernel to panic when it receives the watchdog
>>> NMI.  I've been trying the "stop_scheduler_on_panic" mode, and I've
>>> discovered that when my watchdog expires, the system gets completely
>>> wedged.  After some digging, I've discovered is that I have multiple CPUs
>>> getting the watchdog NMI and trying to panic concurrently.  One of the CPUs
>>> wins, and the rest spin forever in this code:
>>>
>>> /*
>>>      * We don't want multiple CPU's to panic at the same time, so we
>>>      * use panic_cpu as a simple spinlock.  We have to keep checking
>>>      * panic_cpu if we are spinning in case the panic on the first
>>>      * CPU is canceled.
>>>      */
>>>     if (panic_cpu != PCPU_GET(cpuid))
>>>         while (atomic_cmpset_int(&panic_cpu, NOCPU,
>>>             PCPU_GET(cpuid)) == 0)
>>>             while (panic_cpu != NOCPU)
>>>                 ; /* nothing */
>>>
>>> The system wedges when stop_cpus_hard() is called, which sends NMIs to all
>>> of the other CPUs and waits for them to acknowledge that they are stopped
>>> before returning.  However the CPU will not deliver an NMI to a CPU that is
>>> already handling an NMI, so the other CPUs that got a watchdog NMI and are
>>> spinning will never go into the NMI handler and acknowledge that they are
>>> stopped.
>>
>> I thought about this issue and fixed (in my tree) in a different way:
>> http://people.freebsd.org/~avg/cpu_stop-race.diff
>>
>> The need for spinlock_enter in the patch in not entirely clear.
>> The main idea is that a CPU which calls cpu_stop and loses a race should
>> voluntary enter cpustop_handler.
>> I am also not sure about MI-cleanness of this patch.
> 
> It is similar to what I propose but with some differences:
> - It is not clean from MI perspective

OK.

> - I don't think we need to treact it specially, I would just
> unconditionally stop all the CPUs entering in the "spinlock zone",
> making the patch simpler.

I couldn't understand this part.

> So I guess you are really fine with the proposal I made?

I definitely agree with with the MI cpustop_handler part.  I couldn't understand
the rest of the proposal.

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?50A63D1D.9090500>