Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 25 Nov 2012 19:05:43 +0100
From:      Attilio Rao <attilio@freebsd.org>
To:        Andriy Gapon <avg@freebsd.org>
Cc:        freebsd-hackers@freebsd.org, Ryan Stone <rysto32@gmail.com>
Subject:   Re: stop_cpus_hard when multiple CPUs are panicking from an NMI
Message-ID:  <CAJ-FndA5SfW7EXLX8DQezYPxBeavv0ET%2BfTMj%2BPhsQhnKK=X9g@mail.gmail.com>
In-Reply-To: <50B22DA0.9080207@FreeBSD.org>
References:  <CAFMmRNwb_rxYXHGtXgtcyVUJnFDx5PSeMmA_crBbeV_rtzL9Cg@mail.gmail.com> <50A5F12C.1050902@FreeBSD.org> <CAJ-FndAB%2B7KRAE91L9634eXgzqgrizwtwCBC7AAg%2B0EX89TEBQ@mail.gmail.com> <50A63D1D.9090500@FreeBSD.org> <CAJ-FndDC1QCytXDJqVkism_5VoLNo_OzZxNEQ9NHx63HC=GTNg@mail.gmail.com> <50A65208.4050804@FreeBSD.org> <CAJ-FndADxJtYPX2-cQnqJoLhzYtJMidG1DPPY%2B6Dtf4rVw_zrw@mail.gmail.com> <50B21545.5060807@FreeBSD.org> <CAJ-FndDVt8VRA4kQipT5Lm%2Bo2KRRum9NKWorfeAucwR=hJ0uDw@mail.gmail.com> <50B22DA0.9080207@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
On Sun, Nov 25, 2012 at 3:39 PM, Andriy Gapon <avg@freebsd.org> wrote:
> on 25/11/2012 16:01 Attilio Rao said the following:
>> On Sun, Nov 25, 2012 at 12:55 PM, Andriy Gapon <avg@freebsd.org> wrote:
>>> on 25/11/2012 14:29 Attilio Rao said the following:
>>>> I think the patch you propose makes such effects even worse, because
>>>> it disables interrupts in generic_stop_cpus().
>>>> What I suggest to do, is the following:
>>>> - The CPU which wins the race for generic_stop_cpus also signals the
>>>> CPUs it is willing to stop on a global mask
>>>> - Another CPU entering generic_stop_cpus() and loosing the race,
>>>> checks the mask of cpus which might be stopped and stops itself if
>>>> necessary (ie. not yet done). We must be careful with races here, but
>>>> I'm confindent this can be done easily enough.
>>>
>>> I think that you either misunderstood my patch or I misunderstand your
>>> suggestion, because my patch does exactly what you wrote above.
>>
>> The patch is someway incomplete:
>> - I don't think that we need specific checks in cpustop_handler() (and
>> if you have added them to prevent races, I don't think they are
>> enough, see below)
>
> The check is to cover the following scenario:
> - two CPUs race in hard-stop scenario, and both are in NMI contexts

Please consider also another type of (similar race): two CPUs race in
(normal) stop scenario and the loser is with interrupt disabled. I
don't this will be completely fatal, but I'm sure it can lead to
issues (double stop, etc.).

> - one CPU wins and the other does spinning right in generic_stop_cpus
> - NMI for the spinning CPU is blocked and remains pending
> - the winning CPU releases all other CPUs
> - the spinning CPU exits the NMI context and get the NMI which was pending
>
>> - setting of "stopping_cpus" map must happen atomically/before the
>> stopper_cpu cpuid setting, otherwise some CPUs may end up using a NULL
>> mask in the check
>
> Not NULL, but empty or stale.  But a very good point, I agree.
> The logic must be redone.
>
>> - Did you consider the races about when a stop and restart request
>> happen just after the CPU_ISSET() check? I think CPUs can deadlock
>> there.
>
> Yeah, good point again.  This seems to be a different side of the issue above.
> stopping_cpus is probably a bad idea.
>
>> - I'm very doubious about the spinlock_enter() stuff, I think I can
>> just make the problem worse atm.
>
> Well, this is where I disagree.  I think that cpu_stop must already be called in
> context which effectively disable pre-emption and interrupts.
> The added spinlock_enter() stuff is kind of a safety belt to make things more
> explicit, but it could be changed into some sort of an assert if that's possible.

Maybe it can be come really safe once we take care of all the races
involved and reported above. I reserve to suspend my judgement until
we don't care of the other races.

>
>> However you are right, the concept of your patch is the same I really
>> wanted to get, we maybe need to just lift it up a bit.
>
> I agree.
>
> To add some gas to the fire.  Do you recall my wish to drop the mask parameter
> from the stop calls?  If that was done then it would be simpler to handle these
> things.  In that case only "stopper_cpu" ID (master/winner) would be needed.

If you really want to do something like that please rename
s/generic_stop_cpus/generic_stop_butself() or similar convention and I
may be not opposed to it.

Thanks,
Attilio


-- 
Peace can only be achieved by understanding - A. Einstein



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-FndA5SfW7EXLX8DQezYPxBeavv0ET%2BfTMj%2BPhsQhnKK=X9g>