Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 25 Nov 2012 16:39:28 +0200
From:      Andriy Gapon <avg@FreeBSD.org>
To:        attilio@FreeBSD.org
Cc:        freebsd-hackers@FreeBSD.org, Ryan Stone <rysto32@gmail.com>
Subject:   Re: stop_cpus_hard when multiple CPUs are panicking from an NMI
Message-ID:  <50B22DA0.9080207@FreeBSD.org>
In-Reply-To: <CAJ-FndDVt8VRA4kQipT5Lm%2Bo2KRRum9NKWorfeAucwR=hJ0uDw@mail.gmail.com>
References:  <CAFMmRNwb_rxYXHGtXgtcyVUJnFDx5PSeMmA_crBbeV_rtzL9Cg@mail.gmail.com> <50A5F12C.1050902@FreeBSD.org> <CAJ-FndAB%2B7KRAE91L9634eXgzqgrizwtwCBC7AAg%2B0EX89TEBQ@mail.gmail.com> <50A63D1D.9090500@FreeBSD.org> <CAJ-FndDC1QCytXDJqVkism_5VoLNo_OzZxNEQ9NHx63HC=GTNg@mail.gmail.com> <50A65208.4050804@FreeBSD.org> <CAJ-FndADxJtYPX2-cQnqJoLhzYtJMidG1DPPY%2B6Dtf4rVw_zrw@mail.gmail.com> <50B21545.5060807@FreeBSD.org> <CAJ-FndDVt8VRA4kQipT5Lm%2Bo2KRRum9NKWorfeAucwR=hJ0uDw@mail.gmail.com>

next in thread | previous in thread | raw e-mail | index | archive | help
on 25/11/2012 16:01 Attilio Rao said the following:
> On Sun, Nov 25, 2012 at 12:55 PM, Andriy Gapon <avg@freebsd.org> wrote:
>> on 25/11/2012 14:29 Attilio Rao said the following:
>>> I think the patch you propose makes such effects even worse, because
>>> it disables interrupts in generic_stop_cpus().
>>> What I suggest to do, is the following:
>>> - The CPU which wins the race for generic_stop_cpus also signals the
>>> CPUs it is willing to stop on a global mask
>>> - Another CPU entering generic_stop_cpus() and loosing the race,
>>> checks the mask of cpus which might be stopped and stops itself if
>>> necessary (ie. not yet done). We must be careful with races here, but
>>> I'm confindent this can be done easily enough.
>>
>> I think that you either misunderstood my patch or I misunderstand your
>> suggestion, because my patch does exactly what you wrote above.
> 
> The patch is someway incomplete:
> - I don't think that we need specific checks in cpustop_handler() (and
> if you have added them to prevent races, I don't think they are
> enough, see below)

The check is to cover the following scenario:
- two CPUs race in hard-stop scenario, and both are in NMI contexts
- one CPU wins and the other does spinning right in generic_stop_cpus
- NMI for the spinning CPU is blocked and remains pending
- the winning CPU releases all other CPUs
- the spinning CPU exits the NMI context and get the NMI which was pending

> - setting of "stopping_cpus" map must happen atomically/before the
> stopper_cpu cpuid setting, otherwise some CPUs may end up using a NULL
> mask in the check

Not NULL, but empty or stale.  But a very good point, I agree.
The logic must be redone.

> - Did you consider the races about when a stop and restart request
> happen just after the CPU_ISSET() check? I think CPUs can deadlock
> there.

Yeah, good point again.  This seems to be a different side of the issue above.
stopping_cpus is probably a bad idea.

> - I'm very doubious about the spinlock_enter() stuff, I think I can
> just make the problem worse atm.

Well, this is where I disagree.  I think that cpu_stop must already be called in
context which effectively disable pre-emption and interrupts.
The added spinlock_enter() stuff is kind of a safety belt to make things more
explicit, but it could be changed into some sort of an assert if that's possible.

> However you are right, the concept of your patch is the same I really
> wanted to get, we maybe need to just lift it up a bit.

I agree.

To add some gas to the fire.  Do you recall my wish to drop the mask parameter
from the stop calls?  If that was done then it would be simpler to handle these
things.  In that case only "stopper_cpu" ID (master/winner) would be needed.

> In the while I also double-checked suspended_cpus and I don't think
> there are real showstoppers to have it in stopped_cpus map.

-- 
Andriy Gapon



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?50B22DA0.9080207>