Date:      Thu, 13 Dec 2012 20:17:31 +0000
From:      Attilio Rao <attilio@freebsd.org>
To:        Andriy Gapon <avg@freebsd.org>
Cc:        svn-src-head@freebsd.org, svn-src-all@freebsd.org, src-committers@freebsd.org
Subject:   Re: svn commit: r243515 - head/sys/kern
Message-ID:  <CAJ-FndAsjZSK9XGFhHvdcD5135Xo4acybPDtZ0J9jXeAMpH5%2BQ@mail.gmail.com>
In-Reply-To: <50C9B525.2060503@FreeBSD.org>
References:  <201211251422.qAPEM8BV074656@svn.freebsd.org> <CAJ-FndCGe=DtqKxRe0YXV0GJrf4CV6MX9B1MR-Uyy6A3hpongA@mail.gmail.com> <50C9B525.2060503@FreeBSD.org>

On Thu, Dec 13, 2012 at 10:59 AM, Andriy Gapon <avg@freebsd.org> wrote:
> on 09/12/2012 19:27 Attilio Rao said the following:
>> On Sun, Nov 25, 2012 at 2:22 PM, Andriy Gapon <avg@freebsd.org> wrote:
>>> Author: avg
>>> Date: Sun Nov 25 14:22:08 2012
>>> New Revision: 243515
>>> URL: http://svnweb.freebsd.org/changeset/base/243515
>>>
>>> Log:
>>>   remove stop_scheduler_on_panic knob
>>>
>>>   There has not been any complaints about the default behavior, so there
>>>   is no need to keep a knob that enables the worse alternative.
>>>
>>>   Now that the hard-stopping of other CPUs is the only behavior, the panic_cpu
>>>   spinlock-like logic can be dropped, because only a single CPU is
>>>   supposed to win stop_cpus_hard(other_cpus) race and proceed past that
>>>   call.
>>
>> While this is true for the sane case, for the case reported by Ryan
>> this still breaks.
>
> Yes.  I haven't gotten around to fixing Ryan's problem yet.
> But this commit should reduce the number of places where changes have to be made.
> In fact, I think that only stop_cpus_X would have to be fixed now.
>
>> In fact, imagine CPU0 (winner) and CPU1 (loser) both panicking. CPU0
>> wins and then sets stopping_cpu. When the deadlock happens in the
>> spinning loop, because of the generic_stop_cpus() logic CPU0 won't
>> deadlock and will correctly continue, but the problem is that it sets
>> stopping_cpu back to NOCPU, letting CPU1 continue too and then
>> deadlock.
>>
>> At a minimum, what I think should happen is to have the check in
>> panic() as it was before this change, but with the addition I outlined
>> (thus we need to generalize cpustop_handler()). However, it seems to me
>> that generic_stop_cpus() may still be broken by this and we eventually
>> need to fix it.
>>
>> I would then revert this part of the patch and fix it appropriately.
>> Later we can better discuss the generic_stop_cpus() similar race.
>
> I actually see this change and Ryan's problem as orthogonal issues.
> My opinion is let's just fix generic_stop_cpus().

Right, but as I said, for the time being we can at least have correct
panic() semantics, take the time needed to fix generic_stop_cpus()
properly, and then fold the panic() fix into it.
Right now the mechanism is still broken in panic(), and the fix is
very easy, so we should just do it.
This will also help vendors like Sandvine, which may have hit this same bug.
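
To illustrate the kind of check meant here, below is a minimal userspace
sketch of a panic_cpu-style gate, written with C11 atomics. It is not the
kernel code and not the proposed patch: panic_gate_try_enter() and the
local panic_cpu variable are hypothetical names, NOCPU is redefined
locally, and a real loser would park itself rather than return. The point
it tries to show is that the owner is claimed once and never reset to
NOCPU, so a second panicking CPU cannot proceed after the winner.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NOCPU (-1)	/* hypothetical "no owner" value for this sketch */

/* Hypothetical stand-in for a panic_cpu ownership variable. */
atomic_int panic_cpu = NOCPU;

/*
 * Returns true if `cpu` owns the panic path: either it is the first
 * claimant, or it already owns the gate (recursive panic on the winner).
 * Returns false for a loser; in a kernel the loser would spin or wait
 * to be stopped instead of returning. The gate is never released.
 */
bool
panic_gate_try_enter(int cpu)
{
	int expected = NOCPU;

	if (atomic_load(&panic_cpu) == cpu)
		return (true);
	/* Succeeds only while nobody has claimed the gate yet. */
	return (atomic_compare_exchange_strong(&panic_cpu, &expected, cpu));
}

/* Small demonstration: CPU0 wins, CPU1 loses, CPU0 may re-enter. */
void
panic_gate_demo(void)
{
	assert(panic_gate_try_enter(0));
	assert(!panic_gate_try_enter(1));
	assert(panic_gate_try_enter(0));
}
```

Because ownership is sticky, the race described above (the winner resetting
the variable and releasing the loser) cannot occur in this sketch; whether
that is the right shape for the real fix is a separate question.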

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


