Skip site navigation (1)Skip section navigation (2)
Date:      Tue, 6 Dec 2011 16:21:47 +0100
From:      Attilio Rao <attilio@freebsd.org>
To:        Andriy Gapon <avg@freebsd.org>
Cc:        freebsd-current@freebsd.org, Konstantin Belousov <kib@freebsd.org>
Subject:   Re: Stop scheduler on panic
Message-ID:  <CAJ-FndBop%2Bi-qGGhOSxFZ9T6FYp6HGaivOCKp-NFEzfTJ9aXdw@mail.gmail.com>
In-Reply-To: <4ED951E0.9000000@FreeBSD.org>
References:  <20111113083215.GV50300@deviant.kiev.zoral.com.ua> <201112011349.50502.jhb@freebsd.org> <4ED7E6B0.30400@FreeBSD.org> <201112011553.34432.jhb@freebsd.org> <4ED7F4BC.3080206@FreeBSD.org> <4ED855E6.20207@FreeBSD.org> <4ED8A306.9020801@FreeBSD.org> <4ED8F1C1.7010206@FreeBSD.org> <CAJ-FndCBXXGG%2BihS_rVfM5TqcopHABg80U0my9PxguYY8QBD=Q@mail.gmail.com> <4ED91B8D.2080808@FreeBSD.org> <4ED951E0.9000000@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
2011/12/2 Andriy Gapon <avg@freebsd.org>:
> on 02/12/2011 20:40 John Baldwin said the following:
>> On 12/2/11 12:18 PM, Attilio Rao wrote:
>>> 2011/12/2 John Baldwin<jhb@freebsd.org>:
>>>> On 12/2/11 5:05 AM, Andriy Gapon wrote:
>>>>>
>>>>> on 02/12/2011 06:36 John Baldwin said the following:
>>>>>>
>>>>>> Ah, ok (I had thought SCHEDULER_STOPPED was going to always be true =
when
>>>>>> kdb was
>>>>>> active). =C2=A0But I think these two changes should cover critical_e=
xit() ok.
>>>>>>
>>>>>
>>>>> I attempted to start a discussion about this a few times already :-)
>>>>> Should we treat kdb context the same as SCHEDULER_STOPPED context (in=
 the
>>>>> current definition) ? =C2=A0That is, skip all locks in the same fashi=
on?
>>>>> There are pros and contras.
>>>>
>>>>
>>>> kdb should not block on locks, no. =C2=A0Most debugger commands should=
 not go
>>>> near locks anyway unless they are intended to carefully modify the exi=
sting
>>>> system in a safe manner (such as the 'kill' command which should only =
be
>>>> using try locks and fail if it cannot safely post the signal).
>>>
>>> The biggest problem to KDB as the same as panic is that doing proper
>>> 'continue' is impossible.
>>> One of the features of the 'skip-locking' path is that it doesn't take
>>> into account fast locking paths, where sometimes the lock can succeed
>>> and other fails and you don't know about them. Also the restarted CPUs
>>> can find corrupted datas (as they can be arbitrarely updated), I'm
>>> sure it is too much panic prone.
>>
>> Yes, my thought is that kdb commands, etc. should be using dedicated rou=
tines
>> that do not use locks whenever possible. =C2=A0The problem of a user
>> calling an arbitrary routine is not solvable (so I don't think we should=
 try to
>> solve that, you use 'call' at your own risk), but built-in commands shou=
ld
>> explicitly either 1) not use locking, or 2) only use try locks and fail =
out
>> cleanly (including dropping any try locks acquired) if a try fails. =C2=
=A0Now, that's
>> an ideal view, I don't know how close we are to that in practice or if i=
t is a
>> realistically attainable goal.
>>
>
>
> I agree with what Attilio and you say. =C2=A0Initially it was tempting fo=
r me to
> apply the same SCHEDULER_STOPPED stopped medicine to the kdb_active conte=
xt, but
> after trying to deal with kdb_active x SCHEDULER_STOPPED x ukbd situation=
 I
> really changed my mind.
>
>
> I would classify the code that can be called in kdb_active context as fol=
lows:
> o debugger code proper (kdb, ddb, gdb stub, etc) - this obviously must no=
t
> (doesn't have to) use any locking
>
> o code that can be invoked via 'call' command - this is essentially any c=
ode and
> I don't think that it can/should do anything special for the kdb_active c=
ontext [*]
>
> o debugger helper routines - those that do something trivial should not a=
cquire
> any locks; those that access shared resources should try the relevant loc=
ks and
> bail out if a resource can be in inconsistent state, or should be equippe=
d to
> deal correctly with such a state; this is the same as what you say above
>
> o common code that the debuggers have to use - most obviously this is con=
sole
> code and drivers that serve a particular console; on one hand those drive=
rs can
> have a non-trivial state that must be lock protected during normal operat=
ion, on
> the other hand the debugger must disregard those locks and grab its conso=
le;
> this is the most complex case in my opinion.

Thanks for summarizing this up.
However, please note that code in 2 and 4 entries may have the same
issues or being the same thing, in practice.

Anyway, I'm thinking now that if we really want to stop CPUs when
entering KDB (and I think we do) functions at 2 and 4 should basically
just be totally lockless or in general being totally re-entrant
because when we restart CPUs we don't really want them finding datas
to be corrupted. Also, skipping locking, is totally broken for this
very specific reason.

Functions at point 2 and 4 should be totally lockless then and
possibly just work on read mode. For point 2, specifically, I think we
need an explicit KPI to define functions within the subsystem
themselves (something like DB_SHOW_COMMAND()) which marks undoublty
functions to be called within ddb (the only KDB backend we implement
right now) and likely for functions at point 4 we need to find a way
to stress their belonging to the KDB area.

Attilio


--=20
Peace can only be achieved by understanding - A. Einstein



Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?CAJ-FndBop%2Bi-qGGhOSxFZ9T6FYp6HGaivOCKp-NFEzfTJ9aXdw>