Skip site navigation (1)Skip section navigation (2)
Date:      Sun, 21 Aug 2011 02:19:58 +0100
From:      "Steven Hartland" <killing@multiplay.co.uk>
To:        "Andriy Gapon" <avg@FreeBSD.org>, <jamie@freebsd.org>
Cc:        freebsd-jail@FreeBSD.org, freebsd-stable@FreeBSD.org
Subject:   Re: debugging frequent kernel panics on 8.2-RELEASE
Message-ID:  <3D52B19B71CA49A4BB3BCCDF25961F46@multiplay.co.uk>
References:  eBSD.org><82E865FBA30747078AF6EE3C1701F973@multiplay.co.uk><4E4FE55A.9000101@FreeBSD.org> <B367F6CE151B4B6A8B0E048022D7F8AB@multiplay.co.uk><D34B2BC140E14E8A8D450474CCC39CB8@multiplay.co.uk> <4E501A6A.3030801@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
----- Original Message ----- 
From: "Andriy Gapon" <avg@FreeBSD.org>

> on 20/08/2011 23:24 Steven Hartland said the following:
>> ----- Original Message ----- From: "Steven Hartland"
>>> Looking through the code I believe I may have noticed a scenario which could
>>> trigger the problem.
>>>
>>> Given the following code:-
>>>
>>> static void
>>> prison_deref(struct prison *pr, int flags)
>>> {
>>>    struct prison *ppr, *tpr;
>>>    int vfslocked;
>>>
>>>    if (!(flags & PD_LOCKED))
>>>        mtx_lock(&pr->pr_mtx);
>>>    /* Decrement the user references in a separate loop. */
>>>    if (flags & PD_DEUREF) {
>>>        for (tpr = pr;; tpr = tpr->pr_parent) {
>>>            if (tpr != pr)
>>>                mtx_lock(&tpr->pr_mtx);
>>>            if (--tpr->pr_uref > 0)
>>>                break;
>>>            KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
>>>            mtx_unlock(&tpr->pr_mtx);
>>>        }
>>>        /* Done if there were only user references to remove. */
>>>        if (!(flags & PD_DEREF)) {
>>>            mtx_unlock(&tpr->pr_mtx);
>>>            if (flags & PD_LIST_SLOCKED)
>>>                sx_sunlock(&allprison_lock);
>>>            else if (flags & PD_LIST_XLOCKED)
>>>                sx_xunlock(&allprison_lock);
>>>            return;
>>>        }
>>>        if (tpr != pr) {
>>>            mtx_unlock(&tpr->pr_mtx);
>>>            mtx_lock(&pr->pr_mtx);
>>>        }
>>>    }
>>>
>>> If you take a scenario of a simple one level prison setup running a single
>>> process
>>> where a prison has just been stopped.
>>>
>>> In the above code pr_uref of the processes prison is decremented. As this is the
>>> last process then pr_uref will hit 0 and the loop continues instead of breaking
>>> early.
>>>
>>> Now at the end of the loop iteration the mtx is unlocked so other process can
>>> now manipulate the jail, this is where I think the problem may be.
>>>
>>> If we now have another process come in and attach to the jail but then instantly
>>> exit, this process may allow another kernel thread to hit this same bit of code
>>> and so two process for the same prison get into the section which decrements
>>> prison0's pr_uref, instead of only one.
>>>
>>> In essence I think we can get the following flow where 1# = process1
>>> and 2# = process2
>>> 1#1. prison1.pr_uref = 1 (single process jail)
>>> 1#2. prison_deref( prison1,...
>>> 1#3. prison1.pr_uref-- (prison1.pr_uref = 0)
>>> 1#3. prison1.mtx_unlock <-- this now allows others to change prison1.pr_uref
>>> 1#3. prison0.pr_uref--
>>> 2#1. process1.attach( prison1 ) (prison1.pr_uref = 1)
>>> 2#2. process1.exit
>>> 2#3. prison_deref( prison1,...
>>> 2#4. prison1.pr_uref-- (prison1.pr_uref = 0)
>>> 2#5. prison1.mtx_unlock <-- this now allows others to change prison1.pr_uref
>>> 2#5. prison0.pr_uref-- (prison1.pr_ref has now been decremented twice by prison1)
>>>
>>> It seems like the action on the parent prison to decrement the pr_uref is
>>> happening too early, while the jail can still be used and without the lock on
>>> the child jails mtx, so causing a race condition.
>>>
>>> I think the fix is to the move the decrement of parent prison pr_uref's down
>>> so it only takes place if the jail is "really" being removed. Either that or
>>> to change the locking semantics so that once the lock is aquired in this
>>> prison_deref its not unlocked until the function completes.
>>>
>>> What do people think?
>> 
>> After reviewing the changes to prison_deref in commit which added hierarchical
>> jails, the removal of the lock by the inital loop on the passed in prison may
>> be unintentional.
>> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_jail.c.diff?r1=1.101;r2=1.102;f=h
>> 
>> 
>> If so the following may be all that's needed to fix this issue:-
>> 
>> diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
>> --- sys/kern/kern_jail.c.orig   2011-08-20 21:17:14.856618854 +0100
>> +++ sys/kern/kern_jail.c        2011-08-20 21:18:35.307201425 +0100
>> @@ -2455,7 +2455,8 @@
>>                        if (--tpr->pr_uref > 0)
>>                                break;
>>                        KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
>> -                       mtx_unlock(&tpr->pr_mtx);
>> +                       if (tpr != pr)
>> +                               mtx_unlock(&tpr->pr_mtx);
>>                }
>>                /* Done if there were only user references to remove. */
>>                if (!(flags & PD_DEREF)) {
> 
> Not sure if this would fly as is - please double check the later block where
> pr->pr_mtx is re-locked.

Your right, and its actually more complex than that. Although changing it to
not unlock in the middle of prison_deref fixes that race condition it doesn't
prevent pr_uref being incorrectly decremented each time the jail gets into
the dying state, which is really the problem we are seeing.

If hierarchical prisons are used there seems to be an additional problem
where the counter of all prisons in the hierarchy are decremented, but as
far as I can tell only the immediate parent is ever incremented, so another
reference problem there as well I think.

The following patch I believe fixes both of these issues.

I've testing with debug added and confirmed prison0's pr_uref is maintained
correctly even when a jail hits dying state multiple times.

It essentially reverts the changes to the "if (flags & PD_DEUREF)" by
192895 and moves it to after the jail has been actually removed.

diff -u sys/kern/kern_jail.c.orig sys/kern/kern_jail.c
--- sys/kern/kern_jail.c.orig   2011-08-20 21:17:14.856618854 +0100
+++ sys/kern/kern_jail.c        2011-08-21 01:56:58.429894825 +0100
@@ -2449,27 +2449,16 @@
                mtx_lock(&pr->pr_mtx);
        /* Decrement the user references in a separate loop. */
        if (flags & PD_DEUREF) {
-               for (tpr = pr;; tpr = tpr->pr_parent) {
-                       if (tpr != pr)
-                               mtx_lock(&tpr->pr_mtx);
-                       if (--tpr->pr_uref > 0)
-                               break;
-                       KASSERT(tpr != &prison0, ("prison0 pr_uref=0"));
-                       mtx_unlock(&tpr->pr_mtx);
-               }
+               pr->pr_uref--;
                /* Done if there were only user references to remove. */
                if (!(flags & PD_DEREF)) {
-                       mtx_unlock(&tpr->pr_mtx);
+                       mtx_unlock(&pr->pr_mtx);
                        if (flags & PD_LIST_SLOCKED)
                                sx_sunlock(&allprison_lock);
                        else if (flags & PD_LIST_XLOCKED)
                                sx_xunlock(&allprison_lock);
                        return;
                }
-               if (tpr != pr) {
-                       mtx_unlock(&tpr->pr_mtx);
-                       mtx_lock(&pr->pr_mtx);
-               }
        }
 
        for (;;) {
@@ -2525,6 +2514,8 @@
                /* Removing a prison frees a reference on its parent. */
                pr = ppr;
                mtx_lock(&pr->pr_mtx);
+               /* Ensure user reference added on create is removed */
+               pr->pr_uref--;
                flags = PD_DEREF;
        }
 }

Jamie from what I can tell you where the original committer of hierarchical
prisons and this in the following svn change set so would really appreciate
your feedback on this.
http://svnweb.freebsd.org/base?view=revision&revision=192895

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?3D52B19B71CA49A4BB3BCCDF25961F46>