Skip site navigation (1)Skip section navigation (2)
Date:      Thu, 18 Aug 2011 00:15:56 +0100
From:      "Steven Hartland" <killing@multiplay.co.uk>
To:        "Andriy Gapon" <avg@FreeBSD.org>, <freebsd-jail@FreeBSD.org>, <freebsd-hackers@FreeBSD.org>
Cc:        freebsd-stable@FreeBSD.org
Subject:   Re: debugging frequent kernel panics on 8.2-RELEASE
Message-ID:  <4019027648B5493AAC4B654BD821DE88@multiplay.co.uk>
References:  <47F0D04ADF034695BC8B0AC166553371@multiplay.co.uk><A71C3ACF01EC4D36871E49805C1A5321@multiplay.co.uk><4E4380C0.7070908@FreeBSD.org><EBC06A239BAB4B3293C28D793329F9CA@multiplay.co.uk><4E43E272.1060204@FreeBSD.org><62BF25D0ED914876BEE75E2ADF28DDF7@multiplay.co.uk><4E440865.1040500@FreeBSD.org><6F08A8DE780545ADB9FA93B0A8AA4DA1@multiplay.co.uk><4E441314.6060606@FreeBSD.org><2C4B0D05C8924F24A73B56EA652FA4B0@multiplay.co.uk><4E48D967.9060804@FreeBSD.org><9D034F992B064E8092E5D1D249B3E959@multiplay.co.uk><4E490DAF.1080009@FreeBSD.org><796FD5A096DE4558B57338A8FA1E125B@multiplay.co.uk><4E491D01.1090902@FreeBSD.org><570C5495A5E242F7946E806CA7AC5D68@multiplay.co.uk><4E4AD35C.7020504@FreeBSD.org><6A7238AED44542A880B082A40304D940@multiplay.co.uk><4E4BA21F.6010805@FreeBSD.org><581C95046B0948FC82D6F2E86948F87B@multiplay.co.uk><4E4BBA7F.30907@FreeBSD.org><88A6CE3E8B174E0694A3A9A5283479B4@multiplay.co.uk> <4E4C22D6.6070407@FreeBSD.org>

next in thread | previous in thread | raw e-mail | index | archive | help
----- Original Message ----- 
From: "Andriy Gapon" <avg@FreeBSD.org>

> Thanks to the debug that Steven provided and to the help that I received from
> Kostik, I think that now I understand the basic mechanics of this panic, but,
> unfortunately, not the details of its root cause.
> 
> It seems like everything starts with some kind of a race between terminating
> processes in a jail and termination of the jail itself.  This is where the
> details are very thin so far.  What we see is that a process (http) is in
> exit(2) syscall, in exit1() function actually, and past the place where P_WEXIT
> flag is set and even past the place where p_limit is freed and reset to NULL.
> At that place the thread calls prison_proc_free(), which calls prison_deref().
> Then, we see that in prison_deref() the thread gets a page fault because of what
> seems like a NULL pointer dereference.  That's just the start of the problem and
> its root cause.

Thats interesting, are you using http as an example or is that something thats
been gleaned from the debugging of our output? I ask as there's only one process
running in each of our jails and thats a single java process.

Now given your description there may be something I can add that may help
clarify what the cause could be.

In a nutshell the jail manager we're using will attempt to resurrect the jail
from a dieing state in a few specific scenarios.

Here's an exmaple:-
1. jail restart requested
2. jail is stopped, so the java processes is killed off, but active tcp sessions
may prevent the timely full shutdown of the jail.
3. if an existing jail is detected, i.e. a dieing jail from #2, instead of
starting a new jail we attach to the old one and exec the new java process.
4. if an existing jail isnt detected, i.e. where there where not hanging tcp
sessions and #2 cleanly shutdown the jail, a new jail is created, attached to
and the java exec'ed.

The system uses static jailid's so its possible to determine if an existing
jail for this "service" exists or not. This prevents duplicate services as
well as making services easy to identify by their jailid.

So what we could be seeing is a race between the jail shutdown and the attach
of the new process?

Now man 2 jail seems to indicate this is a valid use case for jail_set, as
it documents its support for JAIL_DYING as a valid option for flags, but I
suspect its something quite out of the ordinary to actually do, which may be
why this panic hasnt been seen before now.

As some background the reason we use static jailid's is to ensure only one
instance of the jailed service is running, and the reason we re-attach to
the dieing jail is so that jails can be restarted in a timely manor. Without
using the re-attach we would need to wait of all tcp sessions which have
been aborted to timeout.

> So, of course, Steven is interested in finding and fixing the root cause.  I
> hope we will get to that with some help from the "prison guards" :-)

Does the above potentially explain how we're getting to the situation
which generates the panic?

If so we can certainly look at using alternatives to the current design to
workaround this issue. Flagging the jail as permanent and using manual process
management and additional external locking to prevent duplicates, is what
instantly springs to mind.

    Regards
    Steve

================================================
This e.mail is private and confidential between Multiplay (UK) Ltd. and the person or entity to whom it is addressed. In the event of misdirection, the recipient is prohibited from using, copying, printing or otherwise disseminating it or any information contained in it. 

In the event of misdirection, illegible or incomplete transmission please telephone +44 845 868 1337
or return the E.mail to postmaster@multiplay.co.uk.




Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?4019027648B5493AAC4B654BD821DE88>