From: "Steven Hartland"
To: "Andriy Gapon"
Cc: freebsd-stable@FreeBSD.org
Date: Thu, 18 Aug 2011 00:15:56 +0100
Subject: Re: debugging frequent kernel panics on 8.2-RELEASE
Message-ID: <4019027648B5493AAC4B654BD821DE88@multiplay.co.uk>

----- Original Message -----
From: "Andriy Gapon"

> Thanks to the debug that Steven provided and to the help that I received from
> Kostik, I think that now I understand the basic mechanics of this panic, but,
> unfortunately, not the details of its root cause.
>
> It seems like everything starts with some kind of a race between terminating
> processes in a jail and termination of the jail itself.  This is where the
> details are very thin so far.
> What we see is that a process (http) is in
> exit(2) syscall, in exit1() function actually, and past the place where P_WEXIT
> flag is set and even past the place where p_limit is freed and reset to NULL.
> At that place the thread calls prison_proc_free(), which calls prison_deref().
> Then, we see that in prison_deref() the thread gets a page fault because of what
> seems like a NULL pointer dereference.  That's just the start of the problem and
> its root cause.

That's interesting. Are you using http as an example, or is that something
that's been gleaned from debugging our output? I ask because there's only one
process running in each of our jails, and that's a single java process.

Given your description, there is something I can add that may help clarify
what the cause could be. In a nutshell, the jail manager we're using will
attempt to resurrect a jail from a dying state in a few specific scenarios.

Here's an example:

1. A jail restart is requested.
2. The jail is stopped, so the java process is killed off, but active TCP
   sessions may prevent the timely full shutdown of the jail.
3. If an existing jail is detected, i.e. a dying jail from #2, then instead of
   starting a new jail we attach to the old one and exec the new java process.
4. If an existing jail isn't detected, i.e. there were no hanging TCP sessions
   and #2 cleanly shut down the jail, a new jail is created, attached to, and
   the java process exec'ed.

The system uses static jail IDs, so it's possible to determine whether an
existing jail for this "service" exists. This prevents duplicate services and
makes services easy to identify by their jail ID.

So what we could be seeing is a race between the jail shutdown and the attach
of the new process?
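To make the restart flow above easier to reason about, here is a minimal
sketch of the decision it describes, written as a pure Python model rather
than as real jail calls. The names JailState and restart_plan, and the
action strings, are made up for illustration and are not taken from our
jail manager. The race marker on the re-attach path reflects only the
hypothesis in this mail, not a confirmed diagnosis.

```python
from enum import Enum


class JailState(Enum):
    """What the manager finds behind the static jail ID after the stop (step 2)."""
    ABSENT = "absent"  # jail shut down cleanly, nothing left with this ID
    DYING = "dying"    # jail still exists, e.g. held open by TCP sessions


def restart_plan(state_after_stop):
    """Return (actions, may_race_with_teardown) for a restart request.

    Hypothetical model of steps 3 and 4: re-use a dying jail if one is
    found, otherwise create a fresh one.
    """
    if state_after_stop is JailState.DYING:
        # Step 3: attach to the old jail while its teardown may still be
        # in progress. This is the path suspected of racing.
        return ("attach to existing jail", "exec java"), True
    # Step 4: no existing jail, so create one and start the service in it.
    return ("create jail", "attach to new jail", "exec java"), False


if __name__ == "__main__":
    for state in JailState:
        actions, may_race = restart_plan(state)
        print(state.value, "->", ", ".join(actions), "| may race:", may_race)
```

Only the DYING branch overlaps in time with the jail's own teardown, which
is why that is the path I'd look at first.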
Now man 2 jail seems to indicate this is a valid use case for jail_set, as it
documents JAIL_DYING as a valid option for flags, but I suspect it's quite out
of the ordinary to actually do, which may be why this panic hasn't been seen
before now.

As some background, the reason we use static jail IDs is to ensure only one
instance of the jailed service is running, and the reason we re-attach to the
dying jail is so that jails can be restarted in a timely manner. Without the
re-attach we would need to wait for all aborted TCP sessions to time out.

> So, of course, Steven is interested in finding and fixing the root cause.  I
> hope we will get to that with some help from the "prison guards" :-)

Does the above potentially explain how we're getting into the situation that
generates the panic?

If so, we can certainly look at alternatives to the current design to work
around this issue. Flagging the jail as permanent and using manual process
management plus additional external locking to prevent duplicates is what
instantly springs to mind.

Regards
Steve