From: Koen Martens
Organization: Sonologic
Date: Fri, 04 Nov 2005 11:24:33 +0100
To: Robert Watson
Cc: Koen Martens, freebsd-hackers@FreeBSD.org, Dimitry Andric,
    Vinod Kashyap, jhb@FreeBSD.org
Subject: Re: panic in propagate_priority w/ postgresql under heavy load
Message-ID: <436B36E1.7010704@metro.cx>
In-Reply-To: <20051005090715.D84936@fledge.watson.org>

Robert Watson wrote:
> On Sun, 2
> Oct 2005, Koen Martens wrote:
>
>> kernel trap 12 with interrupts disabled
>>
>>
>> Fatal trap 12: page fault while in kernel mode
>> cpuid = 1; apic id = 06
>> fault virtual address   = 0x24
>> fault code              = supervisor read, page not present
>> instruction pointer     = 0x8:0xc051c253
>> stack pointer           = 0x10:0xe93efb3c
>> frame pointer           = 0x10:0xe93efb50
>> code segment            = base 0x0, limit 0xfffff, type 0x1b
>>                         = DPL 0, pres 1, def32 1, gran 1
>> processor eflags        = resume, IOPL = 0
>> current process         = 6092 (postgres)
>>
>> And that, that is all.. No ddb> no 'dumping xxxxMB', just that. So
>> basically, i fear this is a non-debugable problem, since putting in
>> witness and such slows the kernel to a point where the panic does not
>> occur anymore (at least, not in the 4 weeks i've been running the box
>> with witness & invariants). Clueless :)
>
>
> This looks like a NULL pointer dereference in kernel code. Probably,
> this is not a locking problem, so running without WITNESS to debug
> this should be OK. Are you using a serial console? If not, you might
> find that it increases the reliability of entering DDB. If this box
> is an SMP box, you may also want to add options KDB_STOP_NMI to your
> kernel config.
>
> Using gdb, could you work out what function 0xc051c253 is, and where
> in the function. You should be able to run gdb on your kernel.debug
> (or kernel on 7.x), and use "l *0xc051c253" to generate a pointer to
> the line and snippet, which will give us a substantial hint about what
> is happening.

Sorry for not getting back to you on this sooner; I have had rather a
busy period (lousy excuse, I know).

Anyway, I have currently downgraded the machine to a 5.3-RELEASE-p22
kernel, which seems to have solved the problem. There have been no
panics since (it has been two weeks since the downgrade).
Makes me a bit wary about upgrading anything to 6.x :)

Anyway, I did get into the ddb prompt on one of the last panics, and
put some of the resources online:

http://www.sonologic.nl/fbsd/

As you can see, I was pretty clueless about what to do, and just traced
about everything that was not swapped out. I did not put the core dump
online, as I don't feel like sharing that with the world. It is
available upon request, though, for those who want to take a crack at
this. I don't have a copy of the kernel.debug lying around, for which I
apologise.

However, I cannot upgrade to 5.4 again: we've had enough trouble with
this machine, and the user load on it has increased to a point where I
cannot afford these random panics anymore. I don't have spare identical
hardware lying around at this point to copy the entire setup for
testing purposes.

What I will try when I find some time is doing incremental upgrades
from 5.3-RELEASE-p22 to 5.4-RELEASE-p6, step by step, to see which
patchlevel introduces the problem.

Best,

Koen
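
The step-by-step upgrade plan above amounts to a linear search for the
first patchlevel that panics. A minimal sketch of that bookkeeping in
Python follows. The function name, the intermediate step names, and the
observation table are all hypothetical; only the two endpoint versions
come from this message, and the "panics" check stands in for installing
a kernel and running the real workload long enough to trust the result.

```python
from typing import Callable, Optional, Sequence


def first_bad_step(steps: Sequence[str],
                   panics: Callable[[str], bool]) -> Optional[str]:
    """Walk the upgrade steps in order and return the first that panics.

    Returns None if no step in the sequence panics.
    """
    for step in steps:
        # Each call represents one install-and-soak cycle on the machine.
        if panics(step):
            return step
    return None


if __name__ == "__main__":
    # Hypothetical step list: the intermediate names are placeholders,
    # not real patchlevels.
    steps = [
        "5.3-RELEASE-p22",
        "intermediate-1",
        "intermediate-2",
        "5.4-RELEASE-p6",
    ]
    # Made-up observations, purely to exercise the function; in practice
    # this table would be filled in by hand after each soak period.
    observed = {
        "5.3-RELEASE-p22": False,
        "intermediate-1": False,
        "intermediate-2": True,
        "5.4-RELEASE-p6": True,
    }
    print(first_bad_step(steps, lambda step: observed[step]))
```

Since each check may take weeks of uptime, a binary search over the
steps would need fewer kernels, but it is only valid under the
assumption that every step after the first bad one also panics.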