From: Koen Martens
Organization: Sonologic
Date: Mon, 19 Sep 2005 21:35:44 +0200
To: Vinod Kashyap
Cc: freebsd-hackers@freebsd.org, Dimitry Andric
Subject: Re: panic in propagate_priority w/ postgresql under heavy load
Message-ID: <432F1310.80007@metro.cx>
In-Reply-To: <2B3B2AA816369A4E87D7BE63EC9D2F269B7B4D@SDCEXCHANGE01.ad.amcc.com>

Vinod Kashyap wrote:
> You seem to be booting off of a 9000 (twa) controller and not 7000/8000
> (twe).
> It could be because of a 9000 firmware bug that you are not being able
> to get the dump. The firmware wrongly interprets physical address 0x0
> as invalid during dumps, and fails the operations. This bug will be
> fixed in future firmware releases.

Ok, it's been a while; here is an update on this.

I ran a heavily instrumented kernel on the server for two weeks, and it
did not crash in that time. I then took out the witness and kdb/ddb
stuff, because the decreased performance was a bit of a nuisance, but I
retained the ability to obtain a crash dump. I had to limit physical
memory to 1.8GB (hw.physmem in loader.conf), because swap and physmem
are both 2GB. A test with 'reboot -d' gave me a core dump.

Without the debug stuff in the kernel, it crashed within 2 days, same
story: postgresql process, function propagate_priority. However, no
dump was written to disk :(

Furthermore, I've been seeing the same crash (in propagate_priority) on
another box, in mysql processes. Both servers seem to panic every 2-3
days. I have another server with the exact same hardware configuration,
but it is idle most of the time. Haven't seen that one crash yet.

I am now thinking that it is a bug in the twa driver, so I'll have to
dig into that. It also seems to be some sort of concurrency or otherwise
timing-sensitive issue, because slowing the kernel down with debug code
seems to avoid the panic. But as I am completely new to the FreeBSD
kernel and don't even know what turnstiles are, I imagine I will have a
hard time. So if anyone can offer some help, please :)

Ok, thanks for your attention,

Koen

--
K.F.J. Martens, Sonologic, http://www.sonologic.nl/
Networking, hosting, embedded systems, unix, artificial intelligence.
Public PGP key: http://www.metro.cx/pubkey-gmc.asc
Wondering about the funny attachment your mail program can't read?
Visit http://www.openpgp.org/