From owner-freebsd-current@FreeBSD.ORG Wed Nov 4 00:17:36 2009
Return-Path:
Delivered-To: freebsd-current@FreeBSD.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:4f8:fff6::34])
    by hub.freebsd.org (Postfix) with ESMTP id EC1C81065672
    for ; Wed, 4 Nov 2009 00:17:35 +0000 (UTC)
    (envelope-from gallasch@free.de)
Received: from smtp.free.de (smtp.free.de [91.204.6.103])
    by mx1.freebsd.org (Postfix) with ESMTP id 5D2398FC16
    for ; Wed, 4 Nov 2009 00:17:35 +0000 (UTC)
Received: (qmail 63487 invoked from network); 4 Nov 2009 01:17:33 +0100
Received: from smtp.free.de (HELO orwell.free.de) (gallasch@free.de@[91.204.4.103])
    (envelope-sender )
    by smtp.free.de (qmail-ldap-1.03) with AES128-SHA encrypted SMTP
    for ; 4 Nov 2009 01:17:33 +0100
Date: Wed, 4 Nov 2009 01:17:16 +0100
From: Kai Gallasch
To: Gavin Atkinson
Message-ID: <20091104011716.768baae5@orwell.free.de>
In-Reply-To: <1257244960.98619.36.camel@buffy.york.ac.uk>
References: <20091031231545.493cee89@boiler.free.de>
    <1257244960.98619.36.camel@buffy.york.ac.uk>
X-Mailer: Claws Mail 3.7.0 (GTK+ 2.18.2; powerpc-apple-darwin9.7.0)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: freebsd-current@FreeBSD.org
Subject: Re: 8.0RC2 amd64 - kernel panic running make buildworld
X-BeenThere: freebsd-current@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
List-Id: Discussions about the use of FreeBSD-current
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Wed, 04 Nov 2009 00:17:36 -0000

On Tue, 03 Nov 2009 10:42:40 +0000, Gavin Atkinson wrote:

> On Sat, 2009-10-31 at 23:15 +0100, Kai Gallasch wrote:
> > Hi.
> >
> > I installed 8.0RC2-amd64 on an 8-core opteron server a few days ago.
> >
> > When I try to do a make buildworld or make buildkernel the server
> > reboots without any message left in the logs.
> > The same happens when building bigger ports (for example ruby18 or
> > perl58)
>
> First place I think I'd start is by running memtest86 on the machine
> overnight. This sounds like a possible hardware issue to me, it would
> be good to see if we can confirm that that is the case.

I will do so tomorrow.

I have already taken the following actions to rule out a hardware
problem:

- ran several passes with diagnostic software from the manufacturer
- reset BIOS settings to default
- upgraded BIOS to newest release
- booted server from 2 year old backup BIOS
- took out the only pair of RAM modules that was different from the
  rest of the modules
- installed FreeBSD 7.2-STABLE on the server to try to reproduce the
  kernel panic (no panic with 7.2)
- installed 8.0-BETA4 (crash)

Besides, the server was in production with 7.2 for some time without
showing any such problems.

> > Now my idea was to install the old 8.0-BETA4 and upgrade to RC2
> > through makeworld + buildkernel (gdb+witness). But no luck. When
> > trying to upgrade to RC2 the 8.0-BETA4 also crashes. At least
> > 8.0-BETA4 has debug + witness active in the GENERIC kernel..
> >
> > So below some debug output of 8.0-BETA4 crashing. Has a vfs/ffs LOR
> > problem with the BETA4 already been fixed?
>
> The debug output you included was just lock order reversals, and
> doesn't seem to be related to your crash.

Sorry for causing possible confusion about this. I realized this after
my mail was already out.

> I think 8.0-BETA4 still had the debugger compiled in (you can test by
> pressing ctrl-alt-escape on the console, if you do drop to the
> debugger, give the "c" command to continue).
>
> If the debugger is compiled in, then the spontaneous reboot without
> dropping to the debugger suggests even more that it may be hardware
> related. If you do get to the debugger, a copy of all of the messages
> on screen and the output of the "bt" command would be very useful.
> When you do your kernel recompile, please include full debugging,
> including WITNESS, INVARIANTS, KDB, DDB etc.

In the meantime I managed to install a RELENG_8 world + GENERIC kernel
with all debug options enabled on the crashing server. (I mounted
/usr/src and /usr/obj on another server running 8.0RC1 through NFS and
did the buildworld + buildkernel over there.)

So now I have a debug kernel available with dumpdev + dumpdir defined.

Here are my latest findings on this issue:

- In about 80% of the cases, running a buildworld crashes the server
  without a crash dump being written to dumpdir. The server just
  reboots.

- In about 20% of the cases, the buildworld gets stuck in a
  non-terminating process that eats up 100% CPU. This process cannot be
  killed. When I restart the buildworld, the server then reboots again.

- It makes no difference whether I run make buildworld with -j1 or
  -j8; the result is the same.

> It depends what the bug is to be honest. So far there isn't really
> enough information to determine the cause, and therefore there isn't
> really enough info for a PR.

Mark Atkinson also commented on my mail and gave this hint:

"If vm.pmap.pg_ps_enabled is 1 in 8.0-rc2, you might try rebooting with
vm.pmap.pg_ps_enabled=0 in /boot/loader.conf and try another
buildworld."

So I thought why not and just tried it, and surprise: setting
vm.pmap.pg_ps_enabled=0 in loader.conf resolves my problem with 8.0RC2
crashing when doing a buildworld.

After a successful buildworld and buildkernel I rebooted the server
again with the vm.pmap.pg_ps_enabled=0 line commented out, and the
problem was there again. Then I disabled the option again
(vm.pmap.pg_ps_enabled=0) in loader.conf, rebooted and ran make
buildworld: no problem.

Seems to be deterministic. With vm.pmap.pg_ps_enabled=1 the server
crashes without being able to write crash dumps to dumpdev (at least on
this specific Proliant DL385G2 server).

--Kai.

--
You need more time; and you probably always will.
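
P.S. For anyone who wants to try the same workaround, here is a minimal
sketch of the loader.conf entry discussed above. The comments are only
illustrative, and the effect is confirmed on nothing more than the one
machine described in this mail.

```
# /boot/loader.conf
# Disable large page (superpage) mappings in the amd64 pmap.
# This is a boot-time tunable, so a reboot is needed for it to apply.
vm.pmap.pg_ps_enabled="0"
```

After the reboot, running "sysctl vm.pmap.pg_ps_enabled" should report
the value 0.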