From: John Baldwin
To: Julian Elischer
Cc: matthew@digitalstratum.com, freebsd-hackers@freebsd.org
Date: Thu, 13 Apr 2006 15:33:41 -0400
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-Id: <200604131533.44115.jhb@freebsd.org>
In-Reply-To: <443EA35B.4030909@elischer.org>
References: <443E95C1.4030404@digitalstratum.com> <443EA113.10205@digitalstratum.com> <443EA35B.4030909@elischer.org>

On Thursday 13 April 2006 15:15, Julian Elischer wrote:
> Matthew Hagerty wrote:
>
> > John Baldwin wrote:
> >
> >> On
> >> Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
> >>
> >>
> >>> Greetings,
> >>>
> >>> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon
> >>> Intel motherboard with a LSILogic MegaRAID (amr0) controller. This
> >>> machine has been running for about 2 years now, and was very stable
> >>> until I updated from 5.3 to 5.4, and now 6.0. The crashing seems to
> >>> be totally random and I have had it crash in as little as 12 hours
> >>> and as long as 143 days.
> >>>
> >>> When the box goes down it does so in a strange way. First, it still
> >>> responds to network probes like ping (usually), however, all console
> >>> access is ignored. Also, some network ports still respond, like a
> >>> telnet to port 22 to test SSH will yield an SSH banner, but trying
> >>> to connect with SSH just hangs. Sometimes this is also true of the
> >>> SMTP server, but not always. This also makes it impossible for me
> >>> to use CARP to swap to the recently purchased spare machine, since
> >>> the network interface is generally still responding so CARP does not
> >>> detect a problem.
> >>>
> >>> My biggest problem with this is that there are *never* any console
> >>> messages or log entries in any logs, no warnings about disk failure,
> >>> buffer exhaustion, system failures, etc.. The machine simply seems
> >>> to stop responding and the only way to correct the problem is a hard
> >>> reboot.
> >>>
> >>> A strange thing did happen yesterday though, I believe I caught the
> >>> box on the verge of failure. I was SSH'd in and did a ps to check
> >>> things out. There were about 100 of these entries:
> >>>
> >>> 55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup
> >>> (postgres)
> >>>
> >>> The box runs a web-based app and connects to a local Postgres DB
> >>> which seemed to be unable to start new connections being requested
> >>> by the PHP scripts.
> >>> At any rate, I stopped Apache and then tried to
> >>> stop Postgres which resulted in (or just happened to coincide with)
> >>> the box locking up and no longer responding to my SSH commands or
> >>> attempts to reconnect with SSH. I hardly think this is a Postgres
> >>> problem, but even if it was, a userland app should *not* be able to
> >>> bring down a box...
> >>>
> >>> Can anyone shed some light on this, give me some options to try?
> >>> What happened to kernel panics and such when there were serious
> >>> errors going on? The only glimmer of information I have is that
> >>> *one* time there was an error on the console about there not being
> >>> any RAID controller available. I did purchase a spare controller
> >>> and I'm about to swap it out and see if it helps, but for some
> >>> reason I doubt it. If a controller like that was failing, I would
> >>> certainly hope to see some serious error messages or panics going on.
> >>>
> >>> I have been running FreeBSD since version 1.01 and have never had a
> >>> box so unstable in the last 12 or so years, especially one that is
> >>> supposed to be "server" quality instead of the make-shift ones I put
> >>> together with desktop hardware. And last, I'm getting sick of my
> >>> Linux admin friends telling me "told you so! should have run
> >>> Linux...", please give me something to stick in their pie holes!
> >>>
> >>
> >>
> >> It sounds like a livelock (or deadlock) more than a crash. Can you add
> >> 'DDB' in your kernel config and break into the debugger when it hangs
> >> and grab the output of 'ps'?
> >>
> >>
> >
> > I can probably figure out how to compile in DDB (I've never done it
> > before though), but just two questions:
> >
> add
> options DDB
> to your kernel config file.
>
> >
> > 1. How do I break into DDB and grab the ps output?
>
> on the console, hit keys (at once)
>
> that should put you into the debugger..
>
> then 'ps' will give you some output.
>
> It's a lot to write down but I've found a camera phone makes good enough
> snapshots :-)
>
> alternatively you can use a serial console, but getting into the
> debugger is harder,
> you have to have compiled in ALT_BREAK_TO_DEBUGGER
> into your kernel by adding
>
> # Solaris implements a new BREAK which is initiated by a character
> # sequence CR ~ ^b which is similar to a familiar pattern used on
> # Sun servers by the Remote Console.
> options ALT_BREAK_TO_DEBUGGER
>
> to the kernel config file you are using..

Or just use 'options BREAK_TO_DEBUGGER' and send a serial break signal to
break into the debugger.

Matthew,

There's also a chapter in the handbook that explains how to use ddb, set up
a serial console, etc.

-- 
John Baldwin <>< http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve" = http://www.FreeBSD.org
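
For reference, the options mentioned in this thread could be collected in a
custom kernel config roughly as sketched below. This is only an illustration,
not a tested configuration. The 'options KDB' line is an assumption that does
not appear in the thread: on 5.x/6.x the DDB backend is normally paired with
the KDB framework, so check the NOTES file for your release before relying on
it.

```
# Sketch of a debugger section for a custom kernel config file.
# Assumption: KDB is needed alongside DDB on this release (see NOTES).
options 	KDB			# kernel debugger framework (assumed)
options 	DDB			# in-kernel debugger; provides 'ps'
options 	BREAK_TO_DEBUGGER	# a serial BREAK drops into the debugger
#options 	ALT_BREAK_TO_DEBUGGER	# alternative: CR ~ ^b on a serial console
```

After rebuilding and booting the new kernel, breaking into the debugger
during a hang should give a 'db>' prompt, where 'ps' lists the processes.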