From owner-freebsd-hackers@FreeBSD.ORG Thu Apr 13 19:06:00 2006
Message-ID: <443EA113.10205@digitalstratum.com>
Date: Thu, 13 Apr 2006 15:05:55 -0400
From: Matthew Hagerty
Organization: Digital Stratum
To: John Baldwin
Cc: freebsd-hackers@freebsd.org
References: <443E95C1.4030404@digitalstratum.com> <200604131436.17942.jhb@freebsd.org>
In-Reply-To: <200604131436.17942.jhb@freebsd.org>
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics

John Baldwin wrote:
> On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
>
>> Greetings,
>>
>> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon Intel
>> motherboard with a LSILogic MegaRAID (amr0) controller. This machine
>> has been running for about 2 years now, and was very stable until I
>> updated from 5.3 to 5.4, and now 6.0.
>> The crashing seems to be totally
>> random and I have had it crash in as little as 12 hours and as long as
>> 143 days.
>>
>> When the box goes down it does so in a strange way. First, it still
>> responds to network probes like ping (usually), however, all console
>> access is ignored. Also, some network ports still respond, like a
>> telnet to port 22 to test SSH will yield an SSH banner, but trying to
>> connect with SSH just hangs. Sometimes this is also true of the SMTP
>> server, but not always. This also makes it impossible for me to use
>> CARP to swap to the recently purchased spare machine, since the network
>> interface is generally still responding so CARP does not detect a problem.
>>
>> My biggest problem with this is that there are *never* any console
>> messages or log entries in any logs, no warnings about disk failure,
>> buffer exhaustion, system failures, etc.. The machine simply seems to
>> stop responding and the only way to correct the problem is a hard reboot.
>>
>> A strange thing did happen yesterday though, I believe I caught the box
>> on the verge of failure. I was SSH'd in and did a ps to check things
>> out. There were about 100 of these entries:
>>
>> 55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
>>
>> The box runs a web-based app and connects to a local Postgres DB which
>> seemed to be unable to start new connections being requested by the PHP
>> scripts. At any rate, I stopped Apache and then tried to stop Postgres
>> which resulted in (or just happened to coincide with) the box locking up
>> and no longer responding to my SSH commands or attempts to reconnect
>> with SSH. I hardly think this is a Postgres problem, but even if it
>> was, a userland app should *not* be able to bring down a box...
>>
>> Can anyone shed some light on this, give me some options to try? What
>> happened to kernel panics and such when there were serious errors going
>> on?
>> The only glimmer of information I have is that *one* time there was
>> an error on the console about there not being any RAID controller
>> available. I did purchase a spare controller and I'm about to swap it
>> out and see if it helps, but for some reason I doubt it. If a
>> controller like that was failing, I would certainly hope to see some
>> serious error messages or panics going on.
>>
>> I have been running FreeBSD since version 1.01 and have never had a box
>> so unstable in the last 12 or so years, especially one that is supposed
>> to be "server" quality instead of the make-shift ones I put together
>> with desktop hardware. And last, I'm getting sick of my Linux admin
>> friends telling me "told you so! should have run Linux...", please give
>> me something to stick in their pie holes!
>>
>
> It sounds like a livelock (or deadlock) more than a crash. Can you add
> 'DDB' in your kernel config and break into the debugger when it hangs
> and grab the output of 'ps'?
>

I can probably figure out how to compile in DDB (I've never done it
before though), but just two questions:

1. How do I break into DDB and grab the ps output?

2. How can I log in if the box is not responding to SSH or the console?
It was only by sheer luck that I caught it yesterday just before the
lockup; I have never been able to do that before.

Thanks,
Matthew