From owner-freebsd-hackers@FreeBSD.ORG  Thu Apr 13 19:06:00 2006
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
X-Original-To: freebsd-hackers@freebsd.org
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id F040B16A402;
	Thu, 13 Apr 2006 19:05:59 +0000 (UTC)
	(envelope-from matthew@digitalstratum.com)
Received: from mail.mundomateo.com (static-24-56-193-117.chrlmi.cablespeed.com
	[24.56.193.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id 942D743D45;
	Thu, 13 Apr 2006 19:05:59 +0000 (GMT)
	(envelope-from matthew@digitalstratum.com)
Received: from [10.0.81.12] (unknown [10.0.81.1])
	by mail.mundomateo.com (Postfix) with ESMTP id A173A2844D;
	Thu, 13 Apr 2006 15:05:58 -0400 (EDT)
Message-ID: <443EA113.10205@digitalstratum.com>
Date: Thu, 13 Apr 2006 15:05:55 -0400
From: Matthew Hagerty <matthew@digitalstratum.com>
Organization: Digital Stratum
User-Agent: Thunderbird 1.5 (Windows/20051201)
MIME-Version: 1.0
To: John Baldwin <jhb@freebsd.org>
References: <443E95C1.4030404@digitalstratum.com>
	<200604131436.17942.jhb@freebsd.org>
In-Reply-To: <200604131436.17942.jhb@freebsd.org>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: matthew@digitalstratum.com
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
X-List-Received-Date: Thu, 13 Apr 2006 19:06:00 -0000

John Baldwin wrote:
> On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
>
>> Greetings,
>>
>> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon Intel 
>> motherboard with a LSILogic MegaRAID (amr0) controller.  This machine 
>> has been running for about 2 years now, and was very stable until I 
>> updated from 5.3 to 5.4, and now 6.0.  The crashing seems to be totally 
>> random and I have had it crash in as little as 12 hours and as long as 
>> 143 days.
>>
>> When the box goes down it does so in a strange way.  First, it still 
>> responds to network probes like ping (usually), however, all console 
>> access is ignored.  Also, some network ports still respond, like a 
>> telnet to port 22 to test SSH will yield an SSH banner, but trying to 
>> connect with SSH just hangs.  Sometimes this is also true of the SMTP 
>> server, but not always.  This also makes it impossible for me to use 
>> CARP to swap to the recently purchased spare machine, since the network 
>> interface is generally still responding so CARP does not detect a problem.
>>
>> My biggest problem with this is that there are *never* any console 
>> messages or log entries in any logs, no warnings about disk failure, 
>> buffer exhaustion, system failures, etc.  The machine simply seems to 
>> stop responding and the only way to correct the problem is a hard reboot.
>>
>> A strange thing did happen yesterday though, I believe I caught the box 
>> on the verge of failure.  I was SSH'd in and did a ps to check things 
>> out.  There were about 100 of these entries:
>>
>> 55050  ??  D      0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
>>
>> The box runs a web-based app and connects to a local Postgres DB which 
>> seemed to be unable to start new connections being requested by the PHP 
>> scripts.  At any rate, I stopped Apache and then tried to stop Postgres 
>> which resulted in (or just happened to coincide with) the box locking up 
>> and no longer responding to my SSH commands or attempts to reconnect 
>> with SSH.  I hardly think this is a Postgres problem, but even if it 
>> was, a userland app should *not* be able to bring down a box...
>>
>> Can anyone shed some light on this, give me some options to try?  What 
>> happened to kernel panics and such when there were serious errors going 
>> on?  The only glimmer of information I have is that *one* time there was 
>> an error on the console about there not being any RAID controller 
>> available.  I did purchase a spare controller and I'm about to swap it 
>> out and see if it helps, but for some reason I doubt it.  If a 
>> controller like that was failing, I would certainly hope to see some 
>> serious error messages or panics going on.
>>
>> I have been running FreeBSD since version 1.01 and have never had a box 
>> so unstable in the last 12 or so years, especially one that is supposed 
>> to be "server" quality instead of the make-shift ones I put together 
>> with desktop hardware.  And last, I'm getting sick of my Linux admin 
>> friends telling me "told you so!  should have run Linux...", please give 
>> me something to stick in their pie holes!
>>
>
> It sounds like a livelock (or deadlock) more than a crash.  Can you add
> 'DDB' in your kernel config and break into the debugger when it hangs
> and grab the output of 'ps'?
>
>
I can probably figure out how to compile in DDB (I've never done it 
before, though), but I have two questions:

1. How do I break into DDB and grab the ps output?

2. How can I log in if the box is not responding to SSH or the console?  
It was only by sheer luck that I caught it yesterday just before the 
lockup; I have never been able to do that before.

Thanks,
Matthew