Date:      Thu, 13 Apr 2006 14:36:16 -0400
From:      John Baldwin <jhb@freebsd.org>
To:        freebsd-hackers@freebsd.org, matthew@digitalstratum.com
Subject:   Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID:  <200604131436.17942.jhb@freebsd.org>
In-Reply-To: <443E95C1.4030404@digitalstratum.com>
References:  <443E95C1.4030404@digitalstratum.com>

On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
> Greetings,
> 
> I'm running 6.0-RELEASE-p5 on a Toshiba-built server: dual Xeon Intel 
> motherboard with an LSILogic MegaRAID (amr0) controller.  This machine 
> has been running for about 2 years now, and was very stable until I 
> updated from 5.3 to 5.4, and now 6.0.  The crashing seems to be totally 
> random and I have had it crash in as little as 12 hours and as long as 
> 143 days.
> 
> When the box goes down it does so in a strange way.  First, it still 
> responds to network probes like ping (usually), however, all console 
> access is ignored.  Also, some network ports still respond, like a 
> telnet to port 22 to test SSH will yield an SSH banner, but trying to 
> connect with SSH just hangs.  Sometimes this is also true of the SMTP 
> server, but not always.  This also makes it impossible for me to use 
> CARP to swap to the recently purchased spare machine, since the network 
> interface is generally still responding so CARP does not detect a problem.
> 
> My biggest problem with this is that there are *never* any console 
> messages or log entries in any logs, no warnings about disk failure, 
> buffer exhaustion, system failures, etc.  The machine simply seems to 
> stop responding and the only way to correct the problem is a hard reboot.
> 
> A strange thing did happen yesterday, though: I believe I caught the box 
> on the verge of failure.  I was SSH'd in and did a ps to check things 
> out.  There were about 100 of these entries:
> 
> 55050  ??  D      0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
> 
> The box runs a web-based app and connects to a local Postgres DB which 
> seemed to be unable to start new connections being requested by the PHP 
> scripts.  At any rate, I stopped Apache and then tried to stop Postgres 
> which resulted in (or just happened to coincide with) the box locking up 
> and no longer responding to my SSH commands or attempts to reconnect 
> with SSH.  I hardly think this is a Postgres problem, but even if it 
> was, a userland app should *not* be able to bring down a box...
> 
> Can anyone shed some light on this, give me some options to try?  What 
> happened to kernel panics and such when there were serious errors going 
> on?  The only glimmer of information I have is that *one* time there was 
> an error on the console about there not being any RAID controller 
> available.  I did purchase a spare controller and I'm about to swap it 
> out and see if it helps, but for some reason I doubt it.  If a 
> controller like that was failing, I would certainly hope to see some 
> serious error messages or panics going on.
> 
> I have been running FreeBSD since version 1.01 and have never had a box 
> so unstable in the last 12 or so years, especially one that is supposed 
> to be "server" quality instead of the makeshift ones I put together 
> with desktop hardware.  And last, I'm getting sick of my Linux admin 
> friends telling me "told you so!  should have run Linux...", please give 
> me something to stick in their pie holes!

It sounds like a livelock (or deadlock) more than a crash.  Can you add
'DDB' in your kernel config and break into the debugger when it hangs
and grab the output of 'ps'?
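
For illustration, here is a minimal sketch of what that could look like
in a custom kernel config.  It assumes a 6.x source tree, where DDB sits
on top of the KDB framework.  The BREAK_TO_DEBUGGER line only matters
with a serial console and is an assumption about the setup, not
something taken from the original report.

```
# Additions to a custom kernel config (e.g. a copy of GENERIC)
options 	KDB			# kernel debugger framework
options 	DDB			# interactive in-kernel debugger
options 	BREAK_TO_DEBUGGER	# serial console BREAK enters the debugger
```

After rebuilding and booting the new kernel, Ctrl+Alt+Esc on a syscons
console (or a BREAK on a serial console) should drop to the db> prompt,
where 'ps' lists every process along with its state and wait channel.
If the machine is too wedged to respond to the break at all, that is
useful information in itself.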

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
