From owner-freebsd-hackers@FreeBSD.ORG  Thu Apr 13 19:14:55 2006
Return-Path: <owner-freebsd-hackers@FreeBSD.ORG>
X-Original-To: freebsd-hackers@freebsd.org
Delivered-To: freebsd-hackers@freebsd.org
Received: from mx1.FreeBSD.org (mx1.freebsd.org [216.136.204.125])
	by hub.freebsd.org (Postfix) with ESMTP id 3C74916A400
	for <freebsd-hackers@freebsd.org>; Thu, 13 Apr 2006 19:14:55 +0000 (UTC)
	(envelope-from matthew@digitalstratum.com)
Received: from mail.mundomateo.com (static-24-56-193-117.chrlmi.cablespeed.com
	[24.56.193.117])
	by mx1.FreeBSD.org (Postfix) with ESMTP id D2A1243D45
	for <freebsd-hackers@freebsd.org>; Thu, 13 Apr 2006 19:14:54 +0000 (GMT)
	(envelope-from matthew@digitalstratum.com)
Received: from [10.0.81.12] (unknown [10.0.81.1])
	by mail.mundomateo.com (Postfix) with ESMTP id 56EB62844D;
	Thu, 13 Apr 2006 15:14:54 -0400 (EDT)
Message-ID: <443EA32B.408@digitalstratum.com>
Date: Thu, 13 Apr 2006 15:14:51 -0400
From: Matthew Hagerty <matthew@digitalstratum.com>
Organization: Digital Stratum
User-Agent: Thunderbird 1.5 (Windows/20051201)
MIME-Version: 1.0
To: Alex Zbyslaw <xfb52@dial.pipex.com>
References: <443E95C1.4030404@digitalstratum.com> <443E9C38.709@dial.pipex.com>
In-Reply-To: <443E9C38.709@dial.pipex.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
X-BeenThere: freebsd-hackers@freebsd.org
X-Mailman-Version: 2.1.5
Precedence: list
Reply-To: matthew@digitalstratum.com
List-Id: Technical Discussions relating to FreeBSD
	<freebsd-hackers.freebsd.org>
List-Unsubscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>, 
	<mailto:freebsd-hackers-request@freebsd.org?subject=unsubscribe>
List-Archive: <http://lists.freebsd.org/pipermail/freebsd-hackers>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Help: <mailto:freebsd-hackers-request@freebsd.org?subject=help>
List-Subscribe: <http://lists.freebsd.org/mailman/listinfo/freebsd-hackers>,
	<mailto:freebsd-hackers-request@freebsd.org?subject=subscribe>
X-List-Received-Date: Thu, 13 Apr 2006 19:14:55 -0000

Alex Zbyslaw wrote:
> Matthew Hagerty wrote:
>
>> Can anyone shed some light on this, give me some options to try?  
>> What happened to kernel panics and such when there were serious 
>> errors going on?  The only glimmer of information I have is that 
>> *one* time there was an error on the console about there not being 
>> any RAID controller available.  I did purchase a spare controller and 
>> I'm about to swap it out and see if it helps, but for some reason I 
>> doubt it.  If a controller like that was failing, I would certainly 
>> hope to see some serious error messages or panics going on.
>>
>> I have been running FreeBSD since version 1.01 and have never had a 
>> box so unstable in the last 12 or so years, especially one that is 
>> supposed to be "server" quality instead of the make-shift ones I put 
>> together with desktop hardware.  And last, I'm getting sick of my 
>> Linux admin friends telling me "told you so!  should have run 
>> Linux...", please give me something to stick in their pie holes!
>
> Several times now I have had Linux servers (and production quality 
> ones, not built by me ones :-)) die in a somewhat similar fashion.  In 
> every case the cause has been either a flaky disk or a flaky disk 
> controller, or some combination.
>
> What seems to happen is that the disk is entirely "lost" by the OS.  
> At that point any process which never accesses the disk (i.e. is 
> already in memory) is able to run but the moment any process tries to 
> access the disk it locks up.  So you can't ssh in to the server, but 
> if you happen to be logged in, your shell is probably cached and keeps 
> working.  If you typed ls recently, you can run ls (but see nothing or 
> get a cryptic error message like I/O Error), for example.
>
> Clearly nothing is logged as the disk has gone AWOL.  Often the 
> machines behaved fine after a reboot and then did the same some time 
> later.  In one case, the supposedly transparent "RAID-1" array was 
> completely broken, but Linux logged precisely nothing to tell you :-(  
> You can stick that where you like in your Linux friends :-O
>
> This somewhat fits with your symptoms.  If the disk vanished, then all 
> those postgres processes would probably fail unless everything they 
> needed happened to be cached in RAM.  The Web server and PHP scripts 
> probably are cached in RAM if they are called frequently so you might 
> well see lots of postgres processes stacked up.
>
> LSI MegaRAID has a CLI of sorts in sysutils/megarc.  You might start 
> with that (and check the RAID BIOS next time the machine reboots).
>
> I'd say that if you have an alternative RAID controller that would be 
> a good place to start.  If LSI do any standalone diagnostics, you could 
> try those.
>
> --Alex
>
> PS Kernels usually panic when some internal state is just too wrong 
> to continue.  A disk or even a controller disappearing isn't going to 
> make the internal state wrong - it's just a device gone missing - so I 
> would not be surprised if the machine just locked up.
>

Hmm, it seems odd that a disk controller vanishing would not cause some 
sort of console message.  Even if the disk device is gone, 
/dev/console should still be intact to display an error, no?  Also, a 
disk device that is suddenly missing seems pretty serious to me, 
since a disk is one of the main devices that modern OSes cannot run 
without (generally speaking).  I would think *some* console message 
would be warranted.
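
In the meantime, something like the following could at least tell me 
when the disk stops responding.  This is only a rough sketch, not 
anything the kernel or the RAID driver provides: the function name 
probe_disk, the probed directory, and the timeout are all made up for 
illustration.  It forks so the child is a copy of a process already in 
memory (per Alex's point that such processes keep running), lets the 
child do a small write plus fsync, and has the parent give up after a 
timeout instead of blocking on the I/O itself.

```python
import os
import signal
import time


def probe_disk(directory, timeout=5.0):
    """Return True if a small write+fsync in `directory` finishes in time.

    Hypothetical helper: the name, arguments and default are assumptions.
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the risky I/O here so a hang cannot block the parent.
        status = 1
        try:
            path = os.path.join(directory, ".disk_probe.%d" % os.getpid())
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # ask the OS to push the write to the device
            finally:
                os.close(fd)
            os.unlink(path)
            status = 0
        except OSError:
            status = 1  # an I/O error counts as a failed probe
        finally:
            os._exit(status)  # never fall through into the parent's code

    # Parent: poll for the child without ever blocking on it.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, wait_status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
        time.sleep(0.1)

    # Timed out.  A child stuck in disk wait may ignore the signal, so the
    # parent does not wait for it (the child is left unreaped in this sketch).
    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    return False
```

A long-running process started at boot could call probe_disk() on the 
database volume every minute or so and write a line to /dev/console 
when it returns False.  Whether that write reaches the screen during a 
real hang is an assumption I would have to test.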

I'll see if there are any diag programs for the controller and I'll go 
ahead and swap the controller out.  I wonder if the RAID configuration 
is stored in the controller or on the disks?  I'd hate to have to 
rebuild the server install...

Thanks for the info.
Matthew