Date:      Thu, 13 Apr 2006 15:14:51 -0400
From:      Matthew Hagerty <matthew@digitalstratum.com>
To:        Alex Zbyslaw <xfb52@dial.pipex.com>
Cc:        freebsd-hackers@freebsd.org
Subject:   Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID:  <443EA32B.408@digitalstratum.com>
In-Reply-To: <443E9C38.709@dial.pipex.com>
References:  <443E95C1.4030404@digitalstratum.com> <443E9C38.709@dial.pipex.com>

Alex Zbyslaw wrote:
> Matthew Hagerty wrote:
>
>> Can anyone shed some light on this, give me some options to try?  
>> What happened to kernel panics and such when there were serious 
>> errors going on?  The only glimmer of information I have is that 
>> *one* time there was an error on the console about there not being 
>> any RAID controller available.  I did purchase a spare controller and 
>> I'm about to swap it out and see if it helps, but for some reason I 
>> doubt it.  If a controller like that was failing, I would certainly 
>> hope to see some serious error messages or panics going on.
>>
>> I have been running FreeBSD since version 1.01 and have never had a 
>> box so unstable in the last 12 or so years, especially one that is 
>> supposed to be "server" quality instead of the make-shift ones I put 
>> together with desktop hardware.  And last, I'm getting sick of my 
>> Linux admin friends telling me "told you so!  should have run 
>> Linux...", please give me something to stick in their pie holes!
>
> Several times now I have had Linux servers (and production quality 
> ones, not built by me ones :-)) die in a somewhat similar fashion.  In 
> every case the cause has been either a flaky disk or a flaky disk 
> controller, or some combination.
>
> What seems to happen is that the disk is entirely "lost" by the OS.  
> At that point any process which never accesses the disk (i.e. is 
> already in memory) is able to run but the moment any process tries to 
> access the disk it locks up.  So you can't ssh in to the server, but 
> if you happen to be logged in, your shell is probably cached and keeps 
> working.  If you typed ls recently, you can run ls (but see nothing or 
> get a cryptic error message like I/O Error), for example.
>
> Clearly nothing is logged as the disk has gone AWOL.  Often the 
> machines behaved fine after a reboot and then did the same some time 
> later.  In one case, the supposedly transparent "RAID-1" array was 
> completely broken, but Linux logged precisely nothing to tell you :-(  
> You can stick that where you like in your Linux friends :-O
>
> This somewhat fits with your symptoms.  If the disk vanished, then all 
> those postgres processes would probably fail unless everything they 
> needed happened to be cached in RAM.  The Web server and PHP scripts 
> probably are cached in RAM if they are called frequently so you might 
> well see lots of postgres processes stacked up.
>
> LSI MegaRAID has a CLI of sorts in sysutils/megarc.  You might start 
> with that (and check the RAID BIOS next time the machine reboots).
>
> I'd say that if you have an alternative RAID controller that would be 
> a good place to start.  If LSI do any standalone diagnostics, you could 
> try those.
>
> --Alex
>
> PS Kernels usually panic when some internal state is just too wrong 
> to continue.  A disk or even a controller disappearing isn't going to 
> make the internal state wrong - it's just a device gone missing - so I 
> would not be surprised if the machine just locked up.
>

Hmm, it seems odd that a disk controller vanishing would not cause some 
sort of console message.  Even if the disk device is gone, /dev/console 
should still be intact to display an error, no?  Also, a disk device 
that is suddenly missing seems pretty serious to me, since a disk is one 
of the main devices that modern OSes cannot run without (generally 
speaking).  I would think *some* console message would be warranted.
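
For illustration only, the kind of check that could put *some* message on 
the console might be sketched as below.  This is a minimal Python sketch 
under stated assumptions, not something from the thread: the names 
probe_disk and report are hypothetical, the 5 second timeout is arbitrary, 
and the probe only helps if the watchdog process itself is already 
resident in memory when the disk goes away (otherwise it would hang like 
everything else, as Alex describes).

```python
import os
import sys
import tempfile
import threading


def probe_disk(directory, timeout=5.0):
    """Return True if a small write + fsync in `directory` completes
    within `timeout` seconds, False if it fails or stalls.

    Hypothetical helper: the I/O runs in a daemon thread because a
    thread blocked on a vanished disk cannot be interrupted; the caller
    simply stops waiting for it.
    """
    done = threading.Event()
    result = {"ok": False}

    def worker():
        try:
            fd, path = tempfile.mkstemp(dir=directory, prefix=".diskprobe-")
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # force the data out to the device
            finally:
                os.close(fd)
                os.unlink(path)
            result["ok"] = True
        except OSError:
            # An explicit I/O error (or missing directory) counts as failure.
            result["ok"] = False
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    if not done.wait(timeout):
        return False  # the I/O never came back: treat as a lost disk
    return result["ok"]


def report(message, console="/dev/console"):
    """Write `message` to the console device, falling back to stderr
    if the console cannot be opened (e.g. insufficient privileges)."""
    try:
        with open(console, "w") as out:
            out.write(message + "\n")
    except OSError:
        sys.stderr.write(message + "\n")
```

A caller would run something like 
`if not probe_disk("/var/tmp"): report("disk probe failed or timed out")` 
on a timer.  Reporting to the console rather than to a log file is the 
point here, since a log on the missing disk would never be written.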

I'll see if there are any diag programs for the controller and I'll go 
ahead and swap the controller out.  I wonder if the RAID configuration 
is stored in the controller or on the disks?  I'd hate to have to 
rebuild the server install...

Thanks for the info.
Matthew



