From: Matthew Hagerty
Organization: Digital Stratum
Reply-To: matthew@digitalstratum.com
To: Alex Zbyslaw
Cc: freebsd-hackers@freebsd.org
Date: Thu, 13 Apr 2006 15:14:51 -0400
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID: <443EA32B.408@digitalstratum.com>
In-Reply-To: <443E9C38.709@dial.pipex.com>
References: <443E95C1.4030404@digitalstratum.com> <443E9C38.709@dial.pipex.com>
List-Id: Technical Discussions relating to FreeBSD

Alex Zbyslaw wrote:
> Matthew Hagerty wrote:
>
>> Can anyone shed some light on this, give me some options to try?
>> What happened to kernel panics and such when there were serious
>> errors going on?  The only glimmer of information I have is that
>> *one* time there was an error on the console about there not being
>> any RAID controller available.  I did purchase a spare controller and
>> I'm about to swap it out and see if it helps, but for some reason I
>> doubt it.  If a controller like that was failing, I would certainly
>> hope to see some serious error messages or panics going on.
>>
>> I have been running FreeBSD since version 1.01 and have never had a
>> box so unstable in the last 12 or so years, especially one that is
>> supposed to be "server" quality instead of the make-shift ones I put
>> together with desktop hardware.  And last, I'm getting sick of my
>> Linux admin friends telling me "told you so!  should have run
>> Linux...", please give me something to stick in their pie holes!
>
> Several times now I have had Linux servers (and production quality
> ones, not built by me ones :-)) die in a somewhat similar fashion.  In
> every case the cause has been either a flaky disk or a flaky disk
> controller, or some combination.
>
> What seems to happen is that the disk is entirely "lost" by the OS.
> At that point any process which never accesses the disk (i.e. is
> already in memory) is able to run but the moment any process tries to
> access the disk it locks up.  So you can't ssh in to the server, but
> if you happen to be logged in, your shell is probably cached and keeps
> working.  If you typed ls recently, you can run ls (but see nothing or
> get a cryptic error message like I/O Error), for example.
>
> Clearly nothing is logged as the disk has gone AWOL.  Often the
> machines behaved fine after a reboot and then did the same some time
> later.  In one case, the supposedly transparent "RAID-1" array was
> completely broken, but Linux logged precisely nothing to tell you :-(
> You can stick that where you like in your Linux friends :-O
>
> This somewhat fits with your symptoms.  If the disk vanished, then all
> those postgres processes would probably fail unless everything they
> needed happened to be cached in RAM.  The Web server and PHP scripts
> probably are cached in RAM if they are called frequently so you might
> well see lots of postgres processes stacked up.
>
> LSI MegaRAID has a CLI of sorts in sysutils/megarc.  You might start
> with that (and check the RAID BIOS next time the machine reboots).
>
> I'd say that if you have an alternative RAID controller that would be
> a good place to start.  If LSI do any standalone diagnostics, you could
> try those.
>
> --Alex
>
> PS Kernels usually panic when some internal state is just too wrong
> to continue.  A disk or even a controller disappearing isn't going to
> make the internal state wrong - it's just a device gone missing - so I
> would not be surprised if the machine just locked up.
>

Hmm, it seems odd that a disk controller just vanishing would not
cause some sort of console message.  Even if the disk device is gone,
/dev/console should still be intact to display an error, no?  Also, a
disk device that is suddenly missing seems pretty serious to me, since
a disk is one of the main devices that modern OSes cannot run without
(generally speaking).  I would think *some* console message should be
warranted.

I'll see if there are any diag programs for the controller and I'll go
ahead and swap the controller out.  I wonder if the RAID configuration
is stored in the controller or on the disks?  I'd hate to have to
rebuild the server install...

Thanks for the info.

Matthew
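The behaviour discussed in this thread (processes already in memory keep running, anything that touches the vanished disk blocks, and nothing reaches the logs because the logs live on that disk) suggests a user-space check that reports somewhere other than the disk. What follows is only a rough sketch under stated assumptions, not something FreeBSD or the MegaRAID driver provides: the names `probe_disk` and `watch`, the `.disk_probe` file name, and the interval and timeout values are all made up for illustration. The probe runs in a separate thread so that a hung write cannot hang the watcher itself.

```python
import os
import sys
import threading
import time


def _write_probe(path):
    """Write a few bytes under `path` and force them out with fsync."""
    fname = os.path.join(path, ".disk_probe")  # hypothetical file name
    fd = os.open(fname, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"ping")
        os.fsync(fd)  # blocks until the device acknowledges the write
    finally:
        os.close(fd)


def probe_disk(path, timeout, write_probe=None):
    """Return True only if the probe finishes without error within `timeout` seconds."""
    probe = write_probe or _write_probe
    finished = threading.Event()
    result = {"ok": False}

    def worker():
        try:
            probe(path)
            result["ok"] = True
        except OSError:
            pass  # an I/O error counts as a failed probe
        finally:
            finished.set()

    # Daemon thread: if the write never returns, the thread stays blocked
    # but does not keep the watcher from running or exiting.
    threading.Thread(target=worker, daemon=True).start()
    return finished.wait(timeout) and result["ok"]


def watch(path, interval=30.0, timeout=10.0, out=sys.stderr):
    """Probe `path` forever and complain on `out` whenever a probe fails."""
    while True:
        if not probe_disk(path, timeout):
            out.write("disk probe on %s failed or timed out\n" % path)
            out.flush()
        time.sleep(interval)
```

To get the message onto the console rather than into a log file on the suspect disk, one could run this as root and pass `out=open("/dev/console", "w")` to `watch`; that assumes the console device is still writable when the disk is not. Each probe that never returns leaves one blocked thread behind, so this is a diagnostic aid for a box that is about to be rebooted anyway, not a long-term monitor.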