Date: Thu, 13 Apr 2006 15:14:51 -0400
From: Matthew Hagerty <matthew@digitalstratum.com>
To: Alex Zbyslaw <xfb52@dial.pipex.com>
Cc: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID: <443EA32B.408@digitalstratum.com>
In-Reply-To: <443E9C38.709@dial.pipex.com>
References: <443E95C1.4030404@digitalstratum.com> <443E9C38.709@dial.pipex.com>

Alex Zbyslaw wrote:
> Matthew Hagerty wrote:
>
>> Can anyone shed some light on this, give me some options to try?
>> What happened to kernel panics and such when there were serious
>> errors going on? The only glimmer of information I have is that
>> *one* time there was an error on the console about there not being
>> any RAID controller available. I did purchase a spare controller and
>> I'm about to swap it out and see if it helps, but for some reason I
>> doubt it. If a controller like that was failing, I would certainly
>> hope to see some serious error messages or panics going on.
>>
>> I have been running FreeBSD since version 1.01 and have never had a
>> box so unstable in the last 12 or so years, especially one that is
>> supposed to be "server" quality instead of the make-shift ones I put
>> together with desktop hardware. And last, I'm getting sick of my
>> Linux admin friends telling me "told you so! should have run
>> Linux...", please give me something to stick in their pie holes!
>
> Several times now I have had Linux servers (and production quality
> ones, not built by me ones :-)) die in a somewhat similar fashion. In
> every case the cause has been either a flaky disk or a flaky disk
> controller, or some combination.
>
> What seems to happen is that the disk is entirely "lost" by the OS.
> At that point any process which never accesses the disk (i.e. is
> already in memory) is able to run but the moment any process tries to
> access the disk it locks up. So you can't ssh in to the server, but
> if you happen to be logged in, your shell is probably cached and keeps
> working. If you typed ls recently, you can run ls (but see nothing or
> get a cryptic error message like I/O Error), for example.
>
> Clearly nothing is logged as the disk has gone AWOL. Often the
> machines behaved fine after a reboot and then did the same some time
> later. In one case, the supposedly transparent "RAID-1" array was
> completely broken, but Linux logged precisely nothing to tell you :-(
> You can stick that where you like in your Linux friends :-O
>
> This somewhat fits with your symptoms. If the disk vanished, then all
> those postgres processes would probably fail unless everything they
> needed happened to be cached in RAM. The Web server and PHP scripts
> probably are cached in RAM if they are called frequently so you might
> well see lots of postgres processes stacked up.
>
> LSI MegaRAID has a CLI of sorts in sysutils/megarc. You might start
> with that (and check the RAID BIOS next time the machine reboots).
>
> I'd say that if you have an alternative RAID controller that would be
> a good place to start. If LSI do any standalone diagnostics, you could
> try those.
>
> --Alex
>
> PS Kernels usually panic when some internal state is just too wrong
> to continue. A disk or even a controller disappearing isn't going to
> make the internal state wrong - it's just a device gone missing - so I
> would not be surprised if the machine just locked up.
>

Hmm, it just seems odd that a disk controller vanishing would not
cause some sort of console message. Even if the disk device is gone,
/dev/console should still be intact to display an error, no? Also, a
disk device that is all of a sudden missing seems pretty serious to me,
since a disk is one of the main devices that modern OSes cannot run
without (generally speaking). I would think *some* console message
should be warranted.

I'll see if there are any diag programs for the controller and I'll go
ahead and swap the controller out. I wonder if the RAID configuration
is stored in the controller or on the disks? I'd hate to have to
rebuild the server install...

Thanks for the info.

Matthew
