From: Alex Zbyslaw
Date: Thu, 13 Apr 2006 19:45:12 +0100
To: matthew@digitalstratum.com
Cc: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID: <443E9C38.709@dial.pipex.com>
In-Reply-To: <443E95C1.4030404@digitalstratum.com>

Matthew Hagerty wrote:

> Can anyone shed some light on this, give me some options to try?  What
> happened to kernel panics and such when there were serious errors
> going on?
> The only glimmer of information I have is that *one* time
> there was an error on the console about there not being any RAID
> controller available.  I did purchase a spare controller and I'm about
> to swap it out and see if it helps, but for some reason I doubt it.
> If a controller like that was failing, I would certainly hope to see
> some serious error messages or panics going on.
>
> I have been running FreeBSD since version 1.01 and have never had a
> box so unstable in the last 12 or so years, especially one that is
> supposed to be "server" quality instead of the make-shift ones I put
> together with desktop hardware.  And last, I'm getting sick of my
> Linux admin friends telling me "told you so!  should have run
> Linux...", please give me something to stick in their pie holes!

Several times now I have had Linux servers (and production quality
ones, not built-by-me ones :-)) die in a somewhat similar fashion.  In
every case the cause has been either a flaky disk or a flaky disk
controller, or some combination.

What seems to happen is that the disk is entirely "lost" by the OS.  At
that point any process which never accesses the disk (i.e. is already
in memory) is able to run, but the moment any process tries to access
the disk it locks up.  So you can't ssh in to the server, but if you
happen to be logged in, your shell is probably cached and keeps
working.  If you typed ls recently, you can run ls (but see nothing, or
get a cryptic error message like I/O Error), for example.  Clearly
nothing is logged, as the disk has gone AWOL.

Often the machines behaved fine after a reboot and then did the same
some time later.  In one case, the supposedly transparent "RAID-1"
array was completely broken, but Linux logged precisely nothing to tell
you :-(  You can stick that where you like in your Linux friends :-O

This somewhat fits with your symptoms.
If the disk vanished, then all those postgres processes would probably
fail unless everything they needed happened to be cached in RAM.  The
Web server and PHP scripts probably are cached in RAM if they are
called frequently, so you might well see lots of postgres processes
stacked up.

LSI MegaRAID has a CLI of sorts in sysutils/megarc.  You might start
with that (and check the RAID BIOS next time the machine reboots).  I'd
say that if you have an alternative RAID controller, that would be a
good place to start.  If LSI do any standalone diagnostics, you could
try those.

--Alex

PS Kernels usually panic when some internal state is just too wrong to
continue.  A disk or even a controller disappearing isn't going to make
the internal state wrong - it's just a device gone missing - so I would
not be surprised if the machine just locked up.
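
The failure mode described above (disk I/O hangs, in-memory processes
keep running, nothing reaches the local logs) might be caught by a small
watchdog that is already resident in memory, pokes the disk with a time
limit, and reports over the network instead of to a local file.  What
follows is only a sketch under assumptions that are not in the message:
the names probe_disk, report and watch are made up for illustration, the
log host is a hypothetical remote syslog listener on UDP port 514, and a
thread stuck in a hung disk call may still stop the process from exiting
cleanly.

```python
import os
import socket
import tempfile
import threading
import time


def probe_disk(directory, timeout=5.0):
    """Try a small write+fsync in `directory`.

    Returns "ok", "error: <reason>", or "timeout" if the I/O did not
    finish within `timeout` seconds (the "disk has gone AWOL" case).
    """
    result = {}

    def worker():
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"probe")
                os.fsync(fd)  # push the write down to the device
            finally:
                os.close(fd)
                os.unlink(path)
            result["status"] = "ok"
        except OSError as exc:
            result["status"] = "error: %s" % exc

    # The I/O runs in a daemon thread so the caller can give up on it.
    # A thread blocked on a vanished disk stays blocked; we just stop waiting.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    return result.get("status", "timeout")


def report(status, loghost, port=514):
    """Send one BSD-syslog-style UDP datagram; no local disk is touched."""
    # <11> = facility "user" (1) * 8 + severity "err" (3)
    message = ("<11>diskprobe: %s" % status).encode("ascii", "replace")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, (loghost, port))


def watch(directory, loghost, interval=60.0):
    """Probe forever and report anything other than "ok".

    Assumption: `loghost` is a numeric IP address, so that no name
    lookup has to read files from the possibly missing disk.
    """
    while True:
        status = probe_disk(directory)
        if status != "ok":
            report("%s on %s" % (status, directory), loghost)
        time.sleep(interval)
```

Something like watch("/var/tmp", "192.0.2.10") would be started at boot;
both the path and the address are placeholders.  All imports happen up
front so that nothing needs to be loaded from disk after the trouble
starts, which follows the same reasoning as the cached shell above.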