Date: Thu, 13 Apr 2006 12:15:39 -0700
From: Julian Elischer <julian@elischer.org>
To: matthew@digitalstratum.com
Cc: freebsd-hackers@freebsd.org
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID: <443EA35B.4030909@elischer.org>
In-Reply-To: <443EA113.10205@digitalstratum.com>
References: <443E95C1.4030404@digitalstratum.com> <200604131436.17942.jhb@freebsd.org> <443EA113.10205@digitalstratum.com>

Matthew Hagerty wrote:

> John Baldwin wrote:
>
>> On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
>>
>>> Greetings,
>>>
>>> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon
>>> Intel motherboard with a LSILogic MegaRAID (amr0) controller.  This
>>> machine has been running for about 2 years now, and was very stable
>>> until I updated from 5.3 to 5.4, and now 6.0.  The crashing seems to
>>> be totally random and I have had it crash in as little as 12 hours
>>> and as long as 143 days.
>>>
>>> When the box goes down it does so in a strange way.  First, it still
>>> responds to network probes like ping (usually), however, all console
>>> access is ignored.  Also, some network ports still respond, like a
>>> telnet to port 22 to test SSH will yield an SSH banner, but trying
>>> to connect with SSH just hangs.  Sometimes this is also true of the
>>> SMTP server, but not always.  This also makes it impossible for me
>>> to use CARP to swap to the recently purchased spare machine, since
>>> the network interface is generally still responding so CARP does not
>>> detect a problem.
>>>
>>> My biggest problem with this is that there are *never* any console
>>> messages or log entries in any logs, no warnings about disk failure,
>>> buffer exhaustion, system failures, etc..  The machine simply seems
>>> to stop responding and the only way to correct the problem is a hard
>>> reboot.
>>>
>>> A strange thing did happen yesterday though, I believe I caught the
>>> box on the verge of failure.  I was SSH'd in and did a ps to check
>>> things out.  There were about 100 of these entries:
>>>
>>> 55050  ??  D  0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
>>>
>>> The box runs a web-based app and connects to a local Postgres DB
>>> which seemed to be unable to start new connections being requested
>>> by the PHP scripts.
>>> At any rate, I stopped Apache and then tried to
>>> stop Postgres which resulted in (or just happened to coincide with)
>>> the box locking up and no longer responding to my SSH commands or
>>> attempts to reconnect with SSH.  I hardly think this is a Postgres
>>> problem, but even if it was, a userland app should *not* be able to
>>> bring down a box...
>>>
>>> Can anyone shed some light on this, give me some options to try?
>>> What happened to kernel panics and such when there were serious
>>> errors going on?  The only glimmer of information I have is that
>>> *one* time there was an error on the console about there not being
>>> any RAID controller available.  I did purchase a spare controller
>>> and I'm about to swap it out and see if it helps, but for some
>>> reason I doubt it.  If a controller like that was failing, I would
>>> certainly hope to see some serious error messages or panics going on.
>>>
>>> I have been running FreeBSD since version 1.01 and have never had a
>>> box so unstable in the last 12 or so years, especially one that is
>>> supposed to be "server" quality instead of the make-shift ones I put
>>> together with desktop hardware.  And last, I'm getting sick of my
>>> Linux admin friends telling me "told you so! should have run
>>> Linux...", please give me something to stick in their pie holes!
>>
>> It sounds like a livelock (or deadlock) more than a crash.  Can you add
>> 'DDB' in your kernel config and break into the debugger when it hangs
>> and grab the output of 'ps'?
>
> I can probably figure out how to compile in DDB (I've never done it
> before though), but just two questions:

Add "options DDB" to your kernel config file.

> 1. How do I break into DDB and grab the ps output?

On the console, hit the <CTRL><ALT><ESC> keys (at once). That should put
you into the debugger. Then 'ps' will give you some output.
It's a lot to write down, but I've found a camera phone makes good enough
snapshots :-)

Alternatively you can use a serial console, but getting into the debugger
is harder: you have to have compiled ALT_BREAK_TO_DEBUGGER into your
kernel by adding

# Solaris implements a new BREAK which is initiated by a character
# sequence CR ~ ^b which is similar to a familiar pattern used on
# Sun servers by the Remote Console.
options 	ALT_BREAK_TO_DEBUGGER

to the kernel config file you are using.

At the boot prompt (where the 10 second delay is) type

set console="comconsole"

(from memory) to make the serial port the console. Then you can do console
stuff from another window/machine and capture the output easily.

> 2. How can I login if the box is not responding to SSH or the
> console?  It was only by sheer luck that I caught it yesterday just
> before the lockup, I have never been able to do that before.
>
> Thanks,
> Matthew
>
> _______________________________________________
> freebsd-hackers@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
> To unsubscribe, send any mail to
> "freebsd-hackers-unsubscribe@freebsd.org"
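
For reference, the debugger settings mentioned in the reply above could be collected roughly as follows. This is only a sketch and has not been tested on the machine in question. The KDB line is an assumption that does not appear in the message (on 5.x/6.x kernels DDB is normally built on top of the KDB framework), and the loader.conf line is likewise an assumption, offered as a persistent alternative to typing the "set console" command at every boot.

```
# --- additions to the kernel config file ---
options 	KDB			# assumption: debugger framework that DDB plugs into
options 	DDB			# interactive in-kernel debugger (from the message)
options 	ALT_BREAK_TO_DEBUGGER	# CR ~ ^b on a serial console enters the debugger

# --- at the loader prompt, for one boot only (from the message, "from memory") ---
set console="comconsole"
boot

# --- or in /boot/loader.conf, to keep the serial console across reboots (assumption) ---
console="comconsole"
```

After rebuilding and installing the kernel, either <CTRL><ALT><ESC> on the video console or the CR ~ ^b sequence on the serial console should drop into the debugger, where 'ps' lists the processes and what each is waiting on.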