Date: Thu, 13 Apr 2006 14:36:16 -0400
From: John Baldwin <jhb@freebsd.org>
To: freebsd-hackers@freebsd.org, matthew@digitalstratum.com
Subject: Re: FreeBSD Crash without Errors, Warnings, or Panics
Message-ID: <200604131436.17942.jhb@freebsd.org>
In-Reply-To: <443E95C1.4030404@digitalstratum.com>
References: <443E95C1.4030404@digitalstratum.com>

On Thursday 13 April 2006 14:17, Matthew Hagerty wrote:
> Greetings,
>
> I'm running 6.0-RELEASE-p5 on a Toshiba built server: dual Xeon Intel
> motherboard with a LSILogic MegaRAID (amr0) controller.  This machine
> has been running for about 2 years now, and was very stable until I
> updated from 5.3 to 5.4, and now 6.0.  The crashing seems to be totally
> random and I have had it crash in as little as 12 hours and as long as
> 143 days.
>
> When the box goes down it does so in a strange way.  First, it still
> responds to network probes like ping (usually), however, all console
> access is ignored.  Also, some network ports still respond, like a
> telnet to port 22 to test SSH will yield an SSH banner, but trying to
> connect with SSH just hangs.  Sometimes this is also true of the SMTP
> server, but not always.  This also makes it impossible for me to use
> CARP to swap to the recently purchased spare machine, since the network
> interface is generally still responding so CARP does not detect a problem.
>
> My biggest problem with this is that there are *never* any console
> messages or log entries in any logs, no warnings about disk failure,
> buffer exhaustion, system failures, etc..  The machine simply seems to
> stop responding and the only way to correct the problem is a hard reboot.
>
> A strange thing did happen yesterday though, I believe I caught the box
> on the verge of failure.  I was SSH'd in and did a ps to check things
> out.  There were about 100 of these entries:
>
> 55050 ?? D 0:00.00 postmaster: ipa ipa ::1(63061) startup (postgres)
>
> The box runs a web-based app and connects to a local Postgres DB which
> seemed to be unable to start new connections being requested by the PHP
> scripts.  At any rate, I stopped Apache and then tried to stop Postgres
> which resulted in (or just happened to coincide with) the box locking up
> and no longer responding to my SSH commands or attempts to reconnect
> with SSH.  I hardly think this is a Postgres problem, but even if it
> was, a userland app should *not* be able to bring down a box...
>
> Can anyone shed some light on this, give me some options to try?  What
> happened to kernel panics and such when there were serious errors going
> on?  The only glimmer of information I have is that *one* time there was
> an error on the console about there not being any RAID controller
> available.  I did purchase a spare controller and I'm about to swap it
> out and see if it helps, but for some reason I doubt it.  If a
> controller like that was failing, I would certainly hope to see some
> serious error messages or panics going on.
>
> I have been running FreeBSD since version 1.01 and have never had a box
> so unstable in the last 12 or so years, especially one that is supposed
> to be "server" quality instead of the make-shift ones I put together
> with desktop hardware.  And last, I'm getting sick of my Linux admin
> friends telling me "told you so!  should have run Linux...", please give
> me something to stick in their pie holes!

It sounds like a livelock (or deadlock) more than a crash.  Can you add
'DDB' in your kernel config and break into the debugger when it hangs
and grab the output of 'ps'?

-- 
John Baldwin <jhb@FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org
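
For reference, a minimal sketch of what the suggested kernel config change
might look like on a 6.x system.  Only 'DDB' is named in the message above;
the KDB and BREAK_TO_DEBUGGER lines, the console key sequence, and the
sysctl are assumptions based on the 6.x-era debugger framework and should
be checked against /usr/src/sys/conf/NOTES for the release in use.

```
# Additions to the custom kernel config file (assumed, not from the message):
options         KDB                 # kernel debugger framework (assumed required by DDB on 6.x)
options         DDB                 # the interactive in-kernel debugger named in the reply
options         BREAK_TO_DEBUGGER   # assumed: lets a serial-console BREAK enter the debugger

# After rebuilding and rebooting, when the box hangs:
#   - on the syscons console, Ctrl+Alt+Esc should drop to the "db>" prompt
#     (assumed default key binding), or
#   - from a still-working shell: sysctl debug.kdb.enter=1
# Then at the prompt:
#   db> ps
```

The 'ps' output from the debugger shows each process's state and wait
channel, which is what would help distinguish a deadlock from a crash.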