Date:      Thu, 29 Mar 2012 13:24:30 -0400
From:      Jerry <jerry@seibercom.net>
To:        FreeBSD <freebsd-questions@freebsd.org>
Subject:   Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
Message-ID:  <20120329132430.13dc08e7@scorpio>
In-Reply-To: <4F749141.8010109@gmail.com>
References:  <op.wbwe9s0k34t2sn@tech304> <4F749141.8010109@gmail.com>

On Thu, 29 Mar 2012 11:43:45 -0500
Jim Bryant articulated:

> Mark Felder wrote:
> > Alright guys, I'm at the end of my rope here. For those that
> > haven't seen my previous emails here's the (not so) quick breakdown:
> >
> > Overview:
> >
> > FreeBSD ?? - 7.4 never crashes
> > FreeBSD 8.0 - 8.2 crashes
> > FreeBSD 8-STABLE, 8.3, and 9.0 are untested (Sorry, not possible in 
> > our production at this time, and we were hoping we could base some 
> > stuff on 8.3 for long term stability...)
> > ESXi: Confirmed ESXi 4.0 - 5.0 has this problem. Haven't tested on 
> > others.
> >
> >
> > History:
> >
> > Over the course of the last 2 years we've been banging our heads on 
> > the wall. VMWare is done debugging this. They claim it's not a
> > VMWare issue. They can't identify what the heck happens. We had a
> > glimmer of hope with ESXi 5.0 fixing it because we never saw any
> > crashes in the handful of deployments, but our dreams were crushed
> > today -- two days before an outage to begin migration to ESXi 5.0
> > -- when a customer's ESXi 5.0 server and FreeBSD 8.2 guest crashed.
> >
> >
> > Crash Details:
> >
> > The keyboard/mouse usually stops responding for input on the
> > console; normally we can't type in a username or password. However,
> > we can switch VTs.
> >
> > If there's a shell on the console and we can type, we can only run 
> > things in memory. Any time we try to access the disk it will hang 
> > indefinitely.
> >
> > The server still has network access. We can ping it without issue.
> > SSH of course kicks you out because it can't do any I/O.
> >
> > If we were to serve a lightweight http server off a memory backed 
> > filesystem I'm confident it would run just fine as long as it
> > wasn't logging or anything.
> >
> > On ESXi you see that there is a CPU spike of 100% that goes on 
> > indefinitely. No idea what the FreeBSD OS itself thinks it is doing 
> > because we can't run top during the crash.
> >
> > This crash can affect a server and happen multiple times a week. It 
> > can also not show up for 180 days or more. But it does happen. The 
> > server can be 100% idle and crash. We have servers that do more I/O 
> > than the ones that crash could ever attempt to do and these don't 
> > crash at all. Completely inexplicable.
> >
> >
> > Things we've looked into:
> >
> > Nothing about the installed software matters. We've tried cross 
> > referencing the crashed servers by the programs they run but the
> > base OS is the only common denominator due to the wide variety of
> > servers it has affected.
> >
> > Storage doesn't matter. We've tried different iSCSI SANs, we've
> > tried different switches, we've tried local datastores on the ESXi
> > servers themselves.
> >
> > HP servers, Dell servers -- doesn't seem to matter either. (All
> > with latest firmwares, BIOSes, etc)
> >
> > VMWare gave us a ton of debugging tasks, and we've given them 
> > gigabytes of debugging info and data; they can't find anything.
> >
> > VMWare tools -- with, without, using open-vm-tools makes no 
> > difference. I think we've done a fair job ruling out VMWare.
> >
> >
> > I think we've finally found enough data that this is definitely 
> > something in the FreeBSD world. I'm going to begin prepping some of 
> > the known crashy servers with more debugging. Any suggestions on
> > what I should build the kernel with? They never do a proper panic,
> > but I definitely want to at least *try* to get into the debugger
> > the next time it crashes. And when it crashes, what the heck should
> > I be running? I've never played with the KDB before...
> >
> >
> > Thank you for any suggestions and help you can give me....
> 
> This sounds just like a race condition that happens under Windows 7
> on this laptop.  The race condition, as far as I can tell involves
> heavy disk access and heavy network access, and usually leaves the
> drive light on, while all activity monitors (alldisk, allcpu,
> allnetwork) are still active, although on this laptop disk takes
> priority, and network slows to a crawl.  Occasionally, the mouse will
> stop working, along with everything else, but usually not.  Keyboard
> is lower priority, and doesn't do anything.
> 
> You might want to check with mickeysoft, this might just be their 
> problem.  This sounds so freaking similar to the issue I get, and I 
> think it's a race condition (shared interrupts??).
> 
> This laptop is a Compaq Presario C300 series, with the 945GM chipset
> and a T7600 Core2 Duo CPU, with 3G of RAM.

{TOP POSTING CORRECTED}

I just started reading this thread, but I am wondering if I missed
something here. What does this have to do with "Windows 7"?

-- 
Jerry ♔

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the Reply-To header.