Date: Thu, 29 Mar 2012 13:24:30 -0400
From: Jerry <jerry@seibercom.net>
To: FreeBSD <freebsd-questions@freebsd.org>
Subject: Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
Message-ID: <20120329132430.13dc08e7@scorpio>
In-Reply-To: <4F749141.8010109@gmail.com>
References: <op.wbwe9s0k34t2sn@tech304> <4F749141.8010109@gmail.com>
On Thu, 29 Mar 2012 11:43:45 -0500
Jim Bryant articulated:

> Mark Felder wrote:
> > Alright guys, I'm at the end of my rope here. For those that
> > haven't seen my previous emails here's the (not so) quick breakdown:
> >
> > Overview:
> >
> > FreeBSD ?? - 7.4 never crash
> > FreeBSD 8.0 - 8.2 crashes
> > FreeBSD 8-STABLE, 8.3, and 9.0 are untested (Sorry, not possible in
> > our production at this time, and we were hoping we could base some
> > stuff on 8.3 for long term stability...)
> > ESXi: Confirmed ESXi 4.0 - 5.0 has this problem. Haven't tested on
> > others.
> >
> >
> > History:
> >
> > Over the course of the last 2 years we've been banging our heads on
> > the wall. VMWare is done debugging this. They claim it's not a
> > VMWare issue. They can't identify what the heck happens. We had a
> > glimmer of hope with ESXi 5.0 fixing it because we never saw any
> > crashes in the handful of deployments, but our dreams were crushed
> > today -- two days before an outage to begin migration to ESXi 5.0
> > -- when a customer's ESXi 5.0 server and FreeBSD 8.2 guest crashed.
> >
> >
> > Crash Details:
> >
> > The keyboard/mouse usually stops responding for input on the
> > console; normally we can't type in a username or password. However,
> > we can switch VTs.
> >
> > If there's a shell on the console and we can type, we can only run
> > things in memory. Any time we try to access the disk it will hang
> > indefinitely.
> >
> > The server still has network access. We can ping it without issue.
> > SSH of course kicks you out because it can't do any I/O.
> >
> > If we were to serve a lightweight http server off a memory backed
> > filesystem I'm confident it would run just fine as long as it
> > wasn't logging or anything.
> >
> > On ESXi you see that there is a CPU spike of 100% that goes on
> > indefinitely. No idea what the FreeBSD OS itself thinks it is doing
> > because we can't run top during the crash.
> >
> > This crash can affect a server and happen multiple times a week. It
> > can also not show up for 180 days or more. But it does happen. The
> > server can be 100% idle and crash. We have servers that do more I/O
> > than the ones that crash could ever attempt to do and these don't
> > crash at all. Completely inexplicable.
> >
> >
> > Things we've looked into:
> >
> > Nothing about the installed software matters. We've tried cross
> > referencing the crashed servers by the programs they run but the
> > base OS is the only common denominator due to the wide variety of
> > servers it has affected.
> >
> > Storage doesn't matter. We've tried different iSCSI SANs, we've
> > tried different switches, we've tried local datastores on the ESXi
> > servers themselves.
> >
> > HP servers, Dell servers -- doesn't seem to matter either. (All
> > with latest firmwares, BIOSes, etc)
> >
> > VMWare gave us a ton of debugging tasks, and we've given them
> > gigabytes of debugging info and data; they can't find anything.
> >
> > VMWare tools -- with, without, using open-vm-tools makes no
> > difference. I think we've done a fair job ruling out VMWare.
> >
> >
> > I think we've finally found enough data that this is definitely
> > something in the FreeBSD world. I'm going to begin prepping some of
> > the known crashy servers with more debugging. Any suggestions on
> > what I should build the kernel with? They never do a proper panic,
> > but I definitely want to at least *try* to get into the debugger
> > the next time it crashes. And when it crashes, what the heck should
> > I be running? I've never played with the KDB before...
> >
> >
> > Thank you for any suggestions and help you can give me....
>
> This sounds just like a race condition that happens under Windows 7
> on this laptop.
> The race condition, as far as I can tell involves
> heavy disk access and heavy network access, and usually leaves the
> drive light on, while all activity monitors (alldisk, allcpu,
> allnetwork) are still active, although on this laptop disk takes
> priority, and network slows to a crawl. occasionally, the mouse will
> stop working, along with everything else, but usually not. keyboard
> is lower priority, and doesn't do anything.
>
> You might want to check with mickeysoft, this might just be their
> problem. This sounds so freaking similar to the issue I get, and I
> think it's a race condition (shared interrupts??).
>
> This laptop is a Compaq Presario C300 series, with the 945GM chipset
> and a T7600 Core2 Duo CPU, with 3G of RAM.

{TOP POSTING CORRECTED}

I just started reading this thread, but I am wondering if I missed
something here. What does this have to do with "Windows 7"?

-- 
Jerry ♔

Disclaimer: off-list followups get on-list replies or get ignored.
Please do not ignore the Reply-To header.
__________________________________________________________________
Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120329132430.13dc08e7>