Date:      Wed, 28 Mar 2012 15:59:42 -0500
From:      Mark Felder <feld@feld.me>
To:        freebsd-questions@freebsd.org, freebsd-hackers@freebsd.org
Subject:   Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
Message-ID:  <op.wbwe9s0k34t2sn@tech304>

Alright guys, I'm at the end of my rope here. For those who haven't seen
my previous emails, here's the (not so) quick breakdown:

Overview:

FreeBSD ?? - 7.4: never crashes
FreeBSD 8.0 - 8.2: crashes
FreeBSD 8-STABLE, 8.3, and 9.0 are untested (Sorry, not possible in our  
production at this time, and we were hoping we could base some stuff on  
8.3 for long term stability...)
ESXi: Confirmed ESXi 4.0 - 5.0 has this problem. Haven't tested on others.


History:

Over the course of the last 2 years we've been banging our heads on the  
wall. VMWare is done debugging this. They claim it's not a VMWare issue.  
They can't identify what the heck happens. We had a glimmer of hope with  
ESXi 5.0 fixing it because we never saw any crashes in the handful of  
deployments, but our dreams were crushed today -- two days before an  
outage to begin migration to ESXi 5.0 -- when a customer's ESXi 5.0 server  
and FreeBSD 8.2 guest crashed.


Crash Details:

The keyboard/mouse usually stops responding to input on the console;
normally we can't type in a username or password. However, we can still
switch VTs.

If there's a shell on the console and we can type, we can only run things  
in memory. Any time we try to access the disk it will hang indefinitely.
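
To make that symptom concrete, here is a minimal Python sketch of a probe
that checks whether a small write still reaches the disk within a timeout.
It is purely an illustration: the function name disk_write_responds and the
5-second default are made up for this example, and a probe like this would
have to be running in memory before the hang, since starting anything from
disk would itself block.

```python
import os
import tempfile
import threading


def disk_write_responds(directory, timeout_s=5.0):
    """Return True if a small write + fsync in `directory` finishes in time.

    Hypothetical helper for illustration. The write runs in a daemon thread
    so the caller gets an answer even if the I/O never completes. Any error
    in the probe (for example a missing directory) also yields False.
    """
    done = threading.Event()

    def probe():
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"probe")
            os.fsync(fd)  # force the data out to the underlying disk
        finally:
            os.close(fd)
            os.unlink(path)
        done.set()  # only reached if the write and fsync returned

    threading.Thread(target=probe, daemon=True).start()
    # Event.wait returns True if set() was called before the timeout.
    return done.wait(timeout_s)
```

A False result while the box still answers pings would match what we see
during a crash. Note that a thread stuck in disk wait cannot be cancelled,
so the process itself may not exit cleanly afterwards.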

The server still has network access. We can ping it without issue. SSH of  
course kicks you out because it can't do any I/O.

If we were to run a lightweight HTTP server off a memory-backed
filesystem, I'm confident it would run just fine as long as it wasn't
logging to disk or anything.

On ESXi you see CPU usage spike to 100% and stay there indefinitely. No
idea what the FreeBSD OS itself thinks it is doing, because we can't run
top during the crash.

This crash can hit a server multiple times a week. It can also not show
up for 180 days or more. But it does happen. The server can be 100% idle
and crash. We have servers that do more I/O than the ones that crash could
ever attempt to do, and these don't crash at all. Completely inexplicable.


Things we've looked into:

Nothing about the installed software matters. We've tried cross-referencing
the crashed servers by the programs they run, but the base OS is the only
common denominator because of the wide variety of servers it has
affected.
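
The cross-referencing amounts to intersecting the package lists of every
crashed server. A minimal Python sketch, with entirely hypothetical host
names and package names (not our real inventory), and a made-up helper
called common_packages:

```python
def common_packages(inventory):
    """Return the packages installed on every host in `inventory`.

    `inventory` maps a host name to an iterable of package names.
    """
    package_sets = [set(pkgs) for pkgs in inventory.values()]
    if not package_sets:
        return set()
    return set.intersection(*package_sets)


# Hypothetical data for illustration only.
crashed = {
    "web01": ["apache22", "php5"],
    "db01": ["postgresql90-server"],
    "mail01": ["postfix", "dovecot"],
}
print(common_packages(crashed))  # set()
```

An empty result means no installed package is shared by all of the crashed
servers, which leaves the base OS as the only thing they have in common.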

Storage doesn't matter. We've tried different iSCSI SANs, we've tried  
different switches, we've tried local datastores on the ESXi servers  
themselves.

Hardware doesn't seem to matter either: HP servers, Dell servers, all with
the latest firmware, BIOSes, etc.

VMWare gave us a ton of debugging tasks, and we've given them gigabytes of  
debugging info and data; they can't find anything.

VMWare tools -- with them, without them, or using open-vm-tools instead
makes no difference. I think we've done a fair job ruling out VMWare.


I think we've finally gathered enough data to say that this is definitely
something in the FreeBSD world. I'm going to begin prepping some of the
known crashy servers with more debugging. Any suggestions on what I should
build the kernel with? They never do a proper panic, but I definitely want
to at least *try* to get into the debugger the next time it crashes. And
when it crashes, what the heck should I be running? I've never played with
KDB before...


Thank you for any suggestions and help you can give me....


