From owner-freebsd-hackers@FreeBSD.ORG  Thu Mar 29 18:23:00 2012
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: freebsd-hackers@freebsd.org, freebsd-questions@freebsd.org
References: <201203291549.q2TFnUc7080406@aurora.sol.net>
	<201203291755.36651.hselasky@c2i.net> <op.wbxxb9cz34t2sn@tech304>
	<CAJUyCcNn+8uDrWGJMUD8vmmJKLA0iJjy6bhDSZvGB82X6awAPw@mail.gmail.com>
Date: Thu, 29 Mar 2012 13:22:46 -0500
Mime-Version: 1.0
From: Mark Felder <feld@feld.me>
Message-Id: <op.wbx2n80s34t2sn@tech304>
In-Reply-To: <CAJUyCcNn+8uDrWGJMUD8vmmJKLA0iJjy6bhDSZvGB82X6awAPw@mail.gmail.com>
User-Agent: Opera Mail/11.62 (FreeBSD)
X-SA-Score: -1.5
Cc: alc@freebsd.org, Alan Cox <alan.l.cox@gmail.com>
Subject: Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

On Thu, 29 Mar 2012 11:53:02 -0500, Alan Cox <alan.l.cox@gmail.com> wrote:

>
> Not so long ago, VMware implemented a clever scheme for reducing the
> overhead of virtualized interrupts that must be delivered by at least some
> (if not all) of their emulated storage controllers:
>
> http://static.usenix.org/events/atc11/tech/techAbstracts.html#Ahmad
>
> Perhaps, there is a bad interaction between this scheme and FreeBSD's mpt
> driver.
>
> Alan

If we assume mpt is the culprit, how can I go about diagnosing this more
accurately? Is there something I should be looking for in vmstat -i? Too
many interrupts? Not enough? Rate too high or too low? Or is this much
harder to track down because we're dealing with emulated hardware?

If any BSD devs are interested in access to our environment, I think we
could arrange it. I might even be able to get authorization to give you an
account on the most crash-prone server, which doesn't have any sensitive
customer data on it. At this point I think we'd even be willing to pay
someone to look at a server in this state so that we (and hopefully
others) can benefit, and so that we end up with a more reliable
FreeBSD-on-VMware for everyone.

I know Doug mentioned running newer OS versions, and that is definitely
tempting, but because the crash is not reproducible on demand, it's hard
to prove that an upgrade fixes it without waiting 6 months. We're arguing
internally here over "trust 9.0 fixes it" vs "jump back to 7.4 because we
KNOW it doesn't happen there". Having someone look at this and say "oh,
yes, that's a deficiency in mpt that appears to be fixed in the newer
driver that was MFC'd to 8-STABLE and you'll find in 8.3-RELEASE and
9.0-RELEASE" would be more comforting.

Thanks to everyone for their time on this!