From: Mark Felder
To: freebsd-hackers@freebsd.org, freebsd-questions@freebsd.org
Cc: alc@freebsd.org, Alan Cox
Date: Thu, 29 Mar 2012 13:22:46 -0500
Subject: Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
References: <201203291549.q2TFnUc7080406@aurora.sol.net> <201203291755.36651.hselasky@c2i.net>

On Thu, 29 Mar 2012 11:53:02 -0500, Alan Cox wrote:

> Not so long ago, VMware implemented a clever scheme for reducing the
> overhead of virtualized interrupts that must be delivered by at least
> some (if not all) of their emulated storage controllers:
>
> http://static.usenix.org/events/atc11/tech/techAbstracts.html#Ahmad
>
> Perhaps, there is a bad interaction between this scheme and FreeBSD's mpt
> driver.
>
> Alan

If we assume mpt is the culprit, how can I go about diagnosing this more
accurately? Is there something I should be looking for in vmstat -i? Too
many interrupts? Not enough? Rate too high or too low? Or is this
something that is much harder to track down because we're dealing with
emulated hardware?

If any BSD devs are interested in access to our environment, I think we
could comply. I might even be able to get authorization to give you an
account on the most crash-prone server, which doesn't have any sensitive
customer data on it. I think at this point we'd even be willing to pay
someone to look at a server in this state just so we (and hopefully
others) can benefit... and hopefully we end up with a more reliable
FreeBSD-on-VMware for everyone.

I know Doug mentioned running newer OS versions, and that is definitely
tempting, but because the crash is not 100% reproducible on demand, it's
hard to prove an upgrade fixes it without waiting 6 months. We're fighting
internally here with "trust 9.0 fixes it" vs "jump back to 7.4 because we
KNOW it doesn't happen there". Having someone look at this and say "oh,
yes, that's a deficiency in mpt that appears to be fixed in the newer
driver that was MFC'd to 8-STABLE and you'll find in 8.3-RELEASE and
9.0-RELEASE" would be more comforting.

Thanks to everyone for their time on this!