Date:      Mon, 21 May 2012 14:47:45 -0400
From:      Michael Powell <nightrecon@hotmail.com>
To:        freebsd-questions@freebsd.org
Subject:   Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
Message-ID:  <jpe2kg$bb5$2@dough.gmane.org>
References:  <op.wbwe9s0k34t2sn@tech304> <op.wen3bwws34t2sn@tech304>

Mark Felder wrote:

> OK guys I've been talking with another user who can recreate this crash
> and the last bit of information we've learned seems to be leaning towards
> interrupts/IRQ issues like someone (bz@ perhaps?) suggested.
> 
> I'm still trying to test this myself, but the other user was able to
> recreate my crash pretty much on demand. The fix was to not use the first
> NIC in the VM because it will always share an IRQ with mpt0. Once mpt0 is
> on its own the crash does not seem to be reproducible anymore.
> 
[snip]

I am not anywhere near your level in this subject area. My understanding is
limited and I do not have the in-depth experience. However, please allow me
to possibly add an idea or two.

I am shakedown testing FreeBSD 9 in a VirtualBox VM - so there is definitely 
a degree of 'apples vs oranges' present. VirtualBox (as I am using it) is a 
userland app and not a bare-metal hypervisor. When I set up the VM I chose 
to use the synthetic SAS controller as that would best represent actual 
server hardware in my workplace, along with the corresponding mpt driver in 
the FreeBSD 9 guest.

Please note some of the following for comparative purposes only:

[...]
Event timer "LAPIC" quality 400
ACPI APIC Table: <VBOX   VBOXAPIC>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <VBOX VBOXXSDT> on motherboard
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
Timecounter "HPET" frequency 14318180 Hz quality 950
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
[...]
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.3> port 0xd000-0xd007 
mem 0xf0000000-0xf001ffff irq 19 at device 3.0 on pci0
[...]
mpt0: <LSILogic SAS/SATA Adapter> port 0xd100-0xd1ff mem 
0xf0820000-0xf083ffff,0xf0840000-0xf085ffff irq 22 at device 22.0 on pci0
mpt0: MPI Version=1.5.0.0
[...]

em0 is the first Intel NIC in Vbox. Notice how it and mpt0 come up with
distinctly different IRQs (19 and 22).
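
To make this kind of comparison repeatable between a Vbox guest and a VMWare
guest, the probe lines can be grouped by IRQ. What follows is only a rough
sketch in Python, not a tested tool: the function names and the regular
expression are my own assumptions, and the sample text is the excerpt above
with the mail client's line wrapping undone.

```python
import re
from collections import defaultdict

# Probe lines from the dmesg excerpt above, rejoined so that each device
# sits on a single line (the mail client had wrapped them).
DMESG = """\
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.3> port 0xd000-0xd007 mem 0xf0000000-0xf001ffff irq 19 at device 3.0 on pci0
mpt0: <LSILogic SAS/SATA Adapter> port 0xd100-0xd1ff mem 0xf0820000-0xf083ffff,0xf0840000-0xf085ffff irq 22 at device 22.0 on pci0
mpt0: MPI Version=1.5.0.0
"""

# Assumed line shape: "<name>: <description> ... irq <n> at device ..."
PROBE = re.compile(r"^(\w+): <[^>]*>.*?\birq (\d+) at device", re.MULTILINE)


def devices_by_irq(dmesg_text):
    """Group device names by the IRQ reported in their probe line."""
    table = defaultdict(list)
    for name, irq in PROBE.findall(dmesg_text):
        table[int(irq)].append(name)
    return dict(table)


def shared_irqs(dmesg_text):
    """Return only the IRQs that more than one device reports."""
    return {
        irq: names
        for irq, names in devices_by_irq(dmesg_text).items()
        if len(names) > 1
    }


if __name__ == "__main__":
    for irq, names in sorted(devices_by_irq(DMESG).items()):
        print(f"irq {irq}: {', '.join(names)}")
    print("shared:", shared_irqs(DMESG) or "none")
```

On a live guest the same functions could be fed the contents of
/var/run/dmesg.boot instead of the pasted sample. 'vmstat -i' is another way
to see which devices sit on which interrupt.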

Running 'sysctl -a | grep mpt' returns this (the kern.sched lines match only
because "preemption" contains "mpt"):

device	mpt
kern.sched.preemption: 1
kern.sched.preempt_thresh: 80
dev.mpt.0.%desc: LSILogic SAS/SATA Adapter
dev.mpt.0.%driver: mpt
dev.mpt.0.%location: slot=22 function=0
dev.mpt.0.%pnpinfo: vendor=0x1000 device=0x0054 subvendor=0x1000 
subdevice=0x8000 class=0x010000
dev.mpt.0.%parent: pci0
dev.mpt.0.debug: 3
dev.mpt.0.role: 1

Very curious how 'irq 22 at device 22.0' and 'dev.mpt.0.%location: slot=22' 
all match with a '22'.

The obvious caveat here is that we are comparing a userland Vbox guest to a
VMWare hypervisor. From what little I know concerning any of this, to me it
sounds vaguely like an APIC, LAPIC, and I/O APIC bug. There are known bugs
with respect to a BIOS setting up IRQ routing incorrectly, and/or providing
incorrect ACPI and/or IMS tables to operating systems.

The parallel in this case would be the logical or synthetic so-called "BIOS"
that the VMWare hypervisor presents to the FreeBSD guest at guest boot time.
In this case the truest fix for the problem would fall to VMWare, that is,
if the hypervisor is setting up tables in such a way as to create the shared
IRQ problem in the first place.

All of this assumes my idea/theory/potential hypothesis has any merit. I do
not understand why any of this would be different depending upon which guest
is installed, but I also know absolutely nothing about VMWare hypervisor
internals.

> 
> Is there any other way we can make mpt0 get its own dedicated IRQ without
> having to do this? The problem is that it causes us to have to make
> rc.conf changes, pf.conf changes, and who knows what other software could
> be on these machines that is trying to bind to a specific NIC...
> 

Andrew's device.hints suggestion is probably your best shot at a
workaround.

Wish you the best of luck in any case. You have done quite a job in
researching this problem even to arrive at this point. Thank you for that,
and for sharing it with the community. Even though I can't really offer the
kind of assistance you require, I have followed along with interest for my
own edification.

-Mike



