Date: Mon, 21 May 2012 14:47:45 -0400
From: Michael Powell <nightrecon@hotmail.com>
To: freebsd-questions@freebsd.org
Subject: Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
Message-ID: <jpe2kg$bb5$2@dough.gmane.org>
References: <op.wbwe9s0k34t2sn@tech304> <op.wen3bwws34t2sn@tech304>

Mark Felder wrote:

> OK guys I've been talking with another user who can recreate this crash
> and the last bit of information we've learned seems to be leaning towards
> interrupts/IRQ issues like someone (bz@ perhaps?) suggested.
>
> I'm still trying to test this myself, but the other user was able to
> recreate my crash pretty much on demand. The fix was to not use the first
> NIC in the VM because it will always share an IRQ with mpt0. Once mpt0 is
> on its own the crash does not seem to be reproducible anymore.
>
[snip]

I am not anywhere near your level in this subject area. My understanding is
limited and I do not have the in-depth experience. However, please allow me
to possibly add an idea or two.

I am shakedown testing FreeBSD 9 in a VirtualBox VM - so there is definitely
a degree of 'apples vs oranges' present. VirtualBox (as I am using it) is a
userland app and not a bare-metal hypervisor. When I set up the VM I chose to
use the synthetic SAS controller, as that would best represent actual server
hardware in my workplace, along with the corresponding mpt driver in the
FreeBSD 9 guest.

Please note some of the following for comparative purposes only:

[...]
Event timer "LAPIC" quality 400
ACPI APIC Table: <VBOX VBOXAPIC>
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)
 cpu0 (BSP): APIC ID: 0
 cpu1 (AP): APIC ID: 1
ioapic0 <Version 1.1> irqs 0-23 on motherboard
kbd1 at kbdmux0
acpi0: <VBOX VBOXXSDT> on motherboard
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
Timecounter "HPET" frequency 14318180 Hz quality 950
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
acpi_timer0: <32-bit timer at 3.579545MHz> port 0x4008-0x400b on acpi0
[...]
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.3> port 0xd000-0xd007 mem 0xf0000000-0xf001ffff irq 19 at device 3.0 on pci0
[...]
mpt0: <LSILogic SAS/SATA Adapter> port 0xd100-0xd1ff mem 0xf0820000-0xf083ffff,0xf0840000-0xf085ffff irq 22 at device 22.0 on pci0
mpt0: MPI Version=1.5.0.0
[...]

The em0 is the first Intel NIC in Vbox; notice how it and mpt0 come up with
distinctly different IRQs. A sysctl -a | grep mpt returns this:

device  mpt
kern.sched.preemption: 1
kern.sched.preempt_thresh: 80
dev.mpt.0.%desc: LSILogic SAS/SATA Adapter
dev.mpt.0.%driver: mpt
dev.mpt.0.%location: slot=22 function=0
dev.mpt.0.%pnpinfo: vendor=0x1000 device=0x0054 subvendor=0x1000 subdevice=0x8000 class=0x010000
dev.mpt.0.%parent: pci0
dev.mpt.0.debug: 3
dev.mpt.0.role: 1

Very curious how 'irq 22 at device 22.0' and 'dev.mpt.0.%location: slot=22'
all match with a '22'.

The obvious thing here is we are comparing a userland Vbox guest to a VMWare
hypervisor. From what little I know concerning any of this, to me it sounds
vaguely like an APIC, LAPIC, and I/O APIC bug. There are known bugs with
respect to a BIOS setting up IRQ routing incorrectly, and/or providing
incorrect ACPI and/or IMS tables to operating systems. The parallel in this
case would be the logical or synthetic so-called "BIOS" that the VMWare
hypervisor presents to the FreeBSD guest at guest boot time.

In this case the truest fix for the problem would fall to VMWare, e.g. if the
hypervisor is setting up tables in such a way as to create the shared IRQ
problem in the first place. That is, if my idea/theory/potential hypothesis
has any merit. I do not understand why any of this would be different
depending upon which guest is installed, but I also know absolutely nothing
about VMWare hypervisor internals.

>
> Is there any other way we can make mpt0 get its own dedicated IRQ without
> having to do this? The problem is that it causes us to have to make
> rc.conf changes, pf.conf changes, and who knows what other software could
> be on these machines that is trying to bind to a specific NIC...
>

Andrew's device.hints suggestion is very possibly your best shot at a
workaround. Wish you the best of luck in any case. You have done quite a job
in researching this problem even to arrive at this point. Thank you for that,
and for sharing it with the community. Even though I can't really offer the
kind of assistance you require, I have followed along with interest for my
own edification.

-Mike
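
For anyone wanting to repeat the IRQ comparison above on their own guest, here is a minimal sketch in Python that groups devices by the IRQ reported in boot messages and lists any IRQ claimed by more than one device. The function names and the regular expression are illustrative assumptions, not part of any FreeBSD tool; the sample text is just the two probe lines quoted in this message.

```python
import re
from collections import defaultdict

# Assumed line shape: "<name><unit>: ... irq <n> at device <slot> on pci<bus>",
# as in the em0 and mpt0 probe lines quoted in this message.
IRQ_LINE = re.compile(r"^(\w+?\d+): .*\birq (\d+) at device")


def devices_by_irq(dmesg_text):
    """Return {irq: [device, ...]} for every probe line that reports an IRQ."""
    groups = defaultdict(list)
    for line in dmesg_text.splitlines():
        match = IRQ_LINE.match(line.strip())
        if match:
            groups[int(match.group(2))].append(match.group(1))
    return dict(groups)


def shared_irqs(dmesg_text):
    """Return only the IRQs that more than one device reports."""
    return {irq: devs
            for irq, devs in devices_by_irq(dmesg_text).items()
            if len(devs) > 1}


# The two probe lines from the VirtualBox guest described in this message.
SAMPLE = """\
em0: <Intel(R) PRO/1000 Legacy Network Connection 1.0.3> port 0xd000-0xd007 mem 0xf0000000-0xf001ffff irq 19 at device 3.0 on pci0
mpt0: <LSILogic SAS/SATA Adapter> port 0xd100-0xd1ff mem 0xf0820000-0xf083ffff,0xf0840000-0xf085ffff irq 22 at device 22.0 on pci0
mpt0: MPI Version=1.5.0.0
"""

if __name__ == "__main__":
    print(devices_by_irq(SAMPLE))
    print(shared_irqs(SAMPLE))
```

Feeding it the saved boot messages from the VMWare guest instead of SAMPLE should make a shared em0/mpt0 interrupt show up in shared_irqs(), if the lines have the same shape. On a running system, vmstat -i also lists the interrupt sources directly.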