Date: Sun, 20 Jan 2013 04:30:01 GMT
Message-Id: <201301200430.r0K4U1A4093891@freefall.freebsd.org>
To: freebsd-net@FreeBSD.org
From: John Baldwin
Subject: Re: kern/172113: [panic] [e1000] [patch] 9.1-RC1/amd64 panices in igb(4): m_getjcl: invalid cluster type
List-Id: Networking and TCP/IP with FreeBSD

The following reply was made to PR kern/172113; it has been noted by GNATS.

From: John Baldwin
To: bug-followup@FreeBSD.org, egrosbein@rdtc.ru
Cc: jfv@FreeBSD.org, George Neville-Neil
Date: Sat, 19 Jan 2013 23:26:17 -0500

I was finally able to reproduce this panic today. It seems to require a
server that is configured for PXE but receives no DHCP reply (and possibly
the requisite SuperMicro X8 board).

I was able to prevent the panic with a subset of the referenced patch:
adding only the 'if_drv_flags & IFF_DRV_RUNNING' check to the start of
igb_msix_que(). The rest of the patch was unnecessary.

I also added some debugging to print the ICR, EICR, IMS, and EIMS
registers in this case. It does look like the hardware is raising an
interrupt that is not enabled in the interrupt mask (specifically LSC).
In fact, the 82576 datasheet specifically mentions masking LSC until
initialization is complete to avoid spurious interrupts during boot.
AFAICT igb(4) does this: e1000_reset_hw() clears the interrupt mask via
writes to IMC, and interrupts are not re-enabled until igb_init_locked()
is invoked via 'ifconfig up'. Here is my debug output:

  SMP: AP CPU #6 Launched!
  SMP: AP CPU #4 Launched!
  stray irq0
  igb0: interrupt on que 0: icr 0x1000004 eicr 0 ims 0 eims 0x80000000

Hmmm. Nothing clears EIMS. After some more debugging, I determined that
e1000_reset_hw() always turns this bit in EIMS on, even if it was off
before e1000_reset_hw() was called(!). I added explicit calls to
igb_disable_intr() to clear EIMS after each call to e1000_reset_hw().
This removes the 'stray irq0', but I still get a spurious interrupt
during boot (albeit with eims 0).

I can use the IFF_DRV_RUNNING hack for now, but I think the real fix is
something else.

-- 
John Baldwin