From: Sean Bruno
Reply-To: sbruno@freebsd.org
To: "freebsd-net@freebsd.org"
Date: Fri, 17 Dec 2010 09:27:20 -0800
Message-ID: <1292606841.2657.38.camel@home-yahoo>
Subject: igb(4) OACTIVE logic handling

We're periodically seeing igb(4) hit the OACTIVE handling in igb_start_locked() on FreeBSD 7 with hw.igb.rxd/txd set to 4096, and the machines fall off the network soon after.
The logic that clears OACTIVE in igb_txeof() never seems to fire, and the machine is then reachable only on the serial console.

I've played around with a few settings, namely:

    hw.igb.enable_aim=0
    hw.igb.enable_msix=1
    hw.igb.num_queues=1

These seem to make it better, but we still see the machines become unresponsive.

I've been slowly auditing the (ab)use of igb_txeof() throughout the transmit path, and I'm seeing evidence that we're not handling the THRESHOLD cases correctly.

Can someone who is slightly more savvy than I am cast an eye over the 7.4 prerelease code and see if I'm smoking dope here?

Sean

P.S. This is not a new problem, but one that I've only recently become aware of.
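
For readers following the thread, the set/clear handshake described above can be modeled roughly as below. This is a toy user-space sketch, not the igb(4) driver code: the struct, the function names toy_start() and toy_txeof(), and the TX_THRESHOLD value are all hypothetical. In the real driver the flag is IFF_DRV_OACTIVE in the interface's if_drv_flags. The model assumes one descriptor per packet and a single queue.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE     4096  /* matches hw.igb.txd from the report */
#define TX_THRESHOLD  32    /* hypothetical low-water mark, not the driver's value */

/* Hypothetical stand-in for the per-queue transmit state. */
struct toy_txring {
    int  avail;    /* free transmit descriptors */
    bool oactive;  /* models IFF_DRV_OACTIVE */
};

/*
 * Models the start path: try to queue n packets, one descriptor each.
 * When free descriptors drop to the threshold, set OACTIVE and stop.
 * Returns the number of packets actually queued.
 */
static int toy_start(struct toy_txring *r, int n)
{
    int queued = 0;

    if (r->oactive)
        return 0;               /* stack must not hand us more work */

    while (queued < n) {
        if (r->avail <= TX_THRESHOLD) {
            r->oactive = true;  /* ring is (nearly) full */
            break;
        }
        r->avail--;
        queued++;
    }
    return queued;
}

/*
 * Models the cleanup path: the hardware has completed `done` descriptors.
 * OACTIVE is cleared only once we are back above the same threshold.
 */
static void toy_txeof(struct toy_txring *r, int done)
{
    r->avail += done;
    assert(r->avail <= RING_SIZE);

    if (r->avail > TX_THRESHOLD)
        r->oactive = false;
}
```

In this model, transmit resumes only if toy_txeof() runs after the flag is set and its clear condition can actually be met. If the cleanup path is never invoked once OACTIVE is set, or if the set and clear sides compare against different thresholds, the flag stays set and the start path returns without queuing anything, which is consistent with the stall reported here.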