From: "Jack Vogel"
To: "John Polstra"
Cc: freebsd-net@freebsd.org
Date: Fri, 17 Nov 2006 13:13:13 -0800
Subject: Re: Serious em problems under -current on two different platforms

On 11/17/06, John Polstra wrote:
> Folks, I'm using -current from 2006-11-16 05:00 UTC and find
> that my
> em interfaces are unusable on two quite different platforms.  I've
> tried a lot of things to make sure it's not a local fubar here,
> including doing a "make release" using a virgin source tree and
> installing fresh from the resulting CD (with GENERIC kernel).  I also
> have a netbootable CD image that is part of the project I'm working
> on, and it admittedly has some minor mods to the kernel.  I booted
> that exact same image on two different platforms with em devices in
> them, and got the same results as when I used the virgin FreeBSD CD.
>
> I don't think this is caused by the recent MSI support.  I get the
> same results when I disable it by adding "hw.pci.enable_msi=0" and
> "hw.pci.enable_msix=0" to my /boot/loader.conf file.  (And I confirmed
> that MSI wasn't being used when I did that.)
>
> The symptoms are complicated, so let's focus on one of the machines.
> It's a Dell 1950 with two dual-core 3.0 GHz Xeons in it.  The em
> devices look like this (it's a dual-port PCI-Express card):
>
> em0@pci11:0:0: class=0x020000 card=0x125e8086 chip=0x105e8086 rev=0x04 hdr=0x00
>     vendor   = 'Intel Corporation'
>     device   = 'PRO/1000 PT'
>     class    = network
>     subclass = ethernet
> em1@pci11:0:1: class=0x020000 card=0x125e8086 chip=0x105e8086 rev=0x04 hdr=0x00
>     vendor   = 'Intel Corporation'
>     device   = 'PRO/1000 PT'
>     class    = network
>     subclass = ethernet
>
> Starting with a freshly-booted system, we see this ifconfig output,
> as expected:
>
> em0: flags=8802 mtu 1500
>         options=18b
>         ether 00:0e:0c:6f:0e:18
>         media: Ethernet autoselect (1000baseTX )
>         status: active
> em1: flags=8802 mtu 1500
>         options=18b
>         ether 00:0e:0c:6f:0e:19
>         media: Ethernet autoselect (1000baseTX )
>         status: active
>
> Now I do "ifconfig em0 10.5.1.1/24" and then ping that address from
> another machine on the LAN:
>
> thin# ping 10.5.1.1
> PING 10.5.1.1 (10.5.1.1): 56 data bytes
> 64 bytes from 10.5.1.1: icmp_seq=0 ttl=64 time=0.524 ms
>
> Then nothing after the first reply.
> Leaving the ping running on the
> other machine, I configure the address a 2nd time on the Dell with
> "ifconfig em0 10.5.1.1/24".  Still no response.  Next, ifconfig em0
> down and then up again.  After a few seconds, the ping responses
> start coming in and continue to work.  Try a flood ping from the
> other machine: it works fine.
>
> I kill the flood ping and go have lunch for a half-hour, then start
> up a normal 1-per-second ping from the other machine:
>
> thin# ping 10.5.1.1
> PING 10.5.1.1 (10.5.1.1): 56 data bytes
> 64 bytes from 10.5.1.1: icmp_seq=0 ttl=64 time=0.612 ms
> [then nothing]
>
> This time, I check the vmstat -i output a few times, and see that
> em0 isn't generating any interrupts.  I ifconfig em0 down and then
> up, and the pings start working again.
>
> Now, leaving that 1-per-second ping running, I start messing with
> em1.  I do "ifconfig em1 10.6.1.1/24", and within a few seconds, the
> pings on em0 stop responding.  Again em0 isn't generating
> interrupts.  Pings to em1 aren't working, either.  I ifconfig em1
> down and then up.  The pings still aren't working.  I set em1's
> address again with "ifconfig em1 10.6.1.1/24", and the pings start
> working.  Now I ping em0 from the other machine and find that it
> works, too.  Hallelujah!  Now both interfaces are working at the
> same time.  But what's the key to getting to this point?
>
> I let the pings run for a while.  Pretty soon, both of them stop
> working again.
>
> The other machine is a Tyan 2721 with dual Xeons in it.
> Its
> dual-port NIC is on the motherboard, and it looks like this:
>
> em0@pci7:1:0: class=0x020000 card=0x10118086 chip=0x10108086 rev=0x01 hdr=0x00
>     vendor   = 'Intel Corporation'
>     device   = '82546EB Dual Port Gigabit Ethernet Controller (Copper)'
>     class    = network
>     subclass = ethernet
> em1@pci7:1:1: class=0x020000 card=0x10118086 chip=0x10108086 rev=0x01 hdr=0x00
>     vendor   = 'Intel Corporation'
>     device   = '82546EB Dual Port Gigabit Ethernet Controller (Copper)'
>     class    = network
>     subclass = ethernet
>
> I can't get either port to send any packets at all.  When I try, the
> driver reports transmit watchdog timeouts.
>
> Is this stuff working for anybody at all?

This sounds bizarrely broken. Can you try backing off the deltas to
if_em.[ch] to find a point where it works? I have not been making the
changes in CURRENT, and I am busy with some important Intel tasks that
I must get done, so it would help to know when it broke.

Thanks,

Jack
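
For anyone following along, backing off the deltas until the driver
works again can be done as a bisection rather than one revision at a
time. Below is a minimal Python sketch of that idea, not anything from
the thread itself. It assumes the if_em.[ch] revisions are listed oldest
to newest, that the oldest one works and the newest one is broken, and
that the driver stays broken once it breaks. The function name
first_bad and the works() callback are hypothetical; works() stands in
for "check out that revision, rebuild the kernel, boot, and run the
ping test".

```python
from typing import Callable, Sequence


def first_bad(revisions: Sequence[str], works: Callable[[str], bool]) -> str:
    """Return the earliest revision for which works() is False.

    Assumptions (not checked here): revisions are ordered oldest to
    newest, revisions[0] works, revisions[-1] is broken, and every
    revision after the first broken one is also broken.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if works(revisions[mid]):
            lo = mid   # breakage was introduced after mid
        else:
            hi = mid   # breakage was introduced at or before mid
    return revisions[hi]
```

With N candidate revisions this needs roughly log2(N) rebuild-and-test
cycles. Since the symptoms described above are intermittent, each test
would probably need to run for a while before a revision is called
good.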