From owner-freebsd-net@FreeBSD.ORG  Fri May  4 15:35:02 2012
Return-Path: <owner-freebsd-net@FreeBSD.ORG>
Delivered-To: net@freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [69.147.83.52])
	by hub.freebsd.org (Postfix) with ESMTP id E17CB106566B;
	Fri,  4 May 2012 15:35:01 +0000 (UTC) (envelope-from jhb@freebsd.org)
Received: from bigwig.baldwin.cx (bigknife-pt.tunnel.tserv9.chi1.ipv6.he.net
	[IPv6:2001:470:1f10:75::2])
	by mx1.freebsd.org (Postfix) with ESMTP id AFBB68FC18;
	Fri,  4 May 2012 15:35:01 +0000 (UTC)
Received: from jhbbsd.localnet (unknown [209.249.190.124])
	by bigwig.baldwin.cx (Postfix) with ESMTPSA id 03FBBB997;
	Fri,  4 May 2012 11:35:01 -0400 (EDT)
From: John Baldwin <jhb@freebsd.org>
To: Konstantin Belousov <kostikbel@gmail.com>
Date: Fri, 4 May 2012 11:30:22 -0400
User-Agent: KMail/1.13.5 (FreeBSD/8.2-CBSD-20110714-p13; KDE/4.5.5; amd64; ; )
References: <20120407133715.GU2358@deviant.kiev.zoral.com.ua>
	<20120412183849.GA2358@deviant.kiev.zoral.com.ua>
	<20120501162121.GV2358@deviant.kiev.zoral.com.ua>
In-Reply-To: <20120501162121.GV2358@deviant.kiev.zoral.com.ua>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-15"
Content-Transfer-Encoding: 7bit
Message-Id: <201205041130.22202.jhb@freebsd.org>
X-Greylist: Sender succeeded SMTP AUTH, not delayed by milter-greylist-4.2.7
	(bigwig.baldwin.cx); Fri, 04 May 2012 11:35:01 -0400 (EDT)
Cc: jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject: Re: 82574L hangs (with r233708 e1000 driver).

On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > Make sure you have any firmware up to the latest available, if that
> > > > > > doesn't help let me know and I'll check internally to see if there
> > > > > > are any outstanding issues in shared code, that will be after the
> > > > > > weekend.
> > > > 
> > > > I had BIOS rev. 151; after your hint I found rev. 154 on the site.
> > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > March 23.
> > > > 
> > > > Unfortunately, the upgrade did not change anything with regard to the
> > > > hanging interface.
> > > 
> > > Does reverting 233708 make any difference?  Have you tried futzing
> > > around with kgdb when it is hung to see what state the device is in
> > > (software state at least)?
> > It does, in the sense that without r233708 the interface becomes stuck
> > almost immediately. I just upgraded to the e1000@r234154, which does not
> > change much.
> > 
> > I fiddled with the adapter state after the hang in kgdb more, and I
> > noted something interesting. Apparently, tx works. When I ping the remote
> > host from my suffering atom machine, remote host sees the packet. Also
> > remote machine sees some udp traffic originating from the atom, like
> > ntp queries.
> > 
> > And, on receive, the atom board does receive interrupts, em0:rx 0 counter
> > in vmstat -i increases. Even more fun, the sysctl dev.em.0.debug
> > shows increasing hw rdh (as I understand, this is hardware 'last
> > received' packet pointer for rx ring). So I looked at the packet
> > descriptor at hw rdt index, and there I see
> > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> >   errors = 0x0, special = 0x0}
> > 
> > Apparently, the Descriptor Done bit is clear, so the em_rxeof() function
> > breaks from the loop, not consuming the current packet. Also, it returns
> > false due to DD bit clear. This prevents em_msix_rx() from scheduling
> > taskqueue for processing. So apparent cause for the hang is missing
> > DD bit in descriptor.
> > 
> > I am not sure whether all this is obvious to anybody who knows em
> > internals, or where to go from here.
> 
> Ok, nobody cares.
> 
> Below is the workaround I use to prevent the interface wedging.
> It seems that the sole PCI register read (namely, the rx ring head read)
> and consequent recheck of the descriptor status greatly reduce the
> likelihood of the issue. Unfortunately, the read does not eliminate
> the hang completely. So it is not some PCIe coherency problem.
> 
> With the patch applied, I am able to copy around blu-ray images, while
> previously the interface hung within 20-30 seconds of 100Mbit/s traffic.
> Sometimes the messages are printed:
> em0: Workaround: head 1018 tail 1002 cur 1010
> em0: Workaround: head 976 tail 973 cur 974
> em0: Workaround: head 950 tail 939 cur 946
> em0: Workaround: head 435 tail 419 cur 426
> 
> Machine is still dead due to random memory corruption which I see, in
> particular, pmap sometimes reads garbage from PTEs. I have no idea whether
> it is related to em0 rx descriptor missed writes, or is a different issue.
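
For reference, the head/tail/cur values in those workaround messages can be
read as positions on a circular ring. A minimal sketch of the arithmetic,
assuming a 1024-entry rx ring (my assumption, the ring size is not stated in
this thread) and assuming cur is the driver's next descriptor to check. The
helper name ring_distance is made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of descriptors from 'from' forward to 'to' on a circular ring
 * of ring_size entries.  The 1024 used in the examples is an assumption,
 * not a value taken from the thread.
 */
static uint32_t
ring_distance(uint32_t from, uint32_t to, uint32_t ring_size)
{
	assert(ring_size != 0 && from < ring_size && to < ring_size);
	/* Add ring_size before subtracting so the result cannot go negative. */
	return ((to + ring_size - from) % ring_size);
}
```

Under that assumption, ring_distance(1010, 1018, 1024) is 8 for the first
message, i.e. the hardware head is 8 descriptors past cur, and
ring_distance(974, 976, 1024) is 2 for the second.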

Humm, so if I'm reading this correctly, the card "skips" a receive
descriptor and stores a packet at the next descriptor?  That's just
bizarre.
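
To make sure I understand the failure mode, here is a minimal sketch of the
receive loop as I read your description. It is not the em(4) code: the
struct, the function names and the macro name are made up for illustration,
and only the "stop at the first descriptor without DD" behaviour is taken
from your analysis of em_rxeof().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	RXD_STAT_DD	0x01	/* Descriptor Done bit; sketch-local name */

/* Simplified stand-in for a receive descriptor (hypothetical layout). */
struct rx_desc_sketch {
	uint16_t length;
	uint8_t  status;
};

/*
 * Consume completed descriptors starting at *cur and stop at the first
 * one whose DD bit is clear.  Returns the number consumed.
 */
static uint32_t
rx_consume(struct rx_desc_sketch *ring, uint32_t ring_size, uint32_t *cur)
{
	uint32_t done = 0;

	assert(ring_size != 0 && *cur < ring_size);
	while (done < ring_size && (ring[*cur].status & RXD_STAT_DD) != 0) {
		ring[*cur].status = 0;		/* hand the slot back */
		*cur = (*cur + 1) % ring_size;
		done++;
	}
	return (done);
}

/*
 * True if the descriptor at cur is not marked done although the
 * hardware head has already moved past it, i.e. the apparent "skip".
 */
static bool
rx_looks_skipped(const struct rx_desc_sketch *ring, uint32_t ring_size,
    uint32_t cur, uint32_t head)
{
	assert(ring_size != 0 && cur < ring_size && head < ring_size);
	return ((ring[cur].status & RXD_STAT_DD) == 0 && head != cur);
}
```

In this model a single descriptor that never gets DD set stalls the loop
for good, even when the descriptors after it are complete, which would match
the hang you describe.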

-- 
John Baldwin