From owner-freebsd-net@FreeBSD.ORG Thu Aug 9 15:28:05 2012
Message-ID: <1344525935.85341.YahooMailClassic@web121605.mail.ne1.yahoo.com>
Date: Thu, 9 Aug 2012 08:25:35 -0700 (PDT)
From: Barney Cordoba
To: John Baldwin, Konstantin Belousov
In-Reply-To: <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com>
Cc: jfv@freebsd.org, Jack Vogel, net@freebsd.org
Subject: Re: 82574L hangs (with r233708 e1000 driver).
List-Id: Networking and TCP/IP with FreeBSD

--- On Fri, 5/11/12, Barney Cordoba wrote:

> From: Barney Cordoba
> Subject: Re: 82574L hangs (with r233708 e1000 driver).
> To: "John Baldwin", "Konstantin Belousov"
> Cc: jfv@freebsd.org, "Jack Vogel", net@freebsd.org
> Date: Friday, May 11, 2012, 6:24 PM
>
>
> --- On Tue, 5/8/12, Konstantin Belousov wrote:
>
> > From: Konstantin Belousov
> > Subject: Re: 82574L hangs (with r233708 e1000 driver).
> > To: "John Baldwin"
> > Cc: jfv@freebsd.org, "Jack Vogel", net@freebsd.org
> > Date: Tuesday, May 8, 2012, 4:24 AM
> > On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > > Make sure you have any firmware up to the latest
> > > > > > > > > > available, if that doesn't help let me know and I'll
> > > > > > > > > > check internally to see if there are any outstanding
> > > > > > > > > > issues in shared code, that will be after the weekend.
> > > > > > > > >
> > > > > > > > > I had BIOS rev. 151, after
> > > > > > > > > your hint I found rev. 154 on the site.
> > > > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > > > March 23.
> > > > > > > > >
> > > > > > > > > Unfortunately, the upgrade did not change anything with
> > > > > > > > > regard to the hanging interface.
> > > > > > > >
> > > > > > > > Does reverting 233708 make any difference?  Have you tried
> > > > > > > > futzing around with kgdb when it is hung to see what state
> > > > > > > > the device is in (software state at least)?
> > > > > > > It does, in the sense that without r233708 the interface becomes
> > > > > > > stuck almost immediately. I just upgraded to e1000@r234154,
> > > > > > > which does not change much.
> > > > > > >
> > > > > > > I fiddled with the adapter state in kgdb some more after the
> > > > > > > hang, and I noted something interesting. Apparently, tx works.
> > > > > > > When I ping the remote host from my suffering Atom machine, the
> > > > > > > remote host sees the packet. The remote machine also sees some
> > > > > > > UDP traffic originating from the Atom, like NTP queries.
> > > > > > >
> > > > > > > And, on receive, the Atom board does receive interrupts: the
> > > > > > > em0:rx 0 counter in vmstat -i increases. Even more fun, the
> > > > > > > sysctl dev.em.0.debug shows an increasing hw rdh (as I
> > > > > > > understand, this is the hardware 'last received' packet pointer
> > > > > > > for the rx ring).
> > > > > > > So I looked at the packet
> > > > > > > descriptor at hw rdt index, and there I see
> > > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > > >   errors = 0x0, special = 0x0}
> > > > > > >
> > > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof()
> > > > > > > function breaks from the loop without consuming the current
> > > > > > > packet. Also, it returns false because the DD bit is clear. This
> > > > > > > prevents em_msix_rx() from scheduling the taskqueue for
> > > > > > > processing. So the apparent cause of the hang is the missing
> > > > > > > DD bit in the descriptor.
> > > > > > >
> > > > > > > I am not sure whether all this is obvious to anybody who knows
> > > > > > > em internals, or where to go from there.
> > > > > >
> > > > > > Ok, nobody cares.
> > > > > >
> > > > > > Below is the workaround I use to prevent the interface wedging.
> > > > > > It seems that the sole PCI register read (namely, the rx ring head
> > > > > > read) and a subsequent recheck of the descriptor status greatly
> > > > > > reduce the likelihood of the issue. Unfortunately, the read does
> > > > > > not eliminate the hang completely.
> > > > > > So it is not some PCIe coherency problem.
> > > > > >
> > > > > > With the patch applied, I am able to copy around blu-ray images,
> > > > > > while previously the interface hung within 20-30 seconds of
> > > > > > 100Mbit/s traffic. Sometimes these messages are printed:
> > > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > > >
> > > > > > The machine is still dead due to random memory corruption which I
> > > > > > see; in particular, pmap sometimes reads garbage from PTEs. I have
> > > > > > no idea whether it is related to the em0 rx descriptor missed
> > > > > > writes, or is a different issue.
> > > > >
> > > > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > > > descriptor and stores a packet at the next descriptor?  That's just
> > > > > bizarre.
> > > > Either this, or it does store the packet but 'forgets' to update the
> > > > rx descriptor. I think that your interpretation is closer to reality,
> > > > since I get sustained 20MB/s over ssh with the patch even when the
> > > > workaround activates. Lost packets would probably cause retransmits
> > > > and a speed drop.
> > >
> > > This is just weird.  I wonder if there is a known errata for this?
> > > This really seems to be broken hardware and not a driver issue.
> > I was not able to find anything even remotely resembling the described
> > behaviour in the publicly available 82574L specification update. I looked
> > at rev.
> > 3.5, dated January 2012.
> >
> > I may indeed give up and relocate the hardware into the trash, but it
> > would be a pity, since this is a shiny new Intel Atom 2800 m/b. I am not
> > sure I can give the supplier convincing arguments for a warranty
> > replacement.
> >
> > And, while I had Debian booted to apply the f/w fix Jack recommended, I
> > did a quick test and the interface looked stable.
> >
> >
>
> FWIW, I've got an X7SPE-HF-D525 MB with 82574L running on a 7.0 driver
> that seems to work pretty well. It panics once in a blue moon when we
> overload it (like 200Mb/s of traffic) but it generally works ok.
>
> BC

Has anything been done or patched regarding this problem?

BC
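
For anyone picking up the thread, the receive-side condition described in the quoted messages can be sketched roughly as below. This is only an illustration under stated assumptions. It is not the workaround patch (which is not reproduced in this message) and not the em driver code. The descriptor fields follow the kgdb output quoted above, the Descriptor Done bit is taken to be 0x01 in the status byte, and the helper names (`rx_desc_done`, `ring_distance`, `rx_desc_looks_skipped`) and the idea of passing the ring size as a parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Legacy-style rx descriptor; field names follow the quoted kgdb output. */
struct rx_desc {
    uint64_t buffer_addr;
    uint16_t length;
    uint16_t csum;
    uint8_t  status;
    uint8_t  errors;
    uint16_t special;
};

/* Descriptor Done bit in the status byte (assumed value for this sketch). */
#define RXD_STAT_DD 0x01u

/* True when the hardware has written the descriptor back. */
static bool rx_desc_done(const struct rx_desc *d)
{
    return (d->status & RXD_STAT_DD) != 0;
}

/* Number of slots from index `from` forward to index `to` in a ring. */
static unsigned ring_distance(unsigned from, unsigned to, unsigned size)
{
    assert(size > 0 && from < size && to < size);
    return (to + size - from) % size;
}

/*
 * Hypothetical check: the hardware head has already moved past the
 * software's current index, yet the descriptor at that index still has
 * DD clear. This is the situation the "Workaround: head ... cur ..."
 * messages appear to report.
 */
static bool rx_desc_looks_skipped(const struct rx_desc *ring, unsigned size,
                                  unsigned cur, unsigned hw_head)
{
    return !rx_desc_done(&ring[cur]) &&
           ring_distance(cur, hw_head, size) > 0;
}
```

With the first logged triple (head 1018, cur 1010) and a ring of 1024 entries, which is an assumed size, `rx_desc_looks_skipped(ring, 1024, 1010, 1018)` is true while `ring[1010].status` is 0 and becomes false once the DD bit is set. A real driver would read the head index from the device register rather than take it as an argument.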