Date: Thu, 9 Aug 2012 08:25:35 -0700 (PDT)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: John Baldwin <jhb@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>
Cc: jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject: Re: 82574L hangs (with r233708 e1000 driver).
Message-ID: <1344525935.85341.YahooMailClassic@web121605.mail.ne1.yahoo.com>
In-Reply-To: <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com>
--- On Fri, 5/11/12, Barney Cordoba <barney_cordoba@yahoo.com> wrote:

> From: Barney Cordoba <barney_cordoba@yahoo.com>
> Subject: Re: 82574L hangs (with r233708 e1000 driver).
> To: "John Baldwin" <jhb@freebsd.org>, "Konstantin Belousov" <kostikbel@gmail.com>
> Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> Date: Friday, May 11, 2012, 6:24 PM
>
>
> --- On Tue, 5/8/12, Konstantin Belousov <kostikbel@gmail.com> wrote:
>
> > From: Konstantin Belousov <kostikbel@gmail.com>
> > Subject: Re: 82574L hangs (with r233708 e1000 driver).
> > To: "John Baldwin" <jhb@freebsd.org>
> > Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> > Date: Tuesday, May 8, 2012, 4:24 AM
> > On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > > Make sure you have any firmware up to the latest available, if that doesn't help
> > > > > > > > > > let me know and I'll check internally to see if there are any outstanding issues
> > > > > > > > > > in shared code, that will be after the weekend.
> > > > > > > > >
> > > > > > > > > I had BIOS rev. 151, after you hint I found rev. 154 on the site.
> > > > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > > > March 23.
> > > > > > > > >
> > > > > > > > > Unfortunately, upgrade did not changed anything in regard of hanging
> > > > > > > > > interface.
> > > > > > > >
> > > > > > > > Does reverting 233708 make any difference? Have you tried futzing around with
> > > > > > > > kgdb when it is hung to see what state the device is in (software state at
> > > > > > > > least)?
> > > > > > > It does, in a sense that without r233708 the interface becomes stuck
> > > > > > > almost immediately. I just upgraded to the e1000@r234154, which does not
> > > > > > > change much.
> > > > > > >
> > > > > > > I fiddled with the adapter state after the hang in kgdb more, and I
> > > > > > > noted something interesting. Apparently, tx works. When I ping the remote
> > > > > > > host from my suffering atom machine, remote host sees the packet. Also
> > > > > > > remote machine sees some udp traffic originating from the tom, like
> > > > > > > ntp queries.
> > > > > > >
> > > > > > > And, on receive, the atom board does receive interrupts, em0:rx 0 counter
> > > > > > > in vmstat -i increases. Even more fun, the sysctl dev.em.0.debug
> > > > > > > shows increasing hw rdh (as I understand, this is hardware 'last
> > > > > > > received' packet pointer for rx ring). So I looked at the packet
> > > > > > > descriptor at hw rdt index, and there I see
> > > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > > >    errors = 0x0, special = 0x0}
> > > > > > >
> > > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof() function
> > > > > > > breaks from the loop, not consuming the current packet. Also, it returns
> > > > > > > false due to DD bit clear. This prevents em_msix_rx() from scheduling
> > > > > > > taskqueue for processing. So apparent cause for the hang is missing
> > > > > > > DD bit in descriptor.
> > > > > > >
> > > > > > > I am not sure isn't all this is obvious for anybody who knows em
> > > > > > > internals, and were to go from there.
> > > > > >
> > > > > > Ok, nobody cares.
> > > > > >
> > > > > > Below is the workaround I use to prevent the interface wedging.
> > > > > > It seems that the sole PCI register read (namely, the rx ring head read)
> > > > > > and consequent recheck of the descriptor status greatly reduce the
> > > > > > likelihood of the issue. Unfortunately, the read does not eliminate
> > > > > > the hang completely. So it is not some PCIe coherency problem.
> > > > > >
> > > > > > With the patch applied, I am able to copy around blu-ray images, while
> > > > > > previously the interface hang in 20-30 seconds of 100Mbit/s traffic.
> > > > > > Sometimes the messages are printed:
> > > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > > >
> > > > > > Machine is still dead due to random memory corruption which I see, in
> > > > > > particular, pmap sometimes read garbage from PTEs. I have no idea is
> > > > > > it related to em0 rx descriptor missed writes, or is a different issue.
> > > > >
> > > > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > > > descriptor and stores a packet at the next descriptor? That's just
> > > > > bizarre.
> > > > Either this, or it does store the packet but 'forgots' to update the
> > > > rx descriptor. I think that your interpretation is closer to reality,
> > > > since I get sustained 20MB/s over ssh with the patch even when workaround
> > > > activates. The lost packets probably should cause retransmit and speed
> > > > drop.
> > >
> > > This is just weird. I wonder if there is a known errata for this?
> > > This really seems to be broken hardware and not a driver issue.
> > I was not able to find anything even remotely resembling the described
> > behaviour, in the publically available 82574L specification update. I looked
> > at rev. 3.5, dated January 2012.
> >
> > I may indeed give up and relocate the hardware into trash, but it would be
> > pity, since this is new shiny Intel Atom 2800 m/b. I am not sure I can give
> > convincing arguments to supplier for warranty replacement.
> >
> > And, while I booted Debian to apply f/w fix Jack recommended, I did
> > quick test and interface looked stable.
> >
> >
>
> FWIW, I've got an X7SPE-HF-D525 MB with 82574L running on a 7.0 driver
> that seems to work pretty well. It panics once in a blue moon when we
> overload it (like 200Mb/s of traffic) but it generally works ok.
>
> BC

Has anything been done or patched regarding this problem?

BC
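
For anyone following the quoted analysis, here is a minimal user-space sketch in C of the Descriptor Done check it describes. This is not the patch from the thread and not the driver's code. The struct only mirrors the fields kgdb printed, `RXD_STAT_DD` is assumed to be 0x01, and the names `rx_desc`, `rx_consumable` and `rx_looks_wedged` are hypothetical. The "wedged" test (DD clear at the software index while the hardware head has already moved on) is my reading of the "head/tail/cur" workaround messages, so treat it as an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RXD_STAT_DD 0x01	/* assumed value of the Descriptor Done bit */

/* Hypothetical mirror of the rx descriptor fields shown in the kgdb output. */
struct rx_desc {
	uint64_t buffer_addr;
	uint16_t length;
	uint16_t csum;
	uint8_t  status;
	uint8_t  errors;
	uint16_t special;
};

/*
 * Count the descriptors a receive loop would consume starting at 'cur'.
 * Like the loop described in the thread, it stops at the first
 * descriptor whose DD bit is clear.
 */
static size_t
rx_consumable(const struct rx_desc *ring, size_t size, size_t cur)
{
	size_t n = 0;

	assert(size > 0 && cur < size);
	while (n < size && (ring[(cur + n) % size].status & RXD_STAT_DD) != 0)
		n++;
	return (n);
}

/*
 * Assumed wedge condition: the descriptor at 'cur' has DD clear even
 * though the hardware head index has already advanced past it.
 */
static bool
rx_looks_wedged(const struct rx_desc *ring, size_t size, size_t cur,
    size_t hw_head)
{
	assert(size > 0 && cur < size && hw_head < size);
	return ((ring[cur].status & RXD_STAT_DD) == 0 && hw_head != cur);
}
```

With a descriptor like the one in the kgdb dump (status 0x0) at the software index, `rx_consumable()` returns 0, so nothing further is processed. In this simplified model, `rx_looks_wedged()` is the condition under which a driver could log a message and recheck the descriptor.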