Date: Fri, 11 May 2012 15:24:29 -0700 (PDT)
From: Barney Cordoba <barney_cordoba@yahoo.com>
To: John Baldwin <jhb@freebsd.org>, Konstantin Belousov <kostikbel@gmail.com>
Cc: jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, net@freebsd.org
Subject: Re: 82574L hangs (with r233708 e1000 driver).
Message-ID: <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com>
In-Reply-To: <20120508082403.GS2358@deviant.kiev.zoral.com.ua>

--- On Tue, 5/8/12, Konstantin Belousov <kostikbel@gmail.com> wrote:

> From: Konstantin Belousov <kostikbel@gmail.com>
> Subject: Re: 82574L hangs (with r233708 e1000 driver).
> To: "John Baldwin" <jhb@freebsd.org>
> Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> Date: Tuesday, May 8, 2012, 4:24 AM
> On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > Make sure you have any firmware up to the latest available,
> > > > > > > > > if that doesn't help let me know and I'll check internally
> > > > > > > > > to see if there are any outstanding issues in shared code,
> > > > > > > > > that will be after the weekend.
> > > > > > > >
> > > > > > > > I had BIOS rev. 151, after you hint I found rev. 154 on the site.
> > > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > > March 23.
> > > > > > > >
> > > > > > > > Unfortunately, upgrade did not changed anything in regard of
> > > > > > > > hanging interface.
> > > > > > >
> > > > > > > Does reverting 233708 make any difference?  Have you tried futzing
> > > > > > > around with kgdb when it is hung to see what state the device is in
> > > > > > > (software state at least)?
> > > > > > It does, in a sense that without r233708 the interface becomes stuck
> > > > > > almost immediately. I just upgraded to the e1000@r234154, which does
> > > > > > not change much.
> > > > > >
> > > > > > I fiddled with the adapter state after the hang in kgdb more, and I
> > > > > > noted something interesting. Apparently, tx works. When I ping the
> > > > > > remote host from my suffering atom machine, remote host sees the
> > > > > > packet. Also remote machine sees some udp traffic originating from
> > > > > > the tom, like ntp queries.
> > > > > >
> > > > > > And, on receive, the atom board does receive interrupts, em0:rx 0
> > > > > > counter in vmstat -i increases. Even more fun, the sysctl
> > > > > > dev.em.0.debug shows increasing hw rdh (as I understand, this is
> > > > > > hardware 'last received' packet pointer for rx ring). So I looked at
> > > > > > the packet descriptor at hw rdt index, and there I see
> > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > >    errors = 0x0, special = 0x0}
> > > > > >
> > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof()
> > > > > > function breaks from the loop, not consuming the current packet.
> > > > > > Also, it returns false due to DD bit clear. This prevents
> > > > > > em_msix_rx() from scheduling taskqueue for processing. So apparent
> > > > > > cause for the hang is missing DD bit in descriptor.
> > > > > >
> > > > > > I am not sure isn't all this is obvious for anybody who knows em
> > > > > > internals, and were to go from there.
> > > > >
> > > > > Ok, nobody cares.
> > > > >
> > > > > Below is the workaround I use to prevent the interface wedging.
> > > > > It seems that the sole PCI register read (namely, the rx ring head
> > > > > read) and consequent recheck of the descriptor status greatly reduce
> > > > > the likelihood of the issue. Unfortunately, the read does not
> > > > > eliminate the hang completely. So it is not some PCIe coherency
> > > > > problem.
> > > > >
> > > > > With the patch applied, I am able to copy around blu-ray images, while
> > > > > previously the interface hang in 20-30 seconds of 100Mbit/s traffic.
> > > > > Sometimes the messages are printed:
> > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > >
> > > > > Machine is still dead due to random memory corruption which I see, in
> > > > > particular, pmap sometimes read garbage from PTEs. I have no idea is
> > > > > it related to em0 rx descriptor missed writes, or is a different
> > > > > issue.
> > > >
> > > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > > descriptor and stores a packet at the next descriptor?  That's just
> > > > bizarre.
> > > Either this, or it does store the packet but 'forgots' to update the
> > > rx descriptor. I think that your interpretation is closer to reality,
> > > since I get sustained 20MB/s over ssh with the patch even when
> > > workaround activates. The lost packets probably should cause
> > > retransmit and speed drop.
> >
> > This is just weird.  I wonder if there is a known errata for this?
> > This really seems to be broken hardware and not a driver issue.
> I was not able to find anything even remotely resembling the described
> behaviour, in the publically available 82574L specification update.
> I looked at rev. 3.5, dated January 2012.
>
> I may indeed give up and relocate the hardware into trash, but it would
> be pity, since this is new shiny Intel Atom 2800 m/b. I am not sure I
> can give convincing arguments to supplier for warranty replacement.
>
> And, while I booted Debian to apply f/w fix Jack recommended, I did
> quick test and interface looked stable.
>
>

FWIW, I've got an X7SPE-HF-D525 MB with 82574L running on a 7.0 driver
that seems to work pretty well. It panics once in a blue moon when we
overload it (like 200Mb/s of traffic) but it generally works ok.

BC