From owner-freebsd-net@FreeBSD.ORG Fri May 11 22:26:13 2012
Message-ID: <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com>
Date: Fri, 11 May 2012 15:24:29 -0700 (PDT)
From: Barney Cordoba
To: John Baldwin, Konstantin Belousov
Cc: jfv@freebsd.org, Jack Vogel, net@freebsd.org
In-Reply-To: <20120508082403.GS2358@deviant.kiev.zoral.com.ua>
Subject: Re: 82574L hangs (with r233708 e1000 driver).
List-Id: Networking and TCP/IP with FreeBSD

--- On Tue, 5/8/12, Konstantin Belousov wrote:

> From: Konstantin Belousov
> Subject: Re: 82574L hangs (with r233708 e1000 driver).
> To: "John Baldwin"
> Cc: jfv@freebsd.org, "Jack Vogel", net@freebsd.org
> Date: Tuesday, May 8, 2012, 4:24 AM
> On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > On Sunday, April
> > > > > > > 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > Make sure you have any firmware up to the latest
> > > > > > > > > available, if that doesn't help let me know and I'll
> > > > > > > > > check internally to see if there are any outstanding
> > > > > > > > > issues in shared code, that will be after the weekend.
> > > > > > > >
> > > > > > > > I had BIOS rev. 151, after your hint I found rev. 154 on
> > > > > > > > the site. Now BIOS reports itself as
> > > > > > > > MTCDT10N.86A.0154.2012.0323.1601, March 23.
> > > > > > > >
> > > > > > > > Unfortunately, the upgrade did not change anything in
> > > > > > > > regard to the hanging interface.
> > > > > > >
> > > > > > > Does reverting 233708 make any difference? Have you tried
> > > > > > > futzing around with kgdb when it is hung to see what state
> > > > > > > the device is in (software state at least)?
> > > > > > It does, in the sense that without r233708 the interface becomes
> > > > > > stuck almost immediately. I just upgraded to e1000@r234154,
> > > > > > which does not change much.
> > > > > >
> > > > > > I fiddled with the adapter state after the hang in kgdb more,
> > > > > > and I noted something interesting. Apparently, tx works. When I
> > > > > > ping the remote host from my suffering atom machine, the remote
> > > > > > host sees the packet. The remote machine also sees some udp
> > > > > > traffic originating from the atom, like ntp queries.
> > > > > >
> > > > > > And, on receive, the atom board does receive interrupts: the
> > > > > > em0:rx 0 counter in vmstat -i increases.
> > > > > > Even more fun, the sysctl dev.em.0.debug shows an increasing
> > > > > > hw rdh (as I understand, this is the hardware 'last received'
> > > > > > packet pointer for the rx ring). So I looked at the packet
> > > > > > descriptor at the hw rdt index, and there I see
> > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > >    errors = 0x0, special = 0x0}
> > > > > >
> > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof()
> > > > > > function breaks from the loop, not consuming the current
> > > > > > packet. Also, it returns false due to the DD bit being clear.
> > > > > > This prevents em_msix_rx() from scheduling the taskqueue for
> > > > > > processing. So the apparent cause of the hang is the missing DD
> > > > > > bit in the descriptor.
> > > > > >
> > > > > > I am not sure whether all this is obvious to anybody who knows
> > > > > > em internals, or where to go from there.
> > > > >
> > > > > Ok, nobody cares.
> > > > >
> > > > > Below is the workaround I use to prevent the interface wedging.
> > > > > It seems that the sole PCI register read (namely, the rx ring
> > > > > head read) and the subsequent recheck of the descriptor status
> > > > > greatly reduce the likelihood of the issue. Unfortunately, the
> > > > > read does not eliminate the hang completely.
> > > > > So it is not some PCIe coherency problem.
> > > > >
> > > > > With the patch applied, I am able to copy around blu-ray images,
> > > > > while previously the interface hung within 20-30 seconds of
> > > > > 100Mbit/s traffic. Sometimes these messages are printed:
> > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > >
> > > > > The machine is still dead due to random memory corruption which I
> > > > > see; in particular, pmap sometimes reads garbage from PTEs. I
> > > > > have no idea whether it is related to the em0 rx descriptor
> > > > > missed writes, or is a different issue.
> > > >
> > > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > > descriptor and stores a packet at the next descriptor? That's just
> > > > bizarre.
> > > Either this, or it does store the packet but 'forgets' to update the
> > > rx descriptor. I think that your interpretation is closer to
> > > reality, since I get sustained 20MB/s over ssh with the patch even
> > > when the workaround activates. The lost packets probably should
> > > cause retransmits and a speed drop.
> >
> > This is just weird. I wonder if there is a known errata for this?
> > This really seems to be broken hardware and not a driver issue.
> I was not able to find anything even remotely resembling the described
> behaviour in the publicly available 82574L specification update. I
> looked at rev. 3.5, dated January 2012.
>
> I may indeed give up and relocate the hardware into the trash, but it
> would be a pity, since this is a shiny new Intel Atom 2800 m/b.
> I am not sure I can give convincing arguments to the supplier for a
> warranty replacement.
>
> And, while I booted Debian to apply the f/w fix Jack recommended, I did
> a quick test and the interface looked stable.
>
>

FWIW, I've got an X7SPE-HF-D525 MB with an 82574L running on a 7.0 driver
that seems to work pretty well. It panics once in a blue moon when we
overload it (like 200Mb/s of traffic), but it generally works ok.

BC