FreeBSD Mail Archives

Date:      Fri, 10 Aug 2012 11:26:08 +0300
From:      Konstantin Belousov <kostikbel@gmail.com>
To:        Barney Cordoba <barney_cordoba@yahoo.com>
Cc:        jfv@freebsd.org, Jack Vogel <jfvogel@gmail.com>, John Baldwin <jhb@freebsd.org>, net@freebsd.org
Subject:   Re: 82574L hangs (with r233708 e1000 driver).
Message-ID:  <20120810082608.GB2425@deviant.kiev.zoral.com.ua>
In-Reply-To: <1344525935.85341.YahooMailClassic@web121605.mail.ne1.yahoo.com>
References:  <1336775069.17927.YahooMailClassic@web126002.mail.ne1.yahoo.com> <1344525935.85341.YahooMailClassic@web121605.mail.ne1.yahoo.com>


--8GpibOaaTibBMecb
Content-Type: text/plain; charset=koi8-r
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Aug 09, 2012 at 08:25:35AM -0700, Barney Cordoba wrote:
>
> --- On Fri, 5/11/12, Barney Cordoba <barney_cordoba@yahoo.com> wrote:
>
> > From: Barney Cordoba <barney_cordoba@yahoo.com>
> > Subject: Re: 82574L hangs (with r233708 e1000 driver).
> > > To: "John Baldwin" <jhb@freebsd.org>, "Konstantin Belousov" <kostikbel@gmail.com>
> > Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> > Date: Friday, May 11, 2012, 6:24 PM
> >=20
> >=20
> > --- On Tue, 5/8/12, Konstantin Belousov <kostikbel@gmail.com>
> > wrote:
> >=20
> > > From: Konstantin Belousov <kostikbel@gmail.com>
> > > Subject: Re: 82574L hangs (with r233708 e1000 driver).
> > > To: "John Baldwin" <jhb@freebsd.org>
> > > > Cc: jfv@freebsd.org, "Jack Vogel" <jfvogel@gmail.com>, net@freebsd.org
> > > Date: Tuesday, May 8, 2012, 4:24 AM
> > > On Mon, May 07, 2012 at 01:44:57PM -0400, John Baldwin wrote:
> > > > On Friday, May 04, 2012 6:18:19 pm Konstantin Belousov wrote:
> > > > > On Fri, May 04, 2012 at 11:30:22AM -0400, John Baldwin wrote:
> > > > > > On Tuesday, May 01, 2012 12:21:21 pm Konstantin Belousov wrote:
> > > > > > > On Thu, Apr 12, 2012 at 09:38:49PM +0300, Konstantin Belousov wrote:
> > > > > > > > On Mon, Apr 09, 2012 at 12:19:39PM -0400, John Baldwin wrote:
> > > > > > > > > On Sunday, April 08, 2012 1:11:25 am Konstantin Belousov wrote:
> > > > > > > > > > On Sat, Apr 07, 2012 at 04:22:07PM -0700, Jack Vogel wrote:
> > > > > > > > > > > Make sure you have any firmware up to the latest available, if that
> > > > > > > > > > > doesn't help let me know and I'll check internally to see if there
> > > > > > > > > > > are any outstanding issues in shared code, that will be after the
> > > > > > > > > > > weekend.
> > > > > > > > > >
> > > > > > > > > > I had BIOS rev. 151, after your hint I found rev. 154 on the site.
> > > > > > > > > > Now BIOS reports itself as MTCDT10N.86A.0154.2012.0323.1601,
> > > > > > > > > > March 23.
> > > > > > > > > >
> > > > > > > > > > Unfortunately, the upgrade did not change anything with regard to
> > > > > > > > > > the hanging interface.
> > > > > > > > >
> > > > > > > > > Does reverting 233708 make any difference? Have you tried futzing
> > > > > > > > > around with kgdb when it is hung to see what state the device is in
> > > > > > > > > (software state at least)?
> > > > > > > > It does, in a sense that without r233708 the interface becomes stuck
> > > > > > > > almost immediately. I just upgraded to the e1000@r234154, which does
> > > > > > > > not change much.
> > > > > > > >
> > > > > > > > I fiddled with the adapter state after the hang in kgdb more, and I
> > > > > > > > noted something interesting. Apparently, tx works. When I ping the
> > > > > > > > remote host from my suffering atom machine, the remote host sees the
> > > > > > > > packet. Also the remote machine sees some udp traffic originating
> > > > > > > > from the atom, like ntp queries.
> > > > > > > >
> > > > > > > > And, on receive, the atom board does receive interrupts, em0:rx 0
> > > > > > > > counter in vmstat -i increases. Even more fun, the sysctl
> > > > > > > > dev.em.0.debug shows increasing hw rdh (as I understand, this is
> > > > > > > > hardware 'last received' packet pointer for rx ring). So I looked at
> > > > > > > > the packet descriptor at hw rdt index, and there I see
> > > > > > > > (kgdb) p/x ((struct adapter *)0xffffff80010e4000)->rx_rings->rx_base[78]
> > > > > > > > $11 = {buffer_addr = 0x12a128800, length = 0x5ea, csum = 0x3c2b, status = 0x0,
> > > > > > > >   errors = 0x0, special = 0x0}
> > > > > > > >
> > > > > > > > Apparently, the Descriptor Done bit is clear, so the em_rxeof()
> > > > > > > > function breaks from the loop, not consuming the current packet.
> > > > > > > > Also, it returns false due to DD bit clear. This prevents
> > > > > > > > em_msix_rx() from scheduling taskqueue for processing. So apparent
> > > > > > > > cause for the hang is missing DD bit in descriptor.
> > > > > > > >
> > > > > > > > I am not sure whether all this is obvious to anybody who knows em
> > > > > > > > internals, and where to go from there.
> > > > > > >
> > > > > > > Ok, nobody cares.
> > > > > > >
> > > > > > > Below is the workaround I use to prevent the interface wedging.
> > > > > > > It seems that the sole PCI register read (namely, the rx ring head read)
> > > > > > > and consequent recheck of the descriptor status greatly reduce the
> > > > > > > likelihood of the issue. Unfortunately, the read does not eliminate
> > > > > > > the hang completely. So it is not some PCIe coherency problem.
> > > > > > >
> > > > > > > With the patch applied, I am able to copy around blu-ray images, while
> > > > > > > previously the interface hung within 20-30 seconds of 100Mbit/s traffic.
> > > > > > > Sometimes the messages are printed:
> > > > > > > em0: Workaround: head 1018 tail 1002 cur 1010
> > > > > > > em0: Workaround: head 976 tail 973 cur 974
> > > > > > > em0: Workaround: head 950 tail 939 cur 946
> > > > > > > em0: Workaround: head 435 tail 419 cur 426
> > > > > > >
> > > > > > > Machine is still dead due to random memory corruption which I see, in
> > > > > > > particular, pmap sometimes reads garbage from PTEs. I have no idea
> > > > > > > whether it is related to em0 rx descriptor missed writes, or is a
> > > > > > > different issue.
> > > > > >
> > > > > > Humm, so if I'm reading this correctly, the card "skips" a receive
> > > > > > descriptor and stores a packet at the next descriptor? That's just
> > > > > > bizarre.
> > > > > Either this, or it does store the packet but 'forgets' to update the
> > > > > rx descriptor. I think that your interpretation is closer to reality,
> > > > > since I get sustained 20MB/s over ssh with the patch even when the
> > > > > workaround activates. The lost packets probably should cause retransmit
> > > > > and speed drop.
> > > >
> > > > This is just weird. I wonder if there is a known errata for this?
> > > > This really seems to be broken hardware and not a driver issue.
> > > I was not able to find anything even remotely resembling the described
> > > behaviour in the publicly available 82574L specification update. I looked
> > > at rev. 3.5, dated January 2012.
> > >
> > > I may indeed give up and relocate the hardware into trash, but it would be
> > > a pity, since this is a new shiny Intel Atom 2800 m/b. I am not sure I can
> > > give convincing arguments to the supplier for warranty replacement.
> > >
> > > And, while I booted Debian to apply the f/w fix Jack recommended, I did a
> > > quick test and the interface looked stable.
> > >
> >
> > FWIW, I've got an X7SPE-HF-D525 MB with 82574L running on a 7.0 driver
> > that seems to work pretty well. It panics once in a blue moon when we
> > overload it (like 200Mb/s of traffic) but it generally works ok.
> >
> > BC
>
> Has anything been done or patched regarding this problem?

Yes, it was fixed by replacing the hardware (by the same model).

--8GpibOaaTibBMecb
Content-Type: application/pgp-signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (FreeBSD)

iEYEARECAAYFAlAkxaAACgkQC3+MBN1Mb4hQWgCgn7gQMIJFo0Y+DuiLnm0WBdc7
h84AoJqsNNTQ57ouuQiFDuoVg230M8Ma
=/eWE
-----END PGP SIGNATURE-----

--8GpibOaaTibBMecb--

Want to link to this message? Use this URL: <https://mail-archive.FreeBSD.org/cgi/mid.cgi?20120810082608.GB2425>
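
The failure mode described in the thread (a receive descriptor whose
Descriptor Done bit never gets set, so the receive loop stops at it while
the hardware head keeps advancing) can be sketched in C as below. This is
only a simplified model under stated assumptions, not the e1000 driver code
and not the workaround patch, which is not reproduced in this message. The
descriptor fields follow the kgdb dump above; struct rx_ring_model,
rxeof_model() and rx_stalled() are hypothetical names introduced for
illustration, and hw_head stands in for a read of the hardware rx ring head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RXD_STAT_DD 0x01u /* Descriptor Done bit in the status byte */

/* Simplified stand-in for the rx descriptor shown in the kgdb dump. */
struct rx_desc {
    uint64_t buffer_addr;
    uint16_t length;
    uint16_t csum;
    uint8_t  status;
    uint8_t  errors;
    uint16_t special;
};

/* Hypothetical ring model; not the driver's own ring structure. */
struct rx_ring_model {
    struct rx_desc *base;
    uint32_t size;
    uint32_t next_to_check; /* "cur" in the workaround messages */
    uint32_t hw_head;       /* assumed value of a hardware head read */
};

/*
 * Consume descriptors until one without DD is found.
 * Returns the number of descriptors consumed; zero means there was
 * nothing to process, which is the case that stalls reception.
 */
static uint32_t rxeof_model(struct rx_ring_model *r)
{
    uint32_t done = 0;

    assert(r->size > 0);
    while (done < r->size &&
           (r->base[r->next_to_check].status & RXD_STAT_DD) != 0) {
        r->base[r->next_to_check].status = 0; /* hand the slot back */
        r->next_to_check = (r->next_to_check + 1) % r->size;
        done++;
    }
    return done;
}

/*
 * Stall signature assumed here: DD is clear at the current index,
 * yet the hardware head has already moved away from it.
 */
static bool rx_stalled(const struct rx_ring_model *r)
{
    bool dd = (r->base[r->next_to_check].status & RXD_STAT_DD) != 0;

    return !dd && r->hw_head != r->next_to_check;
}
```

In this model a normal idle ring has head equal to cur with DD clear, while
a head that has run ahead of cur with DD still clear matches the "head ...
tail ... cur ..." messages quoted above. What a real driver should do once
the condition is detected is left open here.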