Date: Fri, 17 Mar 2017 16:41:01 +0100
From: "O. Hartmann"
To: Alexander Leidinger
Cc: freebsd-current@freebsd.org, sbruno@freebsd.org, mmacy@nextbsd.org
Subject: Re: CURRENT: massive em0 NIC problems since IFLIB changes/introduction
Message-ID: <20170317164101.0518ac67@thor.intern.walstatt.dynvpn.de>
In-Reply-To: <20170317141501.Horde.YTCr8GuMV2yI1YaUkdRTLlu@webmail.leidinger.net>

On Fri, 17 Mar 2017 14:15:01 +0100, Alexander Leidinger wrote:

> Quoting "O. Hartmann" (from Fri, 17 Mar 2017 12:20:18 +0100):
>
> > Since the introduction of the IFLIB changes, I have noticed severe problems
> > on CURRENT.
>
> I already reported something like this to sbruno@ and M. Macy (in copy).
>
> > Running the most recent CURRENT (FreeBSD 12.0-CURRENT #27 r315442: Fri Mar 17
> > 10:46:04 CET 2017 amd64), the problems on a workstation became severe within
> > the past two days:
> >
> > For a couple of weeks now, the em0 NIC (Intel i217-LM, see below) has been
> > dying under heavy I/O. I first noticed this when rsyncing poudriere
> > repositories to a remote NFSv4 (automounted) folder. The em0 device could be
> > revived with an ifconfig down/up cycle.
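The ifconfig down/up recovery above could be combined with the reboot fallback Alexander runs from crontab in this thread. The result would be an escalating watchdog: ping, then cycle the interface, then reboot.

This is a hypothetical sketch and not part of either report. The function names, the 10-second settle time, the 5-second ping timeout and the example host 192.0.2.1 (a documentation address) are all assumptions. It must run as root for ifconfig and shutdown to work. The ping flags follow FreeBSD ping(8), where -c is the packet count and -t is an overall timeout in seconds.

```python
import subprocess
import time
from typing import Callable, List, Sequence

# A runner executes a command and returns its exit status; injectable for testing.
Runner = Callable[[Sequence[str]], int]


def run(cmd: Sequence[str]) -> int:
    """Run cmd and return its exit status (0 means success)."""
    return subprocess.run(list(cmd)).returncode


def link_alive(host: str, runner: Runner = run) -> bool:
    # FreeBSD ping(8): -c 1 sends a single echo request, -t 5 exits after 5 s.
    return runner(["ping", "-c", "1", "-t", "5", host]) == 0


def recover(iface: str, host: str, runner: Runner = run,
            settle: float = 10.0) -> List[str]:
    """Check the link and escalate from ifconfig down/up to a reboot.

    Returns the recovery steps taken ([] if the link was already up).
    """
    steps: List[str] = []
    if link_alive(host, runner):
        return steps
    # Step 1: the down/up cycle that revived em0 in the earlier failures.
    runner(["ifconfig", iface, "down"])
    runner(["ifconfig", iface, "up"])
    steps.append("ifconfig down/up")
    time.sleep(settle)  # give the link time to renegotiate (assumed value)
    if link_alive(host, runner):
        return steps
    # Step 2: last resort, as in the crontab workaround.
    runner(["shutdown", "-r", "now"])
    steps.append("reboot")
    return steps


if __name__ == "__main__":
    # Hypothetical values: use the real interface and a reachable host instead.
    print(recover("em0", "192.0.2.1"))
```

Trying the down/up cycle first keeps a reboot for the case reported here, where em0 no longer comes back on its own.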
> > But the i217-LM is not the only chip affected. On another box equipped with
> > an i350 dual-port GBit NIC, I observed similar behaviour under (artificially)
> > high I/O load (but I didn't investigate further, since it occurred very
> > seldom).
>
> It's not only those chipsets.
>
> It may be beneficial if you could provide the pciconf output for those
> devices. Mine is:
> ---snip---
> em0@pci0:2:6:0: class=0x020000 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82541PI Gigabit Ethernet Controller'
> ---snip---
>
> > Now, since around yesterday, the i217-LM dies without being revivable with
> > ifconfig down/up: doing so, my FreeBSD CURRENT machine (Fujitsu Celsius M740)
>
> I don't know whether a simple down/up would help with the chip where I see
> this issue (it's a headless server in a remote datacenter). For the moment
> I'm using the workaround of something like "ping -C 1 || shutdown -r now" in
> crontab.
>
> The system in question is at r314137.
>
> > remains with a dead em0 device, stuck in that state and on some occasions
> > reporting "no route". Every attempt to re-establish the route manually
> > fails; only rebooting the box gives some relief.
> >
> > On the console, I see some very strange reports:
> >
> > - ping suddenly reports no buffer space
> > - or I sometimes see massive occurrences of "em0: TX(0) desc avail = 1024,
> >   pidx = 0" on the console
>
> I don't see this in messages or the console log, but I see in the logs that
> ntpd can't resolve hostnames.
>
> > Either way, sending/receiving large files over an established GBit network
> > line, which can be saturated at approx 100 MBytes/s, tends to make the NIC
> > fail.
>
> I can report that an "svnlite update" of the FreeBSD src tree on the box is
> able to trigger the issue in my case.
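As a sanity check on the "approx 100 MBytes/s" figure, consider what gigabit Ethernet can actually carry. A 1 Gbit/s line moves 125 MB/s of raw bits. After Ethernet framing and IPv4/TCP headers, roughly 118 MB/s of payload remains. So a transfer at about 100 MBytes/s is indeed running the link close to saturation.

The sketch below does that arithmetic. It assumes a 1500-byte MTU, TCP without options and decimal megabytes. TCP timestamps or other options would lower the result slightly.

```python
# Hypothetical back-of-the-envelope estimate of TCP payload throughput on 1 GbE.
LINE_RATE_BPS = 1_000_000_000  # 1000BASE-T line rate, bits per second
MTU = 1500                     # bytes of IP packet per Ethernet frame (assumed)
IP_TCP_HEADERS = 20 + 20       # IPv4 + TCP headers, no options (assumed)
# Per-frame cost on the wire beyond the IP packet: preamble + SFD (8),
# MAC header (14), FCS (4), inter-frame gap (12).
ETH_OVERHEAD = 8 + 14 + 4 + 12


def tcp_goodput_mb_per_s(line_rate_bps: int = LINE_RATE_BPS, mtu: int = MTU,
                         ip_tcp: int = IP_TCP_HEADERS,
                         eth: int = ETH_OVERHEAD) -> float:
    """Application payload throughput in decimal megabytes per second."""
    payload = mtu - ip_tcp   # 1460 bytes of data per full-size frame
    on_wire = mtu + eth      # 1538 bytes of line time per frame
    return line_rate_bps / 8 * payload / on_wire / 1e6


if __name__ == "__main__":
    print(f"raw line rate: {LINE_RATE_BPS / 8 / 1e6:.0f} MB/s")  # 125 MB/s
    print(f"TCP payload:   {tcp_goodput_mb_per_s():.1f} MB/s")   # about 118.7 MB/s
```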
>
> I have to add that before the iflib changes I saw frequent em watchdog
> timeouts in the logs / dmesg. So for me we have two issues here:
> - the hardware wasn't 100% supported before the iflib changes (it seems)
> - the iflib changes have lost some watchdog functionality /
>   auto-failure-recovery feature
>
> Bye,
> Alexander.
>

In January (18.01.2017), I reported some strange behaviour of the same box to
Sean Bruno, along with some details of the hardware (which I forgot to include
in the email you're responding to, sorry), so here they are again:

[...]

Again, here is the pciconf output of the device:

em0@pci0:0:25:0: class=0x020000 card=0x11ed1734 chip=0x153a8086 rev=0x05 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = 'Ethernet Connection I217-LM'
    class      = network
    subclass   = ethernet
    bar   [10] = type Memory, range 32, base 0xfb300000, size 131072, enabled
    bar   [14] = type Memory, range 32, base 0xfb339000, size 4096, enabled
    bar   [18] = type I/O Port, range 32, base 0xf020, size 32, enabled

[...]

The problem has become severe within the past two days. I ran CURRENT
buildworlds on a daily basis, ran poudriere builds several times, and tried to
sync them to the package repository server. Since yesterday, that has failed
dramatically as described above.

-- 
O. Hartmann

I object to the use or transmission of my data for advertising purposes or for
market or opinion research (§ 28 Abs. 4 BDSG).
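In the pciconf lines posted in this thread, the card= and chip= fields each pack two 16-bit PCI IDs:

- chip holds the device ID in its upper half and the vendor ID in its lower half.
- card holds the subsystem device ID and the subsystem vendor ID in the same way.

So chip=0x153a8086 decodes to vendor 0x8086 (Intel) and device 0x153a (the I217-LM). card=0x11ed1734 decodes to subvendor 0x1734, which the pci.ids database lists as Fujitsu Technology Solutions, and subdevice 0x11ed.

The helper below is a hypothetical sketch for splitting them. Its name and return format are illustrative. It assumes the chip=/card= layout shown here; other pciconf versions may print the line differently.

```python
import re
from typing import Dict

# Matches the "name@selector: key=value ..." header line of `pciconf -l`
# in the chip=/card= layout used in this thread (assumed format).
_LINE = re.compile(r"^(?P<name>\S+)@(?P<sel>pci[\d:]+):\s+(?P<fields>.*)$")


def decode_pciconf(line: str) -> Dict[str, str]:
    """Split the packed chip/card fields of one pciconf line into PCI IDs."""
    m = _LINE.match(line.strip())
    if m is None:
        raise ValueError("not a pciconf -l device line")
    fields = dict(kv.split("=", 1) for kv in m.group("fields").split())
    chip = int(fields["chip"], 16)
    card = int(fields["card"], 16)
    return {
        "device": m.group("name"),
        "selector": m.group("sel"),  # pciDOMAIN:BUS:SLOT:FUNCTION
        # chip = (device ID << 16) | vendor ID
        "vendor_id": f"0x{chip & 0xFFFF:04x}",
        "device_id": f"0x{chip >> 16:04x}",
        # card = (subsystem device ID << 16) | subsystem vendor ID
        "subvendor_id": f"0x{card & 0xFFFF:04x}",
        "subdevice_id": f"0x{card >> 16:04x}",
        "class": fields["class"],
    }


if __name__ == "__main__":
    print(decode_pciconf("em0@pci0:0:25:0: class=0x020000 card=0x11ed1734 "
                         "chip=0x153a8086 rev=0x05 hdr=0x00"))
```

Splitting the IDs this way makes it easier to compare the affected NICs (82541PI, I217-LM, i350) against the device tables of the em(4)/iflib driver.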