From: Daniel Genis
To: freebsd-stable@freebsd.org
Subject: intel 10gbe nic bug in 10.3 - no carrier
Date: Tue, 10 Jan 2017 11:32:29 +0100
Message-ID: <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>

Hello everyone,

we're trying to tackle a rare bug that is very hard to debug. Our 10.3-RELEASE servers can reboot after a panic and subsequently come up without network (both NICs report "no carrier"). We've seen this on 10.3-RELEASE-p0; we had never seen it before that.

root@storage ~ # pciconf -lv | grep -B3 network
ix0@pci0:2:0:0: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network
--
ix1@pci0:2:0:1: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
    vendor     = 'Intel Corporation'
    device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
    class      = network

Our network is configured as active/passive using lagg (/etc/rc.conf):

ifconfig_ix0="up"
ifconfig_ix1="up"
cloned_interfaces="lagg0"
ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 10.1.2.31/16"

After the panic reboot, the interfaces show up like this:

ix0: flags=8843 metric 0 mtu 1500
    options=8407bb
    ether 60:08:10:d0:4e:9f
    nd6 options=29
    media: (autoselect)
    status: no carrier
ix1: flags=8843 metric 0 mtu 1500
    options=8407bb
    ether 60:08:10:d0:4e:9f
    nd6 options=29
    media: (autoselect)
    status: no carrier

The network switch sees the connection as online.
The LEDs on the NICs suggest the same: they show the link as online (the LEDs are lit as in normal operation). Unplugging and replugging the network cable brings the network online. Shutting down the port on the switch and re-enabling it will also bring the network online. However, another reboot returns the machine to the no-carrier state.

I've built various kernels trying to find the regression, without success. I tried porting the 10.2 NIC driver (2.8.3) to 10.3, and subsequently the lagg code as well. I also ported NIC driver 3.1.14 from pfSense into 10.3-STABLE (2 December kernel), to no avail; porting the lagg code from 10.2 did not make any difference either. After rebooting with those kernels, the server remains in the no-carrier state.

We install our systems using mfsbsd for PXE boot. If I boot a machine that is in the "no carrier" state using the 10.3 PXE boot, both NICs come online. If I then boot from disk again, the machine returns to the "no carrier" state.

Recently we got some new machines, so we can shuffle more around and also try to debug this. We base-installed one of them using mfsbsd 10.3 PXE and configured it as always. Interestingly, here one of the two NICs entered the "no carrier" state while the other remained operational. This persisted across reboots.

The issue disappears after many reboots, but it's not conclusive. I've had 2 machines to experiment with. On one, the problem disappeared on its own after a reboot (kernel 10.3-STABLE git hash d99ba5c aka r299900(?)). On the other one, I PXE-booted a 10.1 live environment and then booted into kernel 10.3-STABLE git hash 3673260fc9 aka r308456(?). But I don't think anything can be concluded from that. That was the machine which had both NICs online after booting into the 10.3 PXE environment but then returned to the no-carrier state when booting 10.3 from disk.

We also tried many sysctl flags (and many reboots), but without success.
For example: hw.ix.enable_msix=0 and hw.ix.enable_msi=0.

At the moment I have no spare/empty machine in this state; we will, though, empty one machine that currently has this state (but is in production right now). I don't know how to trigger this state manually, which doesn't help with debugging.

I could link references where others report similar issues, for example https://www.reddit.com/r/PFSENSE/comments/45bcuq/10_gig_woes/
There they suggest that the new Intel NIC driver 3.1.14 fixes it, though I was not able to resolve the state by booting into a kernel with this driver.

If I can provide any additional information, please do not hesitate to ask. Any tips and suggestions for debugging are most welcome!

With kind regards,
Daniel
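
Addendum: to spot machines that come up in this state, the "status:" line in the ifconfig output can be checked from a script. The following is only a minimal sketch in Python, assuming the per-interface layout shown earlier in this message (an unindented "name: flags=..." line followed by indented detail lines, one of which is "status: ..."). The function name no_carrier_interfaces is made up for illustration and is not part of any FreeBSD tool.

```python
import re

# Start of an interface block in ifconfig output, e.g. "ix0: flags=8843 ..."
IFACE_RE = re.compile(r"^([A-Za-z0-9_.]+): flags=")


def no_carrier_interfaces(ifconfig_text):
    """Return the names of interfaces whose status line reads 'no carrier'.

    ifconfig_text is the captured text output of `ifconfig` (assumed layout:
    one unindented header line per interface, then indented detail lines).
    """
    current = None   # interface whose block we are currently reading
    affected = []
    for line in ifconfig_text.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        # Detail lines are indented; compare after stripping whitespace.
        if current is not None and line.strip() == "status: no carrier":
            affected.append(current)
    return affected
```

Fed the captured output of ifconfig from a machine in the state described above, this should report both ix0 and ix1; on the machine where only one NIC was affected it should report just that one. It only detects the state, it does nothing to recover the link.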