Date: Tue, 24 Jan 2017 17:57:06 +0000
From: Eric Joyner <erj@freebsd.org>
To: Daniel Genis <daniel.genis@gmx.de>, freebsd-stable@freebsd.org
Subject: Re: intel 10gbe nic bug in 10.3 - no carrier
Message-ID: <CA%2Bb0zg8pu7_K1Aqj3wifZb%2BzcB5iGshq7eVEMHKu30y994QzaA@mail.gmail.com>
In-Reply-To: <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>
References: <11f0e9e6-cfe7-1cc7-49a0-4bc42fd0f99a@gmx.de>

On Tue, Jan 10, 2017 at 2:38 AM Daniel Genis <daniel.genis@gmx.de> wrote:

> Hello everyone,
>
> we're trying to tackle a rare bug that is very hard to debug.
>
> Our 10.3-RELEASE servers can panic boot and subsequently can come up
> without network (2x - no carrier). We've seen this on 10.3-RELEASE-p0;
> we had never seen this before.
>
> root@storage ~ # pciconf -lv | grep -B3 network
> ix0@pci0:2:0:0: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>     class      = network
> --
> ix1@pci0:2:0:1: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
>     vendor     = 'Intel Corporation'
>     device     = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
>     class      = network
>
> Our network is configured as active/passive using lagg. (/etc/rc.conf):
>
> ifconfig_ix0="up"
> ifconfig_ix1="up"
> cloned_interfaces="lagg0"
> ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 10.1.2.31/16"
>
> After a panic boot the network shows up like this:
>
> ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     options=8407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
>     ether 60:08:10:d0:4e:9f
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>     media: <unknown type> (autoselect)
>     status: no carrier
> ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
>     options=8407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
>     ether 60:08:10:d0:4e:9f
>     nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
>     media: <unknown type> (autoselect)
>     status: no carrier
>
> The network switch sees the connection as online. The LEDs of the nics
> suggest the same, they see the network as online (LEDs are on like in
> normal operation). Unplugging/replugging the network cable will get the
> network online. Shutting the port on the switch and reenabling it will
> also get the network online. However another reboot will return the
> machine into the no-carrier state.
>
> I've built various kernels trying to find where the regression is
> without success. I tried porting the 10.2 nic driver (2.8.3) to 10.3 and
> subsequently the lagg code as well. I ported nic driver 3.1.14 from
> pfsense into 10.3-STABLE (2 december kernel) to no avail, also porting
> lagg code from 10.2 did not make any difference. Rebooting with those
> kernels the server remains in the no carrier state.
>
> We install our systems using mfsbsd for PXE boot. If I boot a machine
> which has the "no carrier" state using the 10.3 PXE boot, both nics
> come online. If I then boot from disk again the machine returns into the
> "no carrier" state.
>
> Recently we got some new machines, so we can shuffle more around and
> also to try to debug this. We baseinstalled it using mfsbsd 10.3 pxe and
> configured it like always. Here interestingly one of the two nics
> entered the "no carrier" state, the other nic remained operational. This
> remained persistent across reboots.
>
> The issue disappears after many reboots but it's not conclusive. I've
> had 2 machines with which I could experiment.
>
> On one the problem disappeared on its own after a reboot (kernel
> 10.3-STABLE git hash d99ba5c aka r299900(?)).
>
> On the other one I pxe booted 10.1 live environment and then
> subsequently I booted into kernel 10.3-STABLE (git hash 3673260fc9 aka
> r308456(?)). But I don't think anything can be concluded from that. That
> was the machine which had both nics online after booting into the 10.3
> pxe environment but subsequently returned into no carrier state when
> booting 10.3 from disk.
>
> We also tried many sysctl flags (and many reboots), but without success.
> For example: hw.ix.enable_msix=0 and hw.ix.enable_msi=0
>
> At the moment I have no spare/empty machine in this state, we will empty
> one machine though which currently has this state (but is in production
> right now).
> I don't know how to trigger this state manually, which doesn't help for
> debugging.
>
> I could link references where others report similar issues, for example
> https://www.reddit.com/r/PFSENSE/comments/45bcuq/10_gig_woes/
> Here they suggest that the new intel nic driver 3.1.14 fixes it. Though
> I was not able to resolve the state by booting into a kernel with this
> driver.
>
> If I can provide any additional information please do not hesitate to ask.
>
> Any tips and suggestions for debugging are most welcome!
>
> With kind regards,
>
> Daniel
> _______________________________________________
> freebsd-stable@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe@freebsd.org"
>

This is a late follow-up, but could you file this as a bug on
bugs.freebsd.org?

- Eric